I/O Throughput Prediction for HPC Applications Using Darshan Logs
Authors
Gabriel, David James
Issue Date
2022
Type
Thesis
Keywords
High Performance Computing, I/O, Machine Learning
Abstract
Because most High Performance Computing (HPC) applications process large volumes of data, I/O performance is critical to overall application performance. Even with large-scale, high-performance parallel file systems, many applications still suffer from poor I/O performance. Existing system monitoring tools gather performance statistics, but the resulting multidimensional data is difficult to interpret, which makes it hard to distinguish normal behavior from abnormal behavior. It is therefore important to derive models that can process the I/O statistics these tools collect. In this thesis, I develop machine learning (ML) models that process file system statistics reported by the Darshan monitoring tool to predict the I/O throughput of HPC applications; the prediction can then be compared against the observed I/O throughput to identify performance issues. Using Darshan logs from the Blue Waters supercomputer, I trained several ML models, including Decision Tree, Random Forest, Gradient Boosting Tree, and Deep Neural Network (DNN) models, with different feature scaling methods. The DNN model outperformed the others, estimating the throughput of I/O operations to within a 16 MB/s range. I believe this work makes an important contribution to the field by deriving accurate models that process Darshan logs to detect file system performance anomalies (e.g., an overloaded metadata server or high resource interference) so that they can be addressed in a timely manner to minimize interruptions.
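The comparison step described in the abstract (predicted versus observed throughput) can be illustrated with a minimal sketch. This is not the thesis's implementation: the `JobThroughput` record, the `flag_underperforming` function, the use of 16 MB/s as a flagging tolerance, and the sample values are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobThroughput:
    """Hypothetical record pairing a model prediction with a measurement."""
    job_id: str
    predicted_mb_s: float  # throughput estimated by an ML model
    observed_mb_s: float   # throughput actually measured for the job


def flag_underperforming(jobs, tolerance_mb_s=16.0):
    """Return jobs whose observed throughput is below the prediction
    by more than the tolerance (assumed here to be 16 MB/s)."""
    return [
        job for job in jobs
        if job.predicted_mb_s - job.observed_mb_s > tolerance_mb_s
    ]


# Made-up sample values, not taken from any Darshan log.
jobs = [
    JobThroughput("job-a", predicted_mb_s=120.0, observed_mb_s=115.0),
    JobThroughput("job-b", predicted_mb_s=300.0, observed_mb_s=140.0),
    JobThroughput("job-c", predicted_mb_s=80.0, observed_mb_s=95.0),
]

flagged = flag_underperforming(jobs)
print([job.job_id for job in flagged])  # prints ['job-b']
```

Only jobs that fall short of the prediction are flagged; a job that runs faster than predicted, such as `job-c` above, is not treated as an anomaly in this sketch.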
License
Creative Commons Attribution 4.0 United States