I/O Throughput Prediction for HPC Applications Using Darshan Logs

Authors

Gabriel, David James

Issue Date

2022

Type

Thesis

Keywords

High Performance Computing, I/O, Machine Learning

Abstract

Because most High Performance Computing (HPC) applications process large volumes of data, I/O performance is critical to overall application performance. Despite running on large-scale, high-performance parallel file systems, many applications still suffer from poor I/O performance. Although existing system monitoring tools gather performance statistics, the resulting data is multidimensional, which makes it difficult to interpret and, in turn, to distinguish normal behavior from abnormal behavior. It is therefore important to derive models that can process the I/O statistics gathered by existing monitoring tools. In this thesis, I develop machine learning (ML) models that process file system statistics reported by the Darshan monitoring tool to predict the I/O throughput of HPC applications; the predicted throughput can then be compared against the observed I/O throughput to identify performance issues. Using Darshan logs from the Blue Waters supercomputer, I trained several ML models, including Decision Tree, Random Forest, Gradient Boosting Tree, and Deep Neural Network (DNN) models, with different feature scaling methods. I found that the DNN model outperformed the other models, estimating the throughput of I/O operations to within 16 MB/s. I believe this work makes an important contribution to the field by deriving accurate models that process Darshan logs to detect file system performance anomalies (e.g., an overloaded metadata server or heavy resource interference), so that these anomalies can be addressed promptly to minimize interruptions.
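The predict-then-compare approach described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the `JobRecord` type and every function name are hypothetical; observed throughput is assumed to be total bytes moved divided by time spent in POSIX read, write, and metadata operations (summed from Darshan's `POSIX_BYTES_READ`, `POSIX_BYTES_WRITTEN`, `POSIX_F_READ_TIME`, `POSIX_F_WRITE_TIME`, and `POSIX_F_META_TIME` counters, which ignores overlap across ranks); min-max, z-score, and log scaling stand in for the unspecified feature scaling methods; and the 16 MB/s figure is reused only as an illustrative alerting tolerance. The trained model itself (Decision Tree, Random Forest, Gradient Boosting Tree, or DNN) is left out, and `predicted` stands for its output.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev

MB = 1024 * 1024  # assumption: "MB" is treated as 2**20 bytes


@dataclass
class JobRecord:
    """Hypothetical per-job totals of Darshan POSIX module counters."""
    bytes_read: int     # sum of POSIX_BYTES_READ over the job's file records
    bytes_written: int  # sum of POSIX_BYTES_WRITTEN
    read_time: float    # sum of POSIX_F_READ_TIME, in seconds
    write_time: float   # sum of POSIX_F_WRITE_TIME, in seconds
    meta_time: float    # sum of POSIX_F_META_TIME, in seconds


def observed_throughput_mbs(job: JobRecord) -> float:
    """Assumed definition: bytes moved divided by time spent in I/O, in MB/s."""
    io_time = job.read_time + job.write_time + job.meta_time
    if io_time <= 0:
        return 0.0
    return (job.bytes_read + job.bytes_written) / MB / io_time


def min_max_scale(column: list[float]) -> list[float]:
    """Map values linearly onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [0.0 if span == 0 else (x - lo) / span for x in column]


def z_score_scale(column: list[float]) -> list[float]:
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [0.0 if sigma == 0 else (x - mu) / sigma for x in column]


def log_scale(column: list[float]) -> list[float]:
    """Compress heavy-tailed, non-negative counters such as byte counts."""
    return [math.log1p(x) for x in column]


def flag_anomalies(observed: list[float], predicted: list[float],
                   tolerance_mbs: float = 16.0) -> list[int]:
    """Return indices of jobs whose observed throughput falls more than
    tolerance_mbs below the model's predicted throughput."""
    return [i for i, (obs, pred) in enumerate(zip(observed, predicted))
            if pred - obs > tolerance_mbs]


if __name__ == "__main__":
    jobs = [
        JobRecord(64 * MB, 64 * MB, 1.0, 1.0, 0.0),  # 128 MB in 2 s -> 64 MB/s
        JobRecord(32 * MB, 0, 2.0, 0.0, 2.0),        # 32 MB in 4 s -> 8 MB/s
    ]
    observed = [observed_throughput_mbs(j) for j in jobs]
    # Stand-in predictions; in practice these would come from the trained model.
    predicted = [70.0, 40.0]
    print(flag_anomalies(observed, predicted))  # prints [1]
```

Flagging only jobs that fall below the prediction reflects the assumption that a file system problem shows up as lower-than-expected throughput; an absolute-difference test would also catch unexpectedly fast jobs, which more likely point to a mispredicting model than to a file system issue.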

License

Creative Commons Attribution 4.0 United States
