Towards Efficient AI for Science in Scalable and High Performance Distributed Systems
Authors
Ma, Xiaolong
Issue Date
2024
Type
Dissertation
Language
en_US
Keywords
AI for Science, AI Infrastructure, Artificial intelligence, High Performance Computing, Machine learning, Serverless
Abstract
Artificial intelligence (AI) has seen rapid development over the last few decades, significantly impacting domains such as computer vision and natural language processing. In recent years, machine learning methods have been increasingly applied to the scientific discovery process, accelerating advances in diverse fields; notable examples include AlphaFold, which predicts protein structures, and ClimateX, which enhances weather prediction capabilities. AI's ability to process large volumes of data, recognize complex patterns with precision, and uncover intricate relationships has made it an indispensable tool for innovation across scientific domains. Scientific research often demands extensive data processing and computation, typically carried out on high-performance computing (HPC) clusters or major cloud platforms such as AWS, Azure, and GCP. However, using these infrastructures efficiently to accelerate scientific discovery poses significant challenges. This dissertation addresses the efficient management of AI-driven science workloads on scalable, high-performance distributed systems. Motivated by the demands of large-scale machine learning applications and the need to handle large scientific datasets effectively, it develops novel frameworks for optimizing resource allocation on supercomputers, minimizing storage costs in the cloud, and harnessing scalable serverless resources for machine learning training. Additionally, the dissertation introduces an AI application for climate research that employs diffusion models for super-resolution and data assimilation to enhance climate prediction accuracy.

We first address challenges in high-performance computing by analyzing supercomputer clusters at DOE National Laboratories. Our findings indicate that roughly 10% of the node resources in these clusters, including major installations such as Aurora at Argonne National Laboratory (over 10,000 nodes) and Summit at Oak Ridge National Laboratory (over 4,000 nodes), remain unused by the main scheduler. Such underutilization represents a significant loss of computational potential. To reclaim it, we develop MalleTrain, a framework that efficiently uses these otherwise wasted dynamic resources for scalable data-parallel distributed deep learning training.

Next, we investigate AI-for-science challenges in cloud computing, specifically the emerging paradigm of serverless computing. We first examine serverless function instances as a substrate for data caching and introduce InfiniCache, an in-memory object caching system built on serverless functions. Compared to AWS ElastiCache, InfiniCache reduces the cost of large-object caching by 31 to 96 times without compromising performance. We then explore serverless computing for machine learning training and introduce SMLT, a user-centric framework that enables scalable, adaptive machine learning training on public cloud platforms using serverless technologies.

Finally, we introduce WindSR, a diffusion-based framework for wind speed super-resolution that integrates data assimilation directly into the diffusion-based super-resolution model.
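
To make the dynamic-resource idea concrete, the following is a minimal sketch, not the dissertation's MalleTrain code, of data-parallel training that simply joins whatever workers a launcher such as torchrun has started on currently idle nodes; the model and dataset arguments are placeholders.

# Minimal sketch of data-parallel training on a dynamically sized set of
# workers (illustrative only, not MalleTrain itself). Assumes one process per
# GPU launched by torchrun, which sets RANK, WORLD_SIZE, LOCAL_RANK, etc.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, dataset, epochs=1):
    dist.init_process_group("nccl")              # join whichever workers exist now
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(), device_ids=[local_rank])
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)                 # consistent reshuffle across ranks
        for x, y in loader:
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
            loss.backward()                      # gradients all-reduced across the current world size
            opt.step()
    dist.destroy_process_group()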
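
The serverless caching idea can likewise be illustrated with a small sketch. The handler below follows AWS Lambda's Python handler convention and keeps objects in the function instance's memory between warm invocations; it is an assumed illustration of a single cache node, not InfiniCache's actual implementation.

# Illustrative in-memory cache inside a serverless function (not InfiniCache).
# A warm function instance retains module-level state across invocations, so a
# client can shard large objects across many such functions.
_cache = {}  # survives across warm invocations of the same function instance

def handler(event, context):
    op, key = event["op"], event["key"]
    if op == "set":
        _cache[key] = event["value"]             # value: one encoded object chunk
        return {"status": "stored", "key": key}
    if op == "get":
        value = _cache.get(key)
        return {"status": "hit" if value is not None else "miss", "value": value}
    return {"status": "error", "reason": "unknown op"}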
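
Finally, a rough sketch of diffusion-based super-resolution with observations folded in as conditioning, in the spirit of WindSR; the denoiser network, the single-channel wind field, and the DDIM-style sampling update are illustrative assumptions rather than the dissertation's actual architecture or assimilation scheme.

# Conditional diffusion super-resolution sketch: the reverse process is
# conditioned on the upsampled low-resolution field plus assimilated
# observations. `denoiser` is a hypothetical noise-prediction network.
import torch

def super_resolve(denoiser, low_res, observations, alphas_cumprod, steps=1000):
    cond = torch.cat([torch.nn.functional.interpolate(low_res, scale_factor=4),
                      observations], dim=1)      # condition on LR field + observations
    x = torch.randn_like(cond[:, :1])            # start from pure noise on the HR grid
    for t in reversed(range(steps)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = denoiser(x, torch.tensor([t]), cond)            # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()        # estimate of the clean field
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps    # deterministic DDIM-style step
    return x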