Towards the Next Generation Highly Scalable Distributed Machine Learning

Authors

YI, JUN

Issue Date

2022

Type

Dissertation

Language

Keywords

Cloud deployment, Distributed training, Federated Learning, Gradient compression, Graph feature compression, Machine-Learning-as-a-Service

Research Projects

Organizational Units

Journal Issue

Alternative Title

Abstract

Large-scale machine learning is both resource- and time-consuming, which makes distributed training a promising approach to supporting it. Machine-Learning-as-a-Service (MLaaS), one of the next-generation computing platforms, enables practitioners and AI service providers to train and deploy ML models in the cloud using diverse and scalable compute resources. Federated Learning (FL), another promising next-generation computing platform, is a distributed machine learning technique that allows a model to be trained over data never directly seen by third parties, because training is performed in place with the data owners. Gradient compression is a promising general approach to alleviating the communication bottleneck in data-parallel Deep Neural Network (DNN) training by significantly reducing the volume of gradient data exchanged during synchronization. The emerging Graph Neural Networks (GNNs) usually have larger memory footprints than DNNs. Graph feature compression is a promising approach to accelerating GNN training by significantly reducing the memory footprint and PCIe bandwidth requirement so that GNNs can take full advantage of GPU computing capabilities. This dissertation aims to make distributed machine learning systems more scalable through both general approaches (such as gradient compression and feature compression) and approaches specific to next-generation platforms (such as MLaaS and FL).

A common problem for MLaaS users is choosing among a variety of training deployment options, notably scale-up (using more capable instances) and scale-out (using more instances), subject to budget limits and/or time constraints. State-of-the-art (SOTA) approaches employ analytical modeling to find the optimal deployment strategy. However, they have limited applicability because they must be tailored to specific ML model architectures, training frameworks, and hardware. To adapt quickly to the fast-evolving design of ML models and hardware infrastructure, we propose HeterBO, a new Bayesian Optimization (BO) based method for exploring the optimal deployment of training jobs. Unlike existing BO approaches for general applications, we account for heterogeneous exploration costs and incorporate machine-learning-specific priors to significantly improve search efficiency. This approach culminates in MLCD, a fully automated MLaaS training Cloud Deployment system driven by the highly efficient HeterBO search method.

While gradient compression is being actively adopted by industry (e.g., Facebook and AWS), our study reveals two critical but often overlooked challenges: 1) inefficient coordination between compression and communication during gradient synchronization incurs substantial overheads, and 2) developing, optimizing, and integrating gradient compression algorithms into DNN systems impose heavy burdens on DNN practitioners, and ad-hoc compression implementations often yield surprisingly poor system performance. To address these issues, we first propose CaSync, a compression-aware gradient synchronization architecture that relies on a flexible composition of basic computing and communication primitives. It is general, compatible with arbitrary gradient compression algorithms and gradient synchronization strategies, and enables high-performance computation-communication pipelining.
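To make the role of such compression primitives concrete, the following minimal sketch shows top-k gradient sparsification, a common compression primitive of the kind a synchronization architecture like CaSync would compose with communication steps. The function names and the 1% ratio are illustrative assumptions, not CaSync's or CompLL's actual API.

```python
# Illustrative sketch (not the dissertation's implementation): top-k gradient
# sparsification, which shrinks the data volume exchanged during synchronization.
import numpy as np

def topk_compress(grad: np.ndarray, ratio: float = 0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries."""
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the top-k entries
    return idx, flat[idx], grad.shape              # sparse message to transmit

def topk_decompress(idx, vals, shape):
    """Rebuild a dense gradient from the sparse (indices, values) message."""
    flat = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    flat[idx] = vals
    return flat.reshape(shape)

# Example: with ratio=0.01, the message carries ~1% of the values (plus indices),
# which is what makes overlapping compression with communication worthwhile.
g = np.random.randn(1024, 1024).astype(np.float32)
g_hat = topk_decompress(*topk_compress(g, ratio=0.01))
```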
We further introduce CompLL, a gradient compression toolkit that enables efficient development and automated integration of on-GPU compression algorithms into DNN systems with little programming burden. Finally, we build HiPress, a compression-aware DNN training framework that combines CaSync and CompLL.

Federated Learning (FL) is a new machine learning paradigm that enables models to be trained collaboratively across clients without sharing private data. In FL, data is non-uniformly distributed among clients (i.e., data heterogeneity) and can be neither balanced nor monitored as in conventional ML. Such data heterogeneity and privacy requirements pose unique challenges for learning-hyperparameter optimization: the training dynamics change across clients even within the same training round, and they are difficult to measure due to privacy constraints. State-of-the-art hyperparameter optimization works in FL either adopt a "global" tuning method that uses a single set of learning hyperparameters across all clients, or perform the hyperparameter optimization on the client devices, which imposes significant overheads and power consumption on the devices. To address this prohibitively expensive cost, we explore offloading hyperparameter customization to servers. We observe that hyperparameter customization improves FL model accuracy but incurs extensive cost on the client devices. To tackle this issue, we propose FedTune, a novel framework that offloads the expensive hyperparameter customization from the client devices to the central server without violating privacy constraints. To perform hyperparameter customization without accessing client data, FedTune introduces a proxy-data-based hyperparameter transfer approach that decides customized hyperparameters for client devices based on intrinsic metrics that capture data heterogeneity. Furthermore, to make the hyperparameter customization process scalable, FedTune employs a Bayesian-strengthened tuner that significantly accelerates the customization speed. Extensive evaluation demonstrates that FedTune achieves better accuracy than the widely adopted globally tuned method on the popular FL benchmarks FEMNIST, Cifar100, Cifar10, and Fashion-MNIST, while remaining scalable and reducing computation, memory, and energy consumption on the client devices, all without compromising privacy constraints.

Unlike DNNs (Deep Neural Networks), GNNs (Graph Neural Networks) usually have larger memory footprints, and thus GPU memory capacity and PCIe bandwidth are the main resource bottlenecks in GNN training. To address this problem, we present BiFeat, a graph feature quantization methodology that accelerates GNN training by significantly reducing the memory footprint and PCIe bandwidth requirement so that GNNs can take full advantage of GPU computing capabilities. Our key insight is that, unlike DNNs, GNNs are less prone to the information loss in input features caused by quantization. We identify the main factors affecting accuracy in graph feature quantization and theoretically prove that BiFeat training converges to a network whose loss is within ϵ of the optimal loss of the uncompressed network. We perform an extensive evaluation of BiFeat using several popular GNN models and datasets, including GraphSAGE on MAG240M, the largest public graph dataset.
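To illustrate the flavor of graph feature quantization, the following minimal sketch applies per-column uniform int8 quantization to a node feature matrix before it is transferred to the GPU. This is a generic stand-in under assumed function names, not BiFeat's actual algorithm or API.

```python
# Illustrative sketch (not BiFeat's implementation): per-feature int8 quantization
# of node features, shrinking the feature matrix ~4x so more of it fits in GPU
# memory and less data crosses PCIe during GNN mini-batch training.
import numpy as np

def quantize_features(x: np.ndarray, num_bits: int = 8):
    """Uniformly quantize each feature column to signed `num_bits`-bit integers."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max(axis=0, keepdims=True) / qmax   # per-column scale factor
    scale[scale == 0] = 1.0                                # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_features(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float features (typically done on the GPU)."""
    return q.astype(np.float32) * scale

# Tiny demo matrix; at scale, 1M nodes x 256 float32 features (~1 GB) would shrink
# to ~256 MB as int8 plus a small per-column scale vector.
x = np.random.randn(1000, 256).astype(np.float32)
q, s = quantize_features(x)
x_hat = dequantize_features(q, s)
```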

Description

Citation

Publisher

License

Creative Commons Attribution-ShareAlike 4.0 United States

Journal

Volume

Issue

PubMed ID

DOI

ISSN

EISSN