Scalable and Efficient Machine Learning as a Service

Authors

Qin, Heyang

Issue Date

2022

Type

Dissertation

Abstract

Driven by the sustained advances of machine learning and its application to domains ranging from image recognition and text prediction to translation and autonomous driving, the past few years have witnessed a surging demand for Machine-Learning-as-a-Service (MLaaS). MLaaS is an emerging computing paradigm that facilitates machine learning model design, training, and inference serving, and provides optimized execution of machine learning tasks in an automated, scalable, and efficient manner. This dissertation proposes three novel approaches, SimiGrad, DistQuant, and RRL, to improve the scale and efficiency of MLaaS training and inference.

For MLaaS training, we propose SimiGrad, a fine-grained adaptive batching approach for large-scale training based on gradient similarity measurement. Large-scale training requires massive parallelism to finish within a reasonable amount of time. Large-batch training is the key enabler of such parallelism, but it often comes at the cost of generalization performance. We propose a fully automated and lightweight adaptive batching methodology that enables fine-grained batch size adaptation (e.g., at the mini-batch level) and achieves state-of-the-art performance with record-breaking batch sizes. The core component of our method is a lightweight yet efficient representation of the critical gradient noise information. We open-source the proposed methodology, and extensive evaluations on popular benchmarks (e.g., CIFAR10, ImageNet, and BERT-Large) demonstrate that it outperforms state-of-the-art adaptive batching approaches and hand-tuned static strategies in both model performance and batch size. In particular, we achieve a new state-of-the-art batch size of 78K in BERT-Large pretraining with a SQuAD score of 90.69, compared to 90.58 reported by the previous state of the art at a 59K batch size.

Another key challenge for MLaaS training is the communication cost, which limits how far training can scale. Quantization is a popular method for reducing communication cost, yet it imposes non-trivial encoding and decoding overheads and may degrade model performance. Our key observation is that model weights are partitioned and cached in GPU memory in common distributed training methods such as model and pipeline parallelism. If quantization is performed on the partitioned weights in parallel while they are cached in GPU memory, quantization speed can be significantly improved and the communication overhead of weight gathering can be further reduced. To this end, we propose DistQuant, a distributed quantization scheme for compressing partitioned weights during distributed training. DistQuant preserves model performance by canceling out the noise introduced by quantization and is transparent to training pipelines. We show both theoretically and empirically that DistQuant achieves much higher precision than state-of-the-art quantization approaches. Evaluations on large-scale models including BERT and GPT-2 indicate that DistQuant cuts the communication cost of MLaaS training in half without compromising model performance.
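
To illustrate the partitioned quantize-then-gather idea, the following is a minimal sketch in Python, assuming PyTorch with torch.distributed already initialized and weights sharded along their first dimension. The symmetric int8 scheme and the function names are illustrative assumptions, not DistQuant's actual implementation, which additionally cancels quantization noise to preserve model performance.

# Illustrative sketch only: each rank quantizes the weight shard it already
# caches in GPU memory, in parallel, then the small int8 shards are gathered
# and dequantized locally.
import torch
import torch.distributed as dist

def quantize_shard(shard: torch.Tensor):
    """Symmetric int8 quantization of a local weight shard (assumed scheme)."""
    scale = (shard.abs().max().clamp(min=1e-8) / 127.0).reshape(1)
    q = torch.round(shard / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def gather_full_weights(local_shard: torch.Tensor, world_size: int) -> torch.Tensor:
    """All-gather int8 shards instead of fp16 shards to cut gathering traffic."""
    q, scale = quantize_shard(local_shard)

    q_list = [torch.empty_like(q) for _ in range(world_size)]
    s_list = [torch.empty_like(scale) for _ in range(world_size)]
    dist.all_gather(q_list, q)        # one byte per element on the wire
    dist.all_gather(s_list, scale)    # plus one scalar scale per shard

    # Dequantize locally and reassemble the full weight tensor.
    shards = [qi.to(local_shard.dtype) * si for qi, si in zip(q_list, s_list)]
    return torch.cat(shards)

Because each rank encodes only its own cached shard, the quantization work is parallelized across GPUs, and the all-gather moves one byte per element instead of two for fp16 weights, which is where a roughly 2x reduction in weight-gathering traffic comes from.
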
For MLaaS serving, we propose RRL, a swift machine learning model serving scheduling framework powered by a novel region-based reinforcement learning approach. To meet latency Service Level Objectives (SLOs), judicious parallelization at both the request and operation levels is critically important. However, existing ML systems (e.g., TensorFlow) and cloud ML serving platforms (e.g., SageMaker) are SLO-agnostic and rely on users to configure the parallelism manually. RRL efficiently identifies the optimal parallelism configuration under different workloads by estimating the performance of unexplored configurations from that of similar, already measured ones. We show both theoretically and experimentally that RRL outperforms state-of-the-art approaches, finding near-optimal configurations over 8 times faster while reducing inference latency by up to 79.0% and SLO violations by up to 49.9%.
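
To make the "estimate similar configurations from known ones" intuition concrete, here is a heavily simplified, hypothetical sketch in Python. The two-dimensional configuration space, the synthetic latency function, and the greedy regional-estimation loop are all illustrative assumptions; the dissertation's RRL formulates the problem as reinforcement learning over regions of the configuration space rather than the greedy search shown here.

# Illustrative sketch only: a measured configuration's latency also informs
# the estimate for every unmeasured configuration in its region, so the
# search avoids sampling the whole space uniformly.
import random
from itertools import product

# Hypothetical space: (request-level parallelism, operation-level parallelism).
CONFIGS = list(product(range(1, 9), range(1, 9)))

def measure_latency(cfg):
    """Stand-in for serving a workload under `cfg` and timing it.
    Uses a synthetic latency surface purely for illustration."""
    req_p, op_p = cfg
    return abs(req_p - 4) + abs(op_p - 6) + 0.1 * random.random()

def region(cfg, radius=1):
    """Configurations in the same region: within `radius` of `cfg` per dimension."""
    return [c for c in CONFIGS
            if abs(c[0] - cfg[0]) <= radius and abs(c[1] - cfg[1]) <= radius]

def region_based_search(budget=10):
    """Greedy regional-estimation loop over the configuration space."""
    measured = {}
    cfg = random.choice(CONFIGS)
    for _ in range(budget):
        measured[cfg] = measure_latency(cfg)
        unmeasured = [c for c in CONFIGS if c not in measured]
        if not unmeasured:
            break
        def estimate(c):
            nearby = [measured[n] for n in region(c) if n in measured]
            return min(nearby) if nearby else float("inf")
        cfg = min(unmeasured, key=estimate)  # probe the most promising region next
    return min(measured, key=measured.get)

if __name__ == "__main__":
    print("best configuration found:", region_based_search(budget=12))

The sketch only illustrates why sharing measurements across similar configurations shrinks the search; RRL's reinforcement learning formulation is what lets it adapt the parallelism configuration across changing workloads.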

License

Creative Commons Attribution 4.0 United States
