Distributed Machine Learning in Heterogeneous Edge Networks

Authors

Sajjadi Mohammadabadi, Seyed Mahmoud

Issue Date

2025

Type

Dissertation

Language

en_US

Abstract

The proliferation of edge devices (e.g., smartphones, IoT sensors, and wearables) has led to a surge in data generated at the network's edge. Meanwhile, the complexity of machine learning models has increased significantly, with state-of-the-art models for tasks like natural language processing and computer vision now containing billions of parameters. Distributed machine learning addresses the challenges posed by massive datasets and high computational demands by distributing the training process across multiple devices, enabling parallel computation and reducing the overall computational burden. However, distributed machine learning poses unique challenges of its own, including the straggler problem (slow devices hindering training progress), communication overhead (the high cost of transferring data between devices), and device heterogeneity (variability in device capabilities). This dissertation explores the challenges of distributed machine learning in heterogeneous edge networks, focusing on the design and optimization of algorithms that accelerate training while enhancing resource efficiency. Its key contributions are two algorithms: Dynamic Tiering-based Federated Learning (DTFL) and Communication-Efficient Decentralized Multi-Agent Learning (ComDML).

DTFL is a federated learning algorithm designed to address the inherent heterogeneity of edge networks, where devices vary widely in computational power, communication bandwidth, and task size. By dynamically assigning clients to tiers based on their capabilities, DTFL mitigates the straggler problem and accelerates training. Clients in each tier offload portions of the global model to a central server, enabling parallel updates through split learning and local-loss-based training. A dynamic tier scheduler continuously profiles clients, estimates their training times from observed resource metrics such as network speed and dataset size, and adjusts tier assignments accordingly. This low-overhead approach prevents stragglers and ensures efficient resource utilization in dynamic environments.

Extensive experiments with DTFL on large models such as ResNet-56 and ResNet-110, using the CIFAR-10, CIFAR-100, CINIC-10, and HAM10000 datasets, validate its efficacy. Results demonstrate up to an 80% reduction in training time compared to advanced federated learning methods while maintaining model accuracy. DTFL reduces training time in both IID and non-IID data settings and maintains high performance even under privacy-preserving measures. Theoretical analysis further establishes its convergence for both convex and non-convex loss functions.
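To make the tier-scheduling idea concrete, the following is a minimal sketch of how such a scheduler might estimate per-client round times from profiled metrics and bucket clients into tiers. It is not the dissertation's actual implementation: the field names, the linear cost model, and the equal-size bucketing heuristic are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ClientProfile:
    client_id: int
    samples: int             # local dataset size
    flops_per_sample: float  # estimated compute cost per sample
    compute_speed: float     # device throughput (FLOP/s), profiled
    bandwidth: float         # link speed (bytes/s), profiled
    activation_bytes: float  # bytes sent per sample at the current split point

def estimate_round_time(p: ClientProfile) -> float:
    """Estimate one round's duration as local compute time plus the
    time to ship split-layer activations to the server."""
    compute_t = p.samples * p.flops_per_sample / p.compute_speed
    comm_t = p.samples * p.activation_bytes / p.bandwidth
    return compute_t + comm_t

def assign_tiers(profiles, num_tiers=4):
    """Sort clients by estimated round time and bucket them into tiers.
    Slower clients land in higher tiers, which offload more of the
    model to the server (a simple equal-size bucketing heuristic)."""
    ranked = sorted(profiles, key=estimate_round_time)
    per_tier = max(1, len(ranked) // num_tiers)
    tiers = {}
    for rank, p in enumerate(ranked):
        tiers[p.client_id] = min(rank // per_tier, num_tiers - 1)
    return tiers
```

Re-running `assign_tiers` each round on freshly profiled metrics is what keeps the assignment adaptive as network conditions change.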
ComDML extends distributed machine learning to decentralized scenarios, eliminating the need for a central aggregator. Operating in peer-to-peer configurations enhances resilience and security by removing single points of failure. In heterogeneous networks with varying computational and communication resources, ComDML optimizes training efficiency by enabling slower agents to offload work to faster ones, ensuring a more balanced distribution of workloads. A decentralized pairing scheduler dynamically matches agents based on their real-time capabilities, using lightweight profiling to assess the communication overhead, computation capacity, and task size of each candidate pairing, thereby reducing idle time for faster agents.

Through local-loss-based split training, in which paired agents train different portions of a model concurrently, ComDML mitigates the synchronization bottlenecks typical of traditional split learning. To further improve communication efficiency during workload offloading, ComDML incorporates SplitPair, a technique that compresses the intermediate feature maps exchanged between paired agents using singular value decomposition (SVD). SplitPair applies a dynamic rank-adjustment mechanism that starts with a low SVD rank for early communication savings and increases the rank only when model accuracy plateaus, balancing communication cost against model fidelity throughout training.

Experimental results show significant improvements, with training time reduced by up to 71% while model accuracy is maintained. ComDML also integrates seamlessly with privacy-preserving techniques, such as differential privacy, without substantial performance loss, demonstrating its robustness in dynamic edge-network environments.
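As a rough illustration of the pairing idea, a scheduler can greedily match the slowest agents with the fastest ones whenever the estimated paired round time beats training alone. This is a sketch under assumed cost models, not the dissertation's algorithm; the profiling fields, the 50% default offload fraction, and the greedy matching rule are hypothetical.

```python
def paired_time(slow, fast, offload_frac=0.5):
    """Estimated round time if `slow` offloads `offload_frac` of its
    compute to `fast`: the pair is gated by the slower of the two
    sides plus the cost of shipping intermediate activations."""
    slow_t = (1 - offload_frac) * slow["work"] / slow["speed"]
    fast_t = (fast["work"] + offload_frac * slow["work"]) / fast["speed"]
    comm_t = slow["activation_bytes"] / min(slow["bandwidth"], fast["bandwidth"])
    return max(slow_t, fast_t) + comm_t

def pair_agents(agents):
    """Greedy matcher: rank agents by solo round time, then pair the
    slowest remaining agent with the fastest one if offloading helps."""
    solo = lambda a: a["work"] / a["speed"]
    ranked = sorted(agents, key=solo)  # fastest first
    pairs, i, j = [], 0, len(ranked) - 1
    while i < j:
        fast, slow = ranked[i], ranked[j]
        if paired_time(slow, fast) < solo(slow):
            pairs.append((slow["id"], fast["id"]))
            i += 1  # fastest agent is now busy helping
        j -= 1      # slowest agent handled (paired or left solo)
    return pairs
```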
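The SVD compression step itself can be sketched in a few lines of NumPy. The rank schedule below, which starts low and raises the rank when a validation-accuracy plateau is detected, is an illustrative stand-in for SplitPair's dynamic rank adjustment; all names and thresholds are assumptions.

```python
import numpy as np

def compress(features: np.ndarray, rank: int):
    """Truncated SVD of a (batch, dim) feature-map matrix: keeping the
    top-`rank` singular triplets means the pair exchanges roughly
    rank * (batch + dim + 1) numbers instead of batch * dim."""
    U, s, Vt = np.linalg.svd(features, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank, :]

def decompress(U, s, Vt) -> np.ndarray:
    """Reconstruct the (lossy) feature map on the receiving agent."""
    return (U * s) @ Vt

class RankScheduler:
    """Start with a low rank; raise it only when accuracy plateaus."""
    def __init__(self, rank=4, max_rank=64, patience=3, step=2):
        self.rank, self.max_rank = rank, max_rank
        self.patience, self.step = patience, step
        self.best_acc, self.stale = 0.0, 0

    def update(self, val_acc: float) -> int:
        if val_acc > self.best_acc + 1e-3:   # still improving
            self.best_acc, self.stale = val_acc, 0
        else:                                # plateau detected
            self.stale += 1
            if self.stale >= self.patience and self.rank < self.max_rank:
                self.rank = min(self.rank * self.step, self.max_rank)
                self.stale = 0  # give the higher rank time to help
        return self.rank
```

The low initial rank front-loads the communication savings when the model is far from converged and can tolerate lossy activations, while the plateau trigger restores fidelity only when it is needed.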
