Session: Distributed Learning for Machine Learning
Chair: Hoi-To Wai
Cluster: Multi-agent Optimization and Games
Talk 1: Why does batch normalization damage federated learning on non-i.i.d. data?
Speaker: Yanmeng Wang
Abstract: As a promising distributed learning paradigm, federated learning (FL) involves training deep neural network (DNN) models at the network edge while protecting the privacy of the edge clients. To train a large-scale DNN model, batch normalization (BN) has been regarded as a simple and effective means to accelerate training and improve generalization. However, recent findings indicate that BN can significantly impair the performance of FL in the presence of non-i.i.d. data. While several FL algorithms have been proposed to address this issue, their performance still falls significantly short of the centralized scheme. Furthermore, none of them has provided a theoretical explanation of how BN damages FL convergence. In this work, we present the first convergence analysis to show that, under non-i.i.d. data, the mismatch between the local and global statistical parameters in BN causes a gradient deviation between the local and global models, which in turn slows down and biases FL convergence. In view of this, we develop a new FL algorithm tailored to BN, called FedTAN, which achieves robust FL performance under a variety of data distributions via iterative layer-wise parameter aggregation. Comprehensive experimental results demonstrate the superiority of the proposed FedTAN over existing baselines for training BN-based DNN models.
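The statistic-mismatch mechanism can be illustrated with a short Python sketch, assuming a single BN layer for simplicity. Clients compute local BN statistics, the server combines them into the global statistics via the law of total variance, and every client normalizes with the global values. All names here (Client, aggregate_bn_stats, forward_one_bn_layer) are hypothetical, and the actual FedTAN protocol involves additional aggregation rounds for backward-pass quantities.

```python
# Minimal sketch of layer-wise BN statistic aggregation in the spirit of
# FedTAN. Hypothetical names; the real protocol has further rounds.
import numpy as np

class Client:
    def __init__(self, data):
        self.data = data  # local (non-i.i.d.) mini-batch, shape (n_i, d)

    def local_bn_stats(self, x):
        # Per-feature batch mean and variance computed on local data only.
        return x.mean(axis=0), x.var(axis=0), x.shape[0]

def aggregate_bn_stats(stats):
    # Server combines local statistics into the exact global mean/variance,
    # weighting by local batch sizes (law of total variance).
    means, variances, counts = zip(*stats)
    n = np.array(counts, dtype=float)
    w = n / n.sum()
    mu = sum(wi * mi for wi, mi in zip(w, means))
    var = sum(wi * (vi + (mi - mu) ** 2)
              for wi, vi, mi in zip(w, variances, means))
    return mu, var

def forward_one_bn_layer(clients, activations, eps=1e-5):
    # Unlike plain FedAvg, clients exchange this layer's statistics and then
    # normalize with the aggregated global values, so local models see the
    # same normalization as the global model would.
    stats = [c.local_bn_stats(a) for c, a in zip(clients, activations)]
    mu, var = aggregate_bn_stats(stats)
    return [(a - mu) / np.sqrt(var + eps) for a in activations]
```

Weighting by batch size and adding the between-client spread of the means recovers exactly the statistics a centralized batch would produce; the gap between these and the purely local statistics is the mismatch the abstract identifies under non-i.i.d. data.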
Talk 2: Fast Two-Time-Scale Stochastic Gradient Method with Applications in Reinforcement Learning
Speaker: Thinh Doan
Abstract: Two-time-scale optimization is a framework that abstracts a range of policy evaluation and policy optimization problems in reinforcement learning (RL). Akin to bi-level optimization under a particular type of stochastic oracle, the two-time-scale optimization framework has an upper-level objective whose gradient evaluation depends on the solution of a lower-level problem, which is to find the root of a strongly monotone operator. In this work, we propose a new method for solving two-time-scale optimization that achieves significantly faster convergence than prior methods. The key idea of our approach is to leverage an averaging step to improve the estimates of the operators at both the lower and upper levels before using them to update the decision variables. These additional averaging steps eliminate the direct coupling between the main variables, enabling the accelerated performance of our algorithm. We characterize the finite-time convergence rates of the proposed algorithm under various conditions on the underlying objective function, including strong convexity, convexity, the Polyak-Łojasiewicz condition, and general non-convexity. These rates significantly improve upon the best-known complexity of the standard two-time-scale stochastic approximation algorithm. When applied to RL, we show how the proposed algorithm specializes to novel online sample-based methods that surpass or match the performance of the existing state of the art. Finally, we support our theoretical results with numerical simulations in RL.
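To make the averaging idea concrete, here is a minimal Python sketch of a two-time-scale update with the extra averaging steps the abstract describes: noisy estimates of the lower-level operator and the upper-level gradient are smoothed by running averages before the decision variables are updated. The oracle names sample_H / sample_G and the constant step sizes are placeholders, not the paper's exact schedules.

```python
# Minimal sketch of a fast two-time-scale stochastic update with operator
# averaging. Oracles and step sizes are illustrative assumptions.
import numpy as np

def fast_two_time_scale(sample_H, sample_G, x0, y0, T,
                        alpha=0.1, beta=0.05, lam=0.2):
    """sample_H(x, y): noisy lower-level operator whose root (in y) is tracked.
       sample_G(x, y): noisy upper-level gradient evaluated at (x, y)."""
    x, y = x0.copy(), y0.copy()
    h_bar = sample_H(x, y)  # running average of the lower-level operator
    g_bar = sample_G(x, y)  # running average of the upper-level gradient
    for _ in range(T):
        # Averaging step: smooth the noisy operator estimates first.
        h_bar = (1 - lam) * h_bar + lam * sample_H(x, y)
        g_bar = (1 - lam) * g_bar + lam * sample_G(x, y)
        # Decision variables are updated with the averaged estimates, which
        # removes the direct coupling between the two updates.
        y = y - beta * h_bar   # fast time scale: track the root of H(x, .)
        x = x - alpha * g_bar  # slow time scale: descend the upper objective
    return x, y
```

The design point is that each decision variable sees a low-variance, averaged estimate rather than a raw sample, which is what allows the two recursions to be analyzed (and run) with less restrictive step-size coupling.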
Talk 3: Two-timescale Stochastic Primal-dual Algorithms for Decentralized Optimization with Communication Compression
Speaker: Hoi-To Wai
Abstract: This talk presents a two-timescale compressed primal-dual (TiCoPD) algorithm for decentralized optimization. The algorithm is built upon a primal-dual optimization framework derived via a majorization-minimization procedure, which naturally suggests that agents share a compressed difference term at each iteration. Furthermore, the TiCoPD algorithm incorporates a fast-timescale mirror sequence for agent consensus on noisy, nonlinearly compressed terms, together with a slow-timescale primal-dual recursion for optimizing the objective function. We show that the TiCoPD algorithm converges with a constant step size and finds an O(1/√T)-stationary solution after T iterations. Numerical experiments on decentralized training of a neural network validate the efficacy of the TiCoPD algorithm.
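The compressed-difference mechanism can be sketched in Python as follows: each agent broadcasts only a compressed difference between its iterate and a public surrogate; a fast-timescale mirror sequence tracks consensus from these compressed messages, while a slow-timescale primal-dual recursion updates the iterates and dual variables. All identifiers (top_k, ticopd_like) and step sizes are illustrative, not the paper's exact recursion.

```python
# Minimal sketch of the compressed-difference idea behind TiCoPD.
# top_k is a standard contractive compressor; x_hat is the mirror sequence
# of publicly known surrogates reconstructed from compressed messages.
import numpy as np

def top_k(v, k):
    # Keep the k largest-magnitude entries; a common contractive compressor.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ticopd_like(grads, W, x0, T, k=2, eta=0.05, gamma=0.5, rho=0.1):
    """grads[i](x): local stochastic gradient of agent i.
       W: doubly stochastic mixing matrix over the network."""
    n, d = x0.shape
    x = x0.copy()            # primal iterates, one row per agent
    x_hat = x0.copy()        # mirror sequence: publicly known surrogates
    lam = np.zeros((n, d))   # dual variables enforcing consensus
    for _ in range(T):
        # Agents broadcast only the compressed difference to their surrogate,
        # so per-iteration communication stays cheap.
        q = np.vstack([top_k(x[i] - x_hat[i], k) for i in range(n)])
        x_hat = x_hat + gamma * q          # fast timescale: consensus tracking
        mixed = W @ x_hat                  # neighbor averaging on surrogates
        lam = lam + rho * (x_hat - mixed)  # dual ascent on consensus violation
        for i in range(n):                 # slow timescale: primal descent
            x[i] = x[i] - eta * (grads[i](x[i]) + lam[i]
                                 + (x_hat[i] - mixed[i]))
    return x.mean(axis=0)
```

Separating the surrogate-tracking step (fast timescale) from the primal-dual recursion (slow timescale) is what lets the scheme tolerate the noise and nonlinearity of the compressor while keeping a constant step size.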