Session: Adaptive Stochastic Gradient Methods
Chair: Lin Xiao
Cluster: Nonlinear Optimization
Talk 1: The Road Less Scheduled
Speaker: Aaron Defazio
Abstract: Schedule-Free learning algorithms allow models to be trained in an any-time fashion without compromising on speed, memory, or final test metrics. I will dive into the details of how Schedule-Free learning works, show how it provides further quality-of-life improvements for practitioners, and describe our winning entry to the AlgoPerf algorithmic efficiency optimization challenge, which used Schedule-Free AdamW.
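For attendees unfamiliar with the method, a minimal sketch of a schedule-free SGD step is shown below. This is an illustrative reading of the general idea (evaluate the gradient at an interpolation between a base iterate and a running average, then update the average with uniform weights); the actual Schedule-Free AdamW used in the winning entry adds momentum-style and Adam-style machinery not shown here, and the exact coefficients are assumptions.

```python
def schedule_free_sgd(grad, x0, lr=0.1, beta=0.9, steps=100):
    """Illustrative schedule-free SGD sketch (not the authors' exact code).

    grad: function mapping a point (list of floats) to its gradient.
    x0:   starting point.
    """
    z = list(x0)  # base iterate, updated by plain gradient steps
    x = list(x0)  # running average; the point returned to the user
    for t in range(1, steps + 1):
        # Gradient is evaluated at an interpolation of x and z.
        y = [(1 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        c = 1.0 / t  # uniform averaging weight, no decaying schedule
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x
```

Note that no learning-rate schedule appears anywhere: the averaging weight `c = 1/t` plays the role a decaying schedule would normally play, which is what makes the iterate usable at any stopping time.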
Talk 2: Analyzing AdaGrad Under Anisotropic Smoothness Assumptions
Speaker: Yuxing Liu
Abstract: Adaptive gradient methods have demonstrated remarkable success in training large-scale deep neural networks. However, the theoretical understanding of these methods, particularly in the large-batch regime commonly used in practice, remains limited. In this talk, we aim to address this gap by introducing a generalized anisotropic smoothness assumption that better reflects the behavior of modern neural network training. Our theoretical analysis reveals that AdaGrad achieves provably faster convergence than standard gradient methods, even when large batch sizes are employed. These results provide valuable theoretical insights into the practical efficacy of adaptive gradient methods.
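For reference, the diagonal AdaGrad update analyzed in this line of work scales each coordinate by the accumulated squared gradients, which is what lets it adapt to anisotropy across coordinates. A minimal sketch (standard AdaGrad, Duchi et al., 2011; not code from the talk):

```python
import math

def adagrad(grad, x0, lr=1.0, eps=1e-8, steps=100):
    """Minimal diagonal AdaGrad sketch.

    Each coordinate gets its own effective step size
    lr / sqrt(sum of that coordinate's past squared gradients).
    """
    x = list(x0)
    s = [0.0] * len(x0)  # per-coordinate running sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        for i, gi in enumerate(g):
            s[i] += gi * gi
            x[i] -= lr * gi / (math.sqrt(s[i]) + eps)
    return x
```

Because `s[i]` grows faster along coordinates with large gradients, steep directions are automatically damped while flat directions keep larger steps; an anisotropic smoothness assumption makes this per-coordinate behavior visible in the convergence rate.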
Talk 3: A Novel Approach to Loss Landscape Characterization without Over-Parametrization
Speaker: Antonio Orvieto
Abstract: Modern machine learning heavily depends on the effectiveness of optimization techniques. While deep learning models have achieved remarkable empirical results in training, their theoretical underpinnings remain somewhat elusive. Ensuring the convergence of optimization methods requires imposing specific structures on the objective function, which often do not hold in practice. One prominent example is the widely recognized Polyak-Łojasiewicz (PL) inequality, which has garnered considerable attention in recent years. However, validating such assumptions for deep neural networks entails substantial and often impractical levels of over-parametrization. To address this limitation, we propose a novel class of functions that can characterize the loss landscape of modern deep models without requiring extensive over-parametrization and can also include saddle points. Crucially, we prove that gradient-based optimizers possess theoretical guarantees of convergence under this assumption. Finally, we validate the soundness of our assumption through both theoretical analysis and empirical experimentation across a diverse range of deep learning models.
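For context, the PL inequality mentioned in the abstract requires that, for some $\mu > 0$, the squared gradient norm dominates the suboptimality gap at every point:

```latex
\|\nabla f(x)\|^2 \;\ge\; 2\mu \left( f(x) - f^* \right) \quad \text{for all } x .
```

Under this condition, gradient descent with step size $1/L$ on an $L$-smooth function converges linearly, $f(x_{t+1}) - f^* \le (1 - \mu/L)\,\bigl(f(x_t) - f^*\bigr)$, despite $f$ possibly being nonconvex. Note that PL rules out saddle points with $f(x) > f^*$ (the gradient cannot vanish there), which is one reason the function class proposed in this talk, allowing saddle points, is a genuine relaxation.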