Session: First-order methods in machine learning
Chair: Yaoliang Yu
Cluster: Optimization Applications (Communication, Energy, Health, ML, ...)
Talk 1: Independently-Normalized SGD for Generalized-Smooth Nonconvex Optimization
Speaker: Yufeng Yang
Abstract: Recent studies have shown that many nonconvex machine learning problems satisfy a so-called generalized-smooth condition that extends beyond traditional smooth nonconvex optimization. However, existing algorithms for generalized-smooth nonconvex optimization suffer from significant limitations in both their design and their convergence analysis. In this work, we first study deterministic generalized-smooth nonconvex optimization and analyze the convergence of normalized gradient descent under the generalized Polyak-Łojasiewicz condition. Our results provide a comprehensive understanding of the interplay between gradient normalization and function geometry. Then, for stochastic generalized-smooth nonconvex optimization, we propose an independently-normalized stochastic gradient descent algorithm, which leverages independent sampling, gradient normalization, and gradient clipping to achieve the standard sample complexity under relaxed assumptions. Experiments demonstrate the fast convergence of our algorithm.
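The abstract does not spell out the update rule, so the following is a minimal sketch of one plausible reading, in which "independent sampling" means drawing two independent stochastic gradients per step: one for the update direction and one used only inside a clipped normalization factor. The function name, step size, and clipping constant are illustrative, not the authors' exact algorithm.

```python
import numpy as np

def independently_normalized_sgd(grad_fn, x0, lr=0.05, clip=1.0,
                                 steps=2000, seed=0):
    """Illustrative sketch: two independent gradient samples per step,
    one for the direction and one for the (clipped) normalizer, so the
    step size is statistically independent of the direction. The clip
    caps each step's length at lr / clip, which keeps steps bounded even
    when local smoothness grows with the gradient norm."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g_dir = grad_fn(x, rng)   # sample used for the update direction
        g_nrm = grad_fn(x, rng)   # independent sample, used only to normalize
        scale = 1.0 / max(np.linalg.norm(g_nrm), clip)
        x = x - lr * scale * g_dir
    return x

# Toy usage: noisy quadratic f(x) = ||x||^2 with additive gradient noise.
grad = lambda x, rng: 2.0 * x + rng.normal(scale=0.5, size=x.shape)
print(np.linalg.norm(independently_normalized_sgd(grad, x0=np.ones(10))))
```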
Talk 2: Why does the Adam algorithm work so well for training large language models?
Speaker: Mark Schmidt
Abstract: The success of the Adam optimizer on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, it is unclear why the gap between Adam and SGD is often large for large language models (LLMs) but small for computer vision benchmarks. Recent work proposed that Adam works better for LLMs because of heavy-tailed noise in the stochastic gradients. We present evidence that this noise is not a major factor in the performance gap between SGD and Adam. Instead, we show that a key factor is the class imbalance found in language tasks: the large number of low-frequency classes causes SGD to converge slowly but has a smaller effect on Adam and sign descent. We show that a gap between SGD and Adam can be induced by adding a large number of low-frequency classes to computer vision models, or even to linear models. We further prove, in a simple setting, that gradient descent converges slowly while sign descent does not.
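To make the mechanism concrete, here is a toy illustration (my construction, not the paper's experiments): coordinates with tiny gradients stand in for low-frequency classes. Gradient descent, whose per-coordinate progress scales with gradient magnitude, stalls on them, while sign descent makes uniform progress on every coordinate.

```python
import numpy as np

def compare_gd_vs_sign(grad_fn, x0, lr_gd=0.1, lr_sign=0.01, steps=500):
    """Toy comparison of gradient descent vs. sign descent on a problem
    whose coordinates have very different gradient scales, mimicking
    frequent vs. rare classes."""
    x_gd, x_sd = x0.copy(), x0.copy()
    for _ in range(steps):
        x_gd -= lr_gd * grad_fn(x_gd)             # progress ∝ gradient size
        x_sd -= lr_sign * np.sign(grad_fn(x_sd))  # uniform per-coordinate step
    return x_gd, x_sd

# Quadratic with per-coordinate curvatures spanning 4 orders of magnitude;
# the first coordinates (tiny curvature, "rare classes") barely move under
# GD but reach a small neighborhood of 0 under sign descent.
scales = np.logspace(-4, 0, 50)
grad = lambda x: scales * x
gd, sd = compare_gd_vs_sign(grad, x0=np.ones(50))
print("GD residual on rare coords:  ", np.abs(gd[:5]).max())
print("Sign residual on rare coords:", np.abs(sd[:5]).max())
```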
Talk 3: On Games with Conflicting Interests
Speaker: Baoxiang Wang
Abstract: To investigate differentiable games (games with strategy spaces in $\mathbb{R}^n$), we decompose the game into components whose dynamics are well understood. We show that any differentiable game admits two decompositions into a direct sum of three parts: the first into an exact potential part, a near vector potential part, and a non-strategic part; the second into a near potential part, an exact vector potential part, and a non-strategic part. The potential part coincides with the potential games of Monderer and Shapley (1996), known as purely cooperative games. A vector potential game, on the other hand, represents a game with purely conflicting interests; we show that its individual gradient field is divergence-free, in which case the gradient descent dynamics may be either divergent or recurrent. When the divergence-free game is finite, which includes harmonic games and important classes of zero-sum games, we show that optimistic variants of classical no-regret learning algorithms converge to an $\epsilon$-approximate Nash equilibrium at a rate of $O(1/\epsilon^2)$.
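A minimal sketch of the divergence-free phenomenon on the classical bilinear game min_x max_y x*y (my choice of example; the talk treats general differentiable games): the individual gradient field (y, -x) is divergence-free, plain simultaneous gradient descent-ascent spirals away from the equilibrium (0, 0), while the optimistic variant, which corrects each step using the previous gradient, converges.

```python
import numpy as np

def bilinear_play(optimistic, lr=0.1, steps=500):
    """min_x max_y f(x, y) = x * y. Plain gradient descent-ascent (GDA)
    diverges on this divergence-free field; optimistic GDA, which uses
    the extrapolated gradient 2*g_t - g_{t-1}, converges to (0, 0)."""
    x, y = 1.0, 1.0
    gx_prev, gy_prev = 0.0, 0.0
    for _ in range(steps):
        gx, gy = y, x  # ∂f/∂x and ∂f/∂y at the current joint strategy
        if optimistic:
            x, y = x - lr * (2 * gx - gx_prev), y + lr * (2 * gy - gy_prev)
        else:
            x, y = x - lr * gx, y + lr * gy
        gx_prev, gy_prev = gx, gy
    return np.hypot(x, y)  # distance to the Nash equilibrium (0, 0)

print("GDA distance:       ", bilinear_play(optimistic=False))
print("Optimistic distance:", bilinear_play(optimistic=True))
```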