Session: First-order methods in machine learning
Chair: Yaoliang Yu
Cluster: Optimization Applications (Communication, Energy, Health, ML, ...)
Talk 1: Independently-Normalized SGD for Generalized-Smooth Nonconvex Optimization
Speaker: Yufeng Yang
Abstract: Recent studies have shown that many nonconvex machine learning problems satisfy a so-called generalized-smooth condition that extends beyond traditional smooth nonconvex optimization. However, existing algorithms for generalized-smooth nonconvex optimization suffer from significant limitations in both their design and their convergence analysis. In this work, we first study deterministic generalized-smooth nonconvex optimization and analyze the convergence of normalized gradient descent under the generalized Polyak-Łojasiewicz condition. Our results provide a comprehensive understanding of the interplay between gradient normalization and function geometry. Then, for stochastic generalized-smooth nonconvex optimization, we propose an independently-normalized stochastic gradient descent algorithm, which leverages independent sampling, gradient normalization, and gradient clipping to achieve the standard sample complexity under relaxed assumptions. Experiments demonstrate the fast convergence of our algorithm.
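The abstract does not spell out the update rule, so the following is a minimal sketch of one plausible reading, in which "independent sampling" means drawing two independent stochastic gradients per step: one for the update direction and one used only inside a clipped normalization factor. The function name, step size, and clipping constant are illustrative, not the authors' exact algorithm.

```python
import numpy as np

def independently_normalized_sgd(grad_fn, x0, lr=0.05, clip=1.0,
                                 steps=2000, seed=0):
    """Illustrative sketch: two independent gradient samples per step,
    one for the direction and one for the (clipped) normalizer, so the
    step size is statistically independent of the direction. The clip
    caps each step's length at lr / clip, which keeps steps bounded even
    when local smoothness grows with the gradient norm."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g_dir = grad_fn(x, rng)   # sample used for the update direction
        g_nrm = grad_fn(x, rng)   # independent sample, used only to normalize
        scale = 1.0 / max(np.linalg.norm(g_nrm), clip)
        x = x - lr * scale * g_dir
    return x

# Toy usage: noisy quadratic f(x) = ||x||^2 with additive gradient noise.
grad = lambda x, rng: 2.0 * x + rng.normal(scale=0.5, size=x.shape)
print(np.linalg.norm(independently_normalized_sgd(grad, x0=np.ones(10))))
```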
Talk 2: Why does the Adam algorithm work so well for training large language models?
Speaker: Mark Schmidt
Abstract: The success of the Adam optimizer on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, it is unclear why the gap between Adam and SGD is often large for large language models (LLMs) but small for computer vision benchmarks. Recent work proposed that Adam works better for LLMs because of heavy-tailed noise in the stochastic gradients. We present evidence that this noise is not a major factor in the performance gap between SGD and Adam. Instead, we show that a key factor is the class imbalance found in language tasks: the large number of low-frequency classes causes SGD to converge slowly but has a smaller effect on Adam and sign descent. We show that a gap between SGD and Adam can be induced by adding a large number of low-frequency classes to computer vision models, or even to linear models. We further prove, in a simple setting, that gradient descent converges slowly while sign descent does not.
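To make the mechanism concrete, here is a toy illustration (my construction, not the paper's experiments): coordinates with tiny gradients stand in for low-frequency classes. Gradient descent, whose per-coordinate progress scales with gradient magnitude, stalls on them, while sign descent makes uniform progress on every coordinate.

```python
import numpy as np

def compare_gd_vs_sign(grad_fn, x0, lr_gd=0.1, lr_sign=0.01, steps=500):
    """Toy comparison of gradient descent vs. sign descent on a problem
    whose coordinates have very different gradient scales, mimicking
    frequent vs. rare classes."""
    x_gd, x_sd = x0.copy(), x0.copy()
    for _ in range(steps):
        x_gd -= lr_gd * grad_fn(x_gd)             # progress ∝ gradient size
        x_sd -= lr_sign * np.sign(grad_fn(x_sd))  # uniform per-coordinate step
    return x_gd, x_sd

# Quadratic with per-coordinate curvatures spanning 4 orders of magnitude;
# the first coordinates (tiny curvature, "rare classes") barely move under
# GD but reach a small neighborhood of 0 under sign descent.
scales = np.logspace(-4, 0, 50)
grad = lambda x: scales * x
gd, sd = compare_gd_vs_sign(grad, x0=np.ones(50))
print("GD residual on rare coords:  ", np.abs(gd[:5]).max())
print("Sign residual on rare coords:", np.abs(sd[:5]).max())
```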
Talk 3: On Games with Conflicting Interests
Speaker: Baoxiang Wang
Abstract: To investigate differentiable games (games with strategy spaces in $\mathbb{R}^n$), we decompose the game into components whose dynamics are well understood. We show that any differentiable game admits two decompositions into a direct sum of three parts: the first into an exact potential part, a near vector potential part, and a non-strategic part; the second into a near potential part, an exact vector potential part, and a non-strategic part. The potential part coincides with the potential games of Monderer and Shapley (1996), known as purely cooperative games. A vector potential game, on the other hand, represents a game with purely conflicting interests; we show that its individual gradient field is divergence-free, in which case the gradient descent dynamics may be either divergent or recurrent. When the divergence-free game is finite, which includes harmonic games and important classes of zero-sum games, we show that optimistic variants of classical no-regret learning algorithms converge to an $\epsilon$-approximate Nash equilibrium at a rate of $O(1/\epsilon^2)$.
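A minimal sketch of the divergence-free phenomenon on the classical bilinear game min_x max_y x*y (my choice of example; the talk treats general differentiable games): the individual gradient field (y, -x) is divergence-free, plain simultaneous gradient descent-ascent spirals away from the equilibrium (0, 0), while the optimistic variant, which corrects each step using the previous gradient, converges.

```python
import numpy as np

def bilinear_play(optimistic, lr=0.1, steps=500):
    """min_x max_y f(x, y) = x * y. Plain gradient descent-ascent (GDA)
    diverges on this divergence-free field; optimistic GDA, which uses
    the extrapolated gradient 2*g_t - g_{t-1}, converges to (0, 0)."""
    x, y = 1.0, 1.0
    gx_prev, gy_prev = 0.0, 0.0
    for _ in range(steps):
        gx, gy = y, x  # ∂f/∂x and ∂f/∂y at the current joint strategy
        if optimistic:
            x, y = x - lr * (2 * gx - gx_prev), y + lr * (2 * gy - gy_prev)
        else:
            x, y = x - lr * gx, y + lr * gy
        gx_prev, gy_prev = gx, gy
    return np.hypot(x, y)  # distance to the Nash equilibrium (0, 0)

print("GDA distance:       ", bilinear_play(optimistic=False))
print("Optimistic distance:", bilinear_play(optimistic=True))
```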