Monday July 21, 2025 1:15pm - 2:30pm PDT
Session: Efficient Optimization Methods for LLMs (Part I)
Chair: Ruoyu Sun
Cluster: Optimization for Emerging Technologies (LLMs, Quantum Computing, ...)

Talk 1: Why Transformers Need Adam and Memory-Efficient Adam-mini
Speaker: Ruoyu Sun
Abstract: Adam is a popular algorithm for training large language models (LLMs), yet its underlying mechanisms remain poorly understood. We explain why Adam significantly outperforms SGD in training Transformers: (i) the Hessian spectra vary drastically across different parameter blocks in Transformers, and (ii) SGD performs poorly on problems with such block-wise heterogeneity. Inspired by these insights, we propose Adam-mini, a new optimizer that matches Adam's performance on tasks like Llama3-8B pretraining while reducing memory usage by 35–50%, or increasing throughput by 33–50% under the same memory constraints.
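
The core idea is easy to sketch. Below is a minimal, hypothetical PyTorch rendering of the Adam-mini mechanism (not the released implementation): keep Adam's per-coordinate first moment, but replace the per-coordinate second moment with a single scalar per parameter block. For brevity this sketch treats each parameter tensor as one block, whereas the actual method partitions parameters according to the Hessian's block structure (e.g., per attention head).

    import torch

    class AdamMiniSketch(torch.optim.Optimizer):
        # Hypothetical sketch: per-coordinate momentum "m" as in Adam, but a
        # single scalar second moment "v" per block (here: per tensor), which
        # is what removes most of Adam's optimizer-state memory.
        def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
            super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

        @torch.no_grad()
        def step(self):
            for group in self.param_groups:
                b1, b2 = group["betas"]
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    s = self.state[p]
                    if not s:
                        s["t"] = 0
                        s["m"] = torch.zeros_like(p)               # first moment, full size
                        s["v"] = torch.zeros((), device=p.device)  # second moment, ONE scalar per block
                    s["t"] += 1
                    s["m"].mul_(b1).add_(p.grad, alpha=1 - b1)
                    # block-wise mean of squared gradients replaces per-coordinate v
                    s["v"].mul_(b2).add_(p.grad.pow(2).mean(), alpha=1 - b2)
                    m_hat = s["m"] / (1 - b1 ** s["t"])
                    v_hat = s["v"] / (1 - b2 ** s["t"])
                    p.add_(m_hat / (v_hat.sqrt() + group["eps"]), alpha=-group["lr"])

Adam stores two floats of state per parameter; a scheme like this stores roughly one plus a negligible per-block scalar, which is where savings of the quoted magnitude come from.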

Talk 2: Democratizing LLM Training via Low-Rank Methods
Speaker: Zhenyu Zhang
Abstract: In this talk, we explore the evolution of low-rank optimization techniques for Large Language Models (LLMs), tracing the progression from GaLore to APOLLO. GaLore introduced a memory-efficient training paradigm by applying low-rank projections to gradients and optimizer states, enabling the training of models with up to 7 billion parameters on consumer-grade GPUs such as the NVIDIA RTX 4090. This significantly reduced memory consumption and democratized access to large-scale LLM training. Building on this foundation, APOLLO demonstrates the surprising effectiveness of rank-1 optimizer states with purely random projections, achieving SGD-level memory overhead while maintaining performance comparable to AdamW. This method brings substantial system-level improvements, including a 3× increase in throughput on 8×A100-80G GPUs and enhanced model scalability. Notably, it enables pre-training of the LLaMA-13B model using naive DDP on A100-80G GPUs without requiring additional system-level optimizations, and supports training models with up to 7 billion parameters within just 12GB of GPU memory.
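
To make the low-rank idea concrete, here is a minimal, hypothetical sketch of a GaLore-style update (not the released library): the gradient of an m x n weight matrix is compressed into an r-dimensional subspace, Adam's moment estimates live only in that r x n space, and the update is projected back to full size. The projection here comes from a one-off SVD for brevity; GaLore recomputes it periodically, and APOLLO, as the abstract describes, goes further by replacing the SVD with purely random projections at rank as low as 1 and using the compressed state only to rescale the raw gradient.

    import torch

    @torch.no_grad()
    def low_rank_adam_step(W, state, lr=1e-2, rank=4, betas=(0.9, 0.999), eps=1e-8):
        # Hypothetical GaLore-style step: optimizer state is r x n instead of m x n.
        G = W.grad                                   # full gradient, m x n
        if "P" not in state:
            U, _, _ = torch.linalg.svd(G, full_matrices=False)
            state["P"] = U[:, :rank]                 # m x r projection (refreshed periodically in GaLore)
            state["m"] = torch.zeros(rank, G.shape[1], device=G.device)
            state["v"] = torch.zeros(rank, G.shape[1], device=G.device)
            state["t"] = 0
        P, b1, b2 = state["P"], *betas
        state["t"] += 1
        R = P.T @ G                                  # compressed gradient, r x n
        state["m"].mul_(b1).add_(R, alpha=1 - b1)
        state["v"].mul_(b2).addcmul_(R, R, value=1 - b2)
        m_hat = state["m"] / (1 - b1 ** state["t"])
        v_hat = state["v"] / (1 - b2 ** state["t"])
        W.add_(P @ (m_hat / (v_hat.sqrt() + eps)), alpha=-lr)  # project update back to m x n

    # Usage sketch: optimizer state shrinks from 2*m*n to 2*rank*n floats.
    W = torch.nn.Parameter(torch.randn(256, 128))
    W.grad = torch.randn_like(W)
    opt_state = {}
    low_rank_adam_step(W, opt_state)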

Talk 3: Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling
Speaker: Lechao Xiao
Abstract: The remarkable success of large language pretraining and the discovery of scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This talk examines several influential regularization-based principles that may no longer hold in the scaling-centric, large language model (LLM) era, including explicit L2 regularization and the implicit regularization of small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed "scaling law crossover," where two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may not generalize to larger ones. Together, these observations highlight two fundamental questions within this new paradigm. (1) Guiding principles for scaling: if regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling? (2) Model comparison at scale: how can we reliably and effectively compare models at a scale where only a single experiment is feasible?
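
The crossover phenomenon can be illustrated with a toy calculation. Assuming, purely for illustration and with invented constants, that each method's loss follows a saturating power law L(C) = a * C^(-alpha) + c in compute C, a method with a better small-scale constant but a worse asymptote is eventually overtaken:

    # Toy illustration of a scaling law crossover; all constants are made up.
    def loss(C, a, alpha, c):
        return a * C ** (-alpha) + c

    for C in [1e18, 1e19, 1e21, 1e24]:
        l_a = loss(C, a=2e3, alpha=0.2, c=1.9)   # method A: strong early, worse asymptote
        l_b = loss(C, a=4e3, alpha=0.2, c=1.7)   # method B: weak early, better asymptote
        print(f"C={C:.0e}  A={l_a:.3f}  B={l_b:.3f}  leader={'A' if l_a < l_b else 'B'}")

With these constants the curves cross at C = 1e20, so extrapolating the winner from runs below that scale picks the wrong method, which is exactly why reliable model comparison at scale becomes a fundamental question.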

Speakers

Ruoyu Sun

Name: Dr. Slothington "Slow Convergence" McNapface
Title: Distinguished Professor of Continuous Optimization & Energy Minimization
Affiliation: The Lush Canopy Institute of Sluggish Algorithms
Bio: Dr. Slothington McNapface is a leading expert in continuous optimization, specializing...

Zhenyu Zhang

Ph.D. Student, University of Texas at Austin
Location: Taper Hall (THH) 212, 3501 Trousdale Pkwy, Los Angeles, CA 90089
