Session: Efficient Optimization Methods for LLMs (Part I)
Chair: Ruoyu Sun
Cluster: Optimization for Emerging Technologies (LLMs, Quantum Computing, ...)
Talk 1: Why Transformers Need Adam and Memory-Efficient Adam-mini
Speaker: Ruoyu Sun
Abstract: Adam is a popular algorithm for training large language models (LLMs), yet its underlying mechanisms remain poorly understood. We explain why Adam significantly outperforms SGD in training Transformers: (i) the Hessian spectra vary drastically across different parameter blocks, and (ii) SGD performs poorly on problems with such block-wise heterogeneity. Inspired by these insights, we propose Adam-mini, a new optimizer that matches Adam's performance on tasks such as Llama3-8B pretraining while reducing memory usage by 35–50%, or, under the same memory budget, increasing throughput by 33–50%.
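The block-wise design can be made concrete with a short sketch. The snippet below is a minimal illustration of the idea described in the abstract, not the authors' implementation: it keeps Adam's per-coordinate first moment but replaces the per-coordinate second moment with a single scalar per parameter block. The block partition (one block per parameter tensor), function name, and hyperparameters are assumptions made for brevity.

```python
# Minimal sketch of the block-wise idea behind Adam-mini (illustrative only,
# not the authors' implementation). Adam's per-coordinate first moment is kept,
# but the per-coordinate second moment is replaced by a single scalar per
# parameter block; here each parameter tensor is treated as one block.
import torch

@torch.no_grad()
def adam_mini_style_step(params, states, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update over a list of parameter tensors, one block per tensor."""
    for p in params:
        if p.grad is None:
            continue
        if p not in states:
            states[p] = {"m": torch.zeros_like(p), "v": 0.0, "t": 0}
        st = states[p]
        st["t"] += 1
        g = p.grad
        # Per-coordinate first moment, exactly as in Adam.
        st["m"].mul_(beta1).add_(g, alpha=1 - beta1)
        # One second-moment scalar for the whole block: mean of squared gradients.
        st["v"] = beta2 * st["v"] + (1 - beta2) * g.pow(2).mean().item()
        m_hat = st["m"] / (1 - beta1 ** st["t"])
        v_hat = st["v"] / (1 - beta2 ** st["t"])
        p.add_(m_hat / (v_hat ** 0.5 + eps), alpha=-lr)
```

Storing one scalar per block is what removes most of the second-moment memory, while the block-wise granularity is meant to match the block-wise Hessian heterogeneity noted in the abstract.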
Talk 2: Democratizing LLM Training via Low-Rank Methods
Speaker: Zhenyu Zhang
Abstract: In this talk, we explore the evolution of low-rank optimization techniques for Large Language Models (LLMs), tracing the progression from GaLore to APOLLO. GaLore introduced a memory-efficient training paradigm by applying low-rank projections to gradients and optimizer states, enabling the training of models with up to 7 billion parameters on consumer-grade GPUs such as the NVIDIA RTX 4090. This significantly reduced memory consumption and democratized access to large-scale LLM training. Building on this foundation, APOLLO demonstrates the surprising effectiveness of rank-1 optimizer states with purely random projections, achieving SGD-level memory overhead while maintaining performance comparable to AdamW. This method brings substantial system-level improvements, including a 3× increase in throughput on 8×A100-80G GPUs and enhanced model scalability. Notably, it enables pre-training of the LLaMA-13B model using naive DDP on A100-80G GPUs without requiring additional system-level optimizations, and supports training models with up to 7 billion parameters within just 12GB of GPU memory.
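To make the low-rank paradigm concrete, here is a rough sketch of a GaLore-style projected update, under assumptions chosen for brevity rather than fidelity to the released code: the projection is refreshed by an SVD of the current gradient every `refresh_every` steps, the Adam moments live in the small projected space, and the moments are simply reset at each refresh. All names and hyperparameters are illustrative; in the APOLLO variant described above, the SVD-based projection would be replaced by a purely random, even rank-1, projection.

```python
# Rough sketch of a GaLore-style low-rank projected update (illustrative only,
# not the released GaLore/APOLLO code). The gradient of a weight matrix is
# projected into a rank-r subspace, the Adam moments live in that small space,
# and the resulting update is projected back to full size.
import torch

class LowRankState:
    def __init__(self, rank, refresh_every=200):
        self.rank, self.refresh_every = rank, refresh_every
        self.P = None            # m x r projection, refreshed via SVD
        self.m = self.v = None   # Adam moments in the r x n projected space
        self.t = 0

@torch.no_grad()
def galore_style_step(W, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    G = W.grad                   # full-rank gradient, shape m x n
    state.t += 1
    if state.P is None or state.t % state.refresh_every == 1:
        # Refresh the subspace from the top-r left singular vectors of G.
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        state.P = U[:, :state.rank]
        # Resetting the moments at each refresh is a simplification.
        state.m = torch.zeros(state.rank, G.shape[1], device=G.device, dtype=G.dtype)
        state.v = torch.zeros_like(state.m)
    R = state.P.T @ G            # projected gradient, shape r x n
    state.m.mul_(beta1).add_(R, alpha=1 - beta1)
    state.v.mul_(beta2).addcmul_(R, R, value=1 - beta2)
    m_hat = state.m / (1 - beta1 ** state.t)
    v_hat = state.v / (1 - beta2 ** state.t)
    W.add_(state.P @ (m_hat / (v_hat.sqrt() + eps)), alpha=-lr)  # back-project
```

The memory saving comes from keeping the optimizer states at size r x n instead of m x n; shrinking r toward 1 and choosing the projection randomly is what pushes the overhead toward SGD levels.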
Talk 3: Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling
Speaker: Lechao Xiao
Abstract: The remarkable success of large language model pretraining and the discovery of scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This talk examines several influential regularization-based principles that may no longer hold in the scaling-centric, large language model (LLM) era, including explicit L2 regularization and the implicit regularization induced by small batch sizes and large learning rates. Additionally, we identify a new phenomenon, termed "scaling law crossover," in which two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may not generalize to larger ones. Together, these observations highlight two fundamental questions within this new paradigm: (1) Guiding principles for scaling: if regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling? (2) Model comparison at scale: how can models be compared reliably and effectively at a scale where only a single experiment is feasible?
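For concreteness, the crossover phenomenon can be written down under a simple assumption (an illustration, not a result from the talk): if two methods A and B happen to follow power-law scaling fits of the usual form, their loss curves intersect at a single computable scale.

```latex
% Illustrative only: suppose methods A and B follow power-law scaling fits
% L_A(N) = a_A N^{-\alpha_A} and L_B(N) = a_B N^{-\alpha_B} in a scale N.
% Setting L_A(N^*) = L_B(N^*) gives the crossover scale:
\[
  a_A (N^*)^{-\alpha_A} = a_B (N^*)^{-\alpha_B}
  \quad\Longrightarrow\quad
  N^* = \left(\frac{a_A}{a_B}\right)^{\frac{1}{\alpha_A - \alpha_B}},
\]
% so a method that wins for N < N^* can lose for N > N^*, and conclusions
% drawn from small-scale comparisons need not transfer to larger scales.
```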