Session: Optimization for improving privacy and alignment for LLMs
Chair: Mingyi Hong
Cluster: Optimization for Emerging Technologies (LLMs, Quantum Computing, ...)

Talk 1: Multi-step Preference Optimization via Two-Player Markov Games
Speaker: Volkan Cevher
Abstract: Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the language model as a bandit problem, which limits their applicability in real-world scenarios where multi-turn conversations are common. Additionally, DPO relies on the Bradley-Terry model assumption, which does not adequately capture the non-transitive nature of human preferences. In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game, where each player seeks to maximize their winning rate against the other across all steps of the conversation. Our approach, Multi-step Preference Optimization (MPO), is built upon the natural actor-critic framework (Peters & Schaal, 2008). We further develop OMPO based on the optimistic online gradient descent algorithm (Rakhlin & Sridharan, 2013; Joulani et al., 2017). Theoretically, we provide a rigorous convergence analysis for both algorithms and show that OMPO requires O(ε⁻¹) policy updates to converge to an ε-approximate Nash equilibrium. We also validate the effectiveness of our method through experiments on the multi-turn conversation dataset MT-bench-101.
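
As a concrete illustration of the optimistic update behind OMPO, here is a minimal sketch (our own toy construction, not the authors' implementation) of optimistic online gradient descent on a small two-player constant-sum matrix game; the payoff matrix A, the step size eta, and the project_simplex helper are all assumptions of ours:

```python
# Toy two-player constant-sum game: min_x max_y x^T A y over probability
# simplices, solved with optimistic online gradient descent. Illustrative
# only; not the MPO/OMPO implementation from the talk.
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))           # payoff matrix of the toy game
x = np.full(4, 0.25)                      # min player's mixed strategy
y = np.full(4, 0.25)                      # max player's mixed strategy
gx_prev, gy_prev = A @ y, A.T @ x         # gradients at the previous step
eta = 0.05
for t in range(2000):
    gx, gy = A @ y, A.T @ x
    # optimistic step: extrapolate with 2*g_t - g_{t-1}
    x = project_simplex(x - eta * (2 * gx - gx_prev))
    y = project_simplex(y + eta * (2 * gy - gy_prev))
    gx_prev, gy_prev = gx, gy

# duality gap: a standard measure of distance to a Nash equilibrium
gap = np.max(A.T @ x) - np.min(A @ y)
print(f"duality gap after optimistic OGD: {gap:.4f}")
```

The 2g_t − g_{t−1} extrapolation is the optimistic correction of Rakhlin & Sridharan (2013); on this toy game it drives the duality gap toward zero, mirroring only in spirit the O(ε⁻¹) guarantee stated in the abstract.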

Talk 2: Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration via Bilevel Optimization Improves LLM Alignment
Speaker: Mingyi Hong
Abstract: Aligning with human preferences and values is an important requirement for contemporary foundation models. State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) often consist of two stages: 1) supervised fine-tuning (SFT), where the model is fine-tuned by learning from human demonstration data; 2) preference learning, where preference data is used to learn a reward model, which is in turn used by a reinforcement learning (RL) step to fine-tune the model. The reward model serves as a proxy for human preference, and it is critical for guiding the RL step toward improving model quality. In this work, we argue that the SFT stage significantly benefits from learning a reward model as well. Instead of using the human demonstration data directly via supervised learning, we propose to leverage Inverse Reinforcement Learning (IRL) and bilevel optimization techniques to simultaneously build a reward model and a policy model. This approach leads to new SFT algorithms that are not only efficient to implement but also robust to the presence of low-quality supervised learning data. Moreover, we discover a connection between the proposed IRL-based approach and a recent line of work on Self-Play Fine-Tuning (SPIN). Theoretically, we show that the proposed algorithms converge to stationary solutions of the IRL problem. Empirically, we align 1B and 7B models using the proposed methods and evaluate them on a reward model benchmark and the HuggingFace Open LLM Leaderboard. The proposed methods show significant performance improvement over existing SFT approaches. Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
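
To make the bilevel structure concrete, the following toy sketch (our own single-state construction, not the paper's algorithm) alternates a closed-form lower-level step, computing the entropy-regularized optimal policy for the current reward, with an upper-level gradient step that fits the reward to human demonstrations; beta, lr, and the synthetic demonstration data are assumptions of ours:

```python
# Toy bilevel IRL on a single-state, K-action problem. The lower level has
# a closed form (softmax policy for the current reward); the upper level
# adjusts the reward so the induced policy matches the demonstrations.
import numpy as np

rng = np.random.default_rng(0)
K, beta, lr = 5, 1.0, 0.5
true_pref = np.array([0.05, 0.05, 0.6, 0.25, 0.05])    # demo distribution
demos = rng.choice(K, size=2000, p=true_pref)           # human demonstrations
demo_freq = np.bincount(demos, minlength=K) / len(demos)

r = np.zeros(K)                                         # reward parameters
for step in range(500):
    # lower level (closed form here): entropy-regularized optimal policy
    pi = np.exp(r / beta); pi /= pi.sum()
    # upper level: gradient ascent on demo log-likelihood w.r.t. reward;
    # for this softmax family the gradient is (demo_freq - pi) / beta
    r += lr * (demo_freq - pi) / beta

pi = np.exp(r / beta); pi /= pi.sum()
print("learned policy:", np.round(pi, 3))
print("demo frequency:", demo_freq)
```

The learned reward makes the soft-optimal policy reproduce the demonstration frequencies, which is the sense in which reward learning can "get more juice" out of the same SFT data than direct likelihood training on it.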

Talk 3: Pre-training Differentially Private Models with Limited Public Data
Speaker: Xinwei Zhang
Abstract: The superior performance of large foundation models relies on the use of massive amounts of high-quality data, which often contain sensitive, private, and copyrighted material that requires formal protection. While differential privacy (DP) is a prominent method for gauging the degree of security provided to a model, its application is commonly limited to the fine-tuning stage because of the performance degradation incurred when DP is applied during pre-training. Consequently, DP has yet to protect a substantial portion of the data used during the initial pre-training process. In this work, we provide a theoretical understanding of the efficacy of DP training by analyzing the per-iteration loss improvement through the lens of the Hessian matrix for large neural networks. We make the key observation that the performance degradation of DP optimizers can be significantly mitigated by the use of limited public data, which leads to a novel DP continual pre-training strategy. Empirically, using only 10% public data, our strategy achieves DP accuracy of 41.5% on ImageNet-21k (with ε=8), as well as non-DP accuracy of 55.7% and 60.0% on the downstream tasks Places365 and iNaturalist-2021, respectively, on par with state-of-the-art standard pre-training and substantially outperforming existing DP pre-trained models.
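
For intuition, here is a minimal sketch of the general recipe the abstract builds on: standard DP-SGD (per-sample gradient clipping plus Gaussian noise, as in Abadi et al., 2016) interleaved with occasional clean steps on a small public split. This is our own toy linear-regression illustration, not the paper's continual pre-training strategy, and it omits the privacy accounting that maps (clip norm, noise scale, steps) to an ε such as the ε=8 quoted above:

```python
# DP-SGD with a small public split, on toy linear regression. Illustrative
# only: no privacy accounting, and the private/public mixing schedule is
# an assumption of ours.
import numpy as np

rng = np.random.default_rng(0)
d, n_priv, n_pub = 10, 1000, 100
w_true = rng.standard_normal(d)
Xp = rng.standard_normal((n_priv, d)); yp = Xp @ w_true   # private data
Xq = rng.standard_normal((n_pub, d));  yq = Xq @ w_true   # public split

w = np.zeros(d)
clip, sigma, lr, batch = 1.0, 1.0, 0.1, 64
for step in range(300):
    idx = rng.choice(n_priv, batch, replace=False)
    # per-sample gradients of the squared loss, clipped to norm <= clip
    g = (Xp[idx] @ w - yp[idx])[:, None] * Xp[idx]
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g / np.maximum(1.0, norms / clip)
    noise = rng.normal(0.0, sigma * clip, d)              # Gaussian mechanism
    w -= lr * (g.sum(0) + noise) / batch                  # private DP step
    if step % 10 == 0:                                    # occasional public step
        gq = Xq.T @ (Xq @ w - yq) / n_pub
        w -= lr * gq                                      # clean, unnoised

print("final parameter error:", np.linalg.norm(w - w_true))
```

The unnoised public steps periodically correct the bias that clipping and noise inject into the private updates, which is one plausible reading of how a limited public split can mitigate DP optimizers' degradation.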

Speakers

Volkan Cevher

Mingyi Hong

Xinwei Zhang
Tuesday July 22, 2025 4:15pm - 5:30pm PDT
Joseph Medicine Crow Center for International and Public Affairs (DMC), Room 156, 3518 Trousdale Pkwy, Los Angeles, CA 90089
