Session: Optimization for Large Language Models and Kernels
Chair: Ming Yin
Cluster: Optimization Applications (Communication, Energy, Health, ML, ...)
Talk 1: Optimizing for a Proxy Reward in RLHF
Speaker: Banghua Zhu
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become an important technique in the post-training of Large Language Models (LLMs). During RLHF, one usually first trains a reward model from human preference data, and then optimizes the LLM for the proxy reward signal predicted by the reward model. In this talk, I'll discuss what makes a good reward model for RLHF, drawing on both theoretical and empirical observations.
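For readers unfamiliar with the pipeline the abstract refers to, the following is a minimal sketch (not the speaker's code; all names are illustrative) of the two stages: fitting a reward model on preference pairs with a Bradley-Terry loss, then optimizing the policy against that proxy reward with a KL penalty toward a reference model.

# Minimal sketch of the standard two-stage RLHF pipeline (illustrative only).
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood that the chosen response beats the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def kl_regularized_objective(proxy_reward: torch.Tensor,
                             logp_policy: torch.Tensor,
                             logp_reference: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    # Maximize the proxy reward while penalizing divergence from the reference
    # policy; returned as a loss to minimize.
    kl = logp_policy - logp_reference
    return -(proxy_reward - beta * kl).mean()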
Talk 2: Self-Play Preference Optimization for Language Model Alignment
Speaker: Yue Wu
Abstract: In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at optimizing the model to approximate the Nash equilibrium. Our approach, dubbed SPPO, is based on a new alignment objective derived from L2 regression. Interestingly, this new objective has a deep connection with KL-regularized policy gradient and natural gradient methods, and can guarantee convergence to the optimal solution. In our experiments, this theoretically motivated objective turns out to be highly effective. By leveraging a small pre-trained preference model, SPPO can obtain a highly aligned model without additional external supervision from humans or stronger language models.
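The sketch below illustrates one plausible form of the squared-loss (L2 regression) objective the abstract alludes to: regressing the log-density ratio between the current policy and the previous iterate onto the centered win probability from a small preference model. The exact loss used in SPPO may differ; this is an assumed, illustrative version.

# Illustrative sketch of an SPPO-style squared-loss alignment objective (assumed form).
import torch

def sppo_style_loss(logp_new: torch.Tensor,   # log pi_theta(y|x), current policy
                    logp_old: torch.Tensor,   # log pi_t(y|x), previous iterate
                    win_prob: torch.Tensor,   # estimated P(y beats pi_t | x)
                    eta: float = 1.0) -> torch.Tensor:
    # L2 regression: push the log-ratio toward the centered preference signal
    # eta * (P(y wins) - 1/2), so preferred responses gain probability mass.
    log_ratio = logp_new - logp_old
    target = eta * (win_prob - 0.5)
    return ((log_ratio - target) ** 2).mean()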
Talk 3: Learning Counterfactual Distributions via Kernel Nearest Neighbors
Speaker: Kyuseong Choi
Abstract: Consider a setting with multiple units (e.g., individuals, cohorts, geographic locations) and outcomes (e.g., treatments, times, items), where the goal is to learn a multivariate distribution for each unit-outcome entry, such as the distribution of a user's weekly spend and engagement under a specific mobile app version. A common challenge is the prevalence of data missing not at random: observations are available only for certain unit-outcome combinations, and the observed distributions can be correlated with properties of the distributions themselves, i.e., there is unobserved confounding. An additional challenge is that for any observed unit-outcome entry, we only have a finite number of samples from the underlying distribution. We tackle these two challenges by casting the problem into a novel distributional matrix completion framework and introducing a kernel-based distributional generalization of nearest neighbors to estimate the underlying distributions. By leveraging maximum mean discrepancies and a suitable factor model on the kernel mean embeddings of the underlying distributions, we establish consistent recovery of the underlying distributions even when data is missing not at random and positivity constraints are violated. Furthermore, we demonstrate that our nearest neighbors approach is robust to heteroscedastic noise, provided we have access to two or more measurements for the observed unit-outcome entries, a robustness not present in prior works on nearest neighbors with single measurements.
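As a rough illustration of the two ingredients the abstract combines (this is not the authors' implementation, and all names are hypothetical), one can compute a maximum mean discrepancy between the empirical samples of two unit-outcome entries and then use those distances to select nearest-neighbor entries whose pooled samples estimate a missing distribution.

# Illustrative sketch: MMD distance between empirical samples plus a
# nearest-neighbor rule over unit-outcome entries.
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    # Gram matrix of the Gaussian (RBF) kernel between sample sets x (n, d) and y (m, d).
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(x: np.ndarray, y: np.ndarray, bandwidth: float = 1.0) -> float:
    # Biased estimate of the squared maximum mean discrepancy between the
    # empirical distributions of x and y.
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

def kernel_nearest_neighbors(target: np.ndarray, candidates: list, k: int = 5) -> list:
    # Indices of the k candidate entries closest to the target entry in MMD;
    # their pooled samples can serve as an estimate of the missing distribution.
    dists = [mmd2(target, c) for c in candidates]
    return list(np.argsort(dists)[:k])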