Session: Fixed Point and Min-Max Theory in Data Science
Chairs: Ahmet Alacaoglu; Yura Malitsky; Stephen J. Wright
Cluster: Fixed Points and Variational Inequalities
Talk 1: Near-Optimal Sample Complexity for MDPs via Anchoring
Speaker: Roberto Cominetti
Abstract: We study a new model-free algorithm to compute $\varepsilon$-optimal policies for average reward Markov decision processes, in the weakly communicating case. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an $\varepsilon$-optimal policy with sample and time complexity $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\|h^*\|_{sp}^{2}/\varepsilon^{2})$, both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor of $\|h^*\|_{sp}$. Although the complexity bound involves the span seminorm $\|h^*\|_{sp}$ of the unknown bias vector, the algorithm requires no prior knowledge of this quantity and implements a stopping rule which guarantees, with probability 1, termination in finite time. We also analyze how these techniques can be adapted to discounted MDPs. Joint work with Mario Bravo and Jongmin Lee.
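For orientation, the following is a minimal sketch of Halpern's anchored iteration, the fixed-point scheme the talk builds on, applied here to the exact Bellman optimality operator of a small random discounted MDP. The MDP, its dimensions, and the iteration count are illustrative assumptions; the algorithm of the talk is model-free, replacing the exact operator with recursively sampled estimates and adding a stopping rule, neither of which is shown.

```python
# Halpern's anchored iteration on a toy discounted MDP (illustrative only:
# the talk's algorithm is model-free and uses sampled Bellman estimates).
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 20, 4, 0.95

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
r = rng.random((nS, nA))                        # rewards in [0, 1]

def bellman(v):
    """Bellman optimality operator: T(v)(s) = max_a r(s, a) + gamma * E[v(s')]."""
    return np.max(r + gamma * P @ v, axis=1)

x0 = np.zeros(nS)                               # the anchor
x = x0.copy()
for k in range(2000):
    beta = 1.0 / (k + 2)                        # anchoring weight, vanishing as O(1/k)
    x = beta * x0 + (1 - beta) * bellman(x)     # Halpern: blend anchor and operator step

print("fixed-point residual:", np.linalg.norm(bellman(x) - x, np.inf))
```

The anchoring term is what distinguishes this from plain fixed-point iteration: each iterate is pulled back toward the starting point $x_0$ with a weight that decays at rate $O(1/k)$, which is the mechanism behind the accelerated residual rates exploited in the talk.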
Talk 2: On the Convergence of Self-Play in Games with Bandit Feedback
Speaker: Haipeng Luo
Abstract: Self-play via online learning algorithms has emerged as a powerful paradigm for solving minimax problems and, more broadly, for training agents in complex multi-agent environments, underpinning some of the most striking successes in AI, such as AlphaGo and superhuman AI for poker. It enables agents to improve iteratively by interacting with copies of themselves, offering a natural path toward equilibria in games. Despite extensive study in recent years, however, the convergence of self-play dynamics is still not fully understood, especially when agents observe only the reward of the actions actually played (i.e., bandit feedback), a situation prevalent in many applications. In this talk, I will discuss two recent results on this front. In the first part, motivated by the well-known accelerated average-iterate convergence rate when learning with gradient feedback, I will address the question of whether acceleration is also possible under bandit feedback and provide an affirmative answer using a simple algorithm that enjoys instance-dependent convergence rates. In the second part, I will shift focus to the more challenging notion of last-iterate convergence and discuss a new uncoupled self-play dynamic with an improved convergence rate, leveraging a simple black-box reduction from last-iterate convergence to average-iterate convergence.
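To make the setting concrete, here is a minimal sketch of self-play under bandit feedback in a zero-sum matrix game: both players run vanilla Exp3 (a standard bandit algorithm, not the methods of the talk) and observe only the payoff of the action pair actually played. The game matrix, horizon, and learning-rate constants are illustrative assumptions; this baseline dynamic attains only the slow average-iterate rate, whereas the talk's results concern accelerated and last-iterate guarantees.

```python
# Self-play with bandit feedback: two Exp3 learners in a zero-sum matrix game.
# (Illustrative baseline only; not the accelerated dynamics of the talk.)
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((5, 5))            # payoffs in [0, 1]; row player maximizes x^T A y
T = 100_000
eta, gmix = 0.01, 0.01            # Exp3 learning rate and uniform exploration mix

wx, wy = np.ones(5), np.ones(5)
x_avg, y_avg = np.zeros(5), np.zeros(5)

for t in range(T):
    px = (1 - gmix) * wx / wx.sum() + gmix / 5
    py = (1 - gmix) * wy / wy.sum() + gmix / 5
    i = rng.choice(5, p=px)
    j = rng.choice(5, p=py)
    # Bandit feedback: only A[i, j] is revealed. Importance-weighted loss updates:
    wx[i] *= np.exp(-eta * (1 - A[i, j]) / px[i])   # row player's loss is 1 - payoff
    wy[j] *= np.exp(-eta * A[i, j] / py[j])         # column player's loss is the payoff
    wx /= wx.sum(); wy /= wy.sum()                  # renormalize to avoid underflow
    x_avg += px; y_avg += py

x_avg /= T; y_avg /= T
gap = (A @ y_avg).max() - (x_avg @ A).min()         # duality gap; zero at a Nash equilibrium
print("duality gap of average strategies:", gap)
```

The printed duality gap measures how exploitable the time-averaged strategies are; driving it to zero faster (and for the last iterate rather than the average) is precisely the subject of the talk.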
Talk 3: A First-Order Algorithm for Decentralised Min-Max Problems
Speaker: Matthew Tam
Abstract: In this work, we consider a connected network of finitely many agents working cooperatively to solve a min-max problem with convex-concave structure. We propose a decentralised first-order algorithm which can be viewed as combining features of two algorithms: PG-EXTRA for decentralised minimisation problems and the forward reflected backward method for (non-distributed) min-max problems. In each iteration of our algorithm, each agent computes the gradient of the smooth component of its local objective function as well as the proximal operator of its nonsmooth component, followed by a round of communication with its neighbours.
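For reference, here is a minimal sketch of one of the two building blocks named in the abstract: the forward reflected backward method in its non-distributed form, applied to the toy saddle-point problem $\min_x \max_y \tfrac{1}{2}\|x\|^2 + x^\top M y - \tfrac{1}{2}\|y\|^2$. The problem, dimensions, and step size are illustrative assumptions; the nonsmooth proximal terms and the PG-EXTRA-style decentralised communication of the talk's algorithm are omitted.

```python
# Forward reflected backward (FoRB) on a toy convex-concave saddle-point problem.
# (Non-distributed building block only; the decentralised features are not shown.)
import numpy as np

rng = np.random.default_rng(2)
n = 10
M = rng.standard_normal((n, n))

def F(x, y):
    """Monotone saddle operator F(x, y) = (grad_x f, -grad_y f)."""
    return x + M @ y, y - M.T @ x

L = 1 + np.linalg.norm(M, 2)          # Lipschitz constant of F
lam = 1 / (4 * L)                     # FoRB requires a step size below 1/(2L)

x, y = rng.standard_normal(n), rng.standard_normal(n)
Fx_old, Fy_old = F(x, y)
for k in range(5000):
    Fx, Fy = F(x, y)
    # Reflected forward step: z_{k+1} = z_k - lam * (2 F(z_k) - F(z_{k-1})).
    # With a nonsmooth term, a proximal (backward) step would wrap this update.
    x = x - lam * (2 * Fx - Fx_old)
    y = y - lam * (2 * Fy - Fy_old)
    Fx_old, Fy_old = Fx, Fy

print("distance to saddle point (0, 0):",
      np.hypot(np.linalg.norm(x), np.linalg.norm(y)))
```

The reflection term $2F(z_k) - F(z_{k-1})$ reuses the previous operator evaluation, so each iteration needs only one gradient call per agent; in the decentralised algorithm of the talk, this step is interleaved with a proximal step and a round of communication with neighbours.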