Session: Recent advances in optimization and statistical estimation on manifolds
Chair: Krishna Balasubramanian
Cluster: Optimization on Manifolds
Talk 1: Riemannian Coordinate Descent Algorithms on Matrix Manifolds
Speaker: Yunrui Guan
Abstract: Many machine learning applications are naturally formulated as optimization problems on Riemannian manifolds. The main idea behind Riemannian optimization is to maintain the feasibility of the variables while moving along a descent direction on the manifold, which results in updating all the variables at every iteration. In this work, we provide a general framework for developing computationally efficient coordinate descent (CD) algorithms on matrix manifolds that update only a few variables at every iteration while adhering to the manifold constraint. In particular, we propose CD algorithms for several manifolds, including the Stiefel, Grassmann, (generalized) hyperbolic, symplectic, and symmetric positive (semi)definite manifolds. While the per-iteration cost of the proposed CD algorithms is already low, we further develop a more efficient variant via a first-order approximation of the objective function. We analyze their convergence and complexity, and empirically illustrate their efficacy in several applications.
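To make the coordinate-update idea concrete, the sketch below shows one simple way to change only a few entries of a Stiefel-manifold variable while staying feasible, namely by applying a single Givens rotation. This is an illustrative toy under assumed choices (Givens pairs, grid search over the angle), not the algorithms presented in the talk.

```python
# Illustrative toy (not the speakers' algorithms): a coordinate-style update on
# the Stiefel manifold St(n, p) = {X in R^{n x p} : X^T X = I_p}. A full
# Riemannian gradient step moves the entire matrix and then retracts; here a
# single Givens rotation acting on rows i and j changes only those two rows of
# X while preserving X^T X = I exactly, so no retraction is needed.
import numpy as np

def givens_update(X, i, j, theta):
    """Rotate rows i and j of X by angle theta; orthogonality of X is preserved."""
    c, s = np.cos(theta), np.sin(theta)
    Xi, Xj = X[i].copy(), X[j].copy()
    X[i] = c * Xi - s * Xj
    X[j] = s * Xi + c * Xj
    return X

def coordinate_step(X, f, i, j, thetas=np.linspace(-0.5, 0.5, 41)):
    """Crude one-dimensional search over the rotation angle for the chosen pair."""
    vals = [f(givens_update(X.copy(), i, j, t)) for t in thetas]
    return givens_update(X, i, j, thetas[int(np.argmin(vals))])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 6, 3
    A = rng.standard_normal((n, n))
    A = A + A.T                                        # symmetric test matrix
    f = lambda X: -np.trace(X.T @ A @ X)               # leading invariant subspace
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # feasible starting point
    for _ in range(200):
        i, j = rng.choice(n, size=2, replace=False)
        X = coordinate_step(X, f, i, j)
    print("orthogonality error:", np.linalg.norm(X.T @ X - np.eye(p)))
    print("objective value:", f(X))
```

Because left-multiplication by an orthogonal Givens rotation preserves X^T X exactly, the partial update never leaves the manifold, which is the feasibility property the abstract emphasizes.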
Talk 2: Online covariance estimation for stochastic gradient descent under non-smoothness
Speaker: Abhishek Roy
Abstract: We investigate the online overlapping batch-means covariance estimator for Stochastic Gradient Descent (SGD) under non-smoothness and establish convergence rates. Our analysis overcomes significant challenges that arise from non-smoothness, which introduce additional error terms and require handling manifold structures in the solution path. Moreover, we establish convergence rates for the first four moments of the $\ell_2$ norm of the error of the SGD dynamics under non-smoothness, which may be of independent interest. Numerical simulations illustrate the practical performance of the proposed methodology.
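For reference, the classical overlapping batch-means construction underlying this line of work can be sketched as follows. The talk's estimator is online and tailored to the non-smooth setting; the offline form, batch size, and normalization below are standard textbook choices used only for illustration, not the speakers' estimator.

```python
# Schematic (offline) overlapping batch-means covariance estimate from SGD
# iterates. The talk concerns an online variant with batch schedules suited to
# non-smooth objectives; treat this only as a reference for the basic OBM idea.
import numpy as np

def obm_covariance(iterates, batch_size):
    """Classical overlapping batch-means estimate of the long-run covariance
    of the averaged iterate, using the usual OBM normalization."""
    n, d = iterates.shape
    b = batch_size
    theta_bar = iterates.mean(axis=0)
    # Means of the n - b + 1 overlapping batches, computed via a cumulative sum.
    csum = np.vstack([np.zeros(d), np.cumsum(iterates, axis=0)])
    batch_means = (csum[b:] - csum[:-b]) / b
    diffs = batch_means - theta_bar
    scale = n * b / ((n - b) * (n - b + 1))
    return scale * diffs.T @ diffs

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy SGD on f(x) = 0.5 * ||x||^2 with additive gradient noise.
    d, n, eta = 3, 20000, 0.05
    x = np.ones(d)
    trace = np.empty((n, d))
    for t in range(n):
        grad = x + rng.standard_normal(d)   # stochastic gradient
        x = x - eta * grad
        trace[t] = x
    print(obm_covariance(trace, batch_size=int(n ** 0.5)))
```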
Talk 3: Momentum Stiefel optimizer, with applications to suitably-orthogonal attention, and optimal transport
Speaker: Molei Tao
Abstract: The problem of optimization on the Stiefel manifold, i.e., minimizing functions of (not necessarily square) matrices that satisfy orthogonality constraints, has been extensively studied. Yet we propose a new approach based on, for the first time, an interplay between carefully designed continuous and discrete dynamics. It leads to a gradient-based optimizer with intrinsically added momentum. This method exactly preserves the manifold structure but does not require additional operations to keep the momentum in the changing (co)tangent space, and thus has low computational cost and good accuracy. Its generalization to adaptive learning rates is also demonstrated. Notable performance is observed in practical tasks. For instance, we found that placing orthogonality constraints on the attention heads of a trained-from-scratch Vision Transformer (Dosovitskiy et al., 2020) can markedly improve its performance when our optimizer is used, and that it is better to make each head orthogonal within itself rather than to the other heads. The optimizer also makes the useful notion of Projection Robust Wasserstein Distance (Paty and Cuturi, 2019; Lin et al., 2020) for high-dimensional optimal transport even more effective.
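For contrast, a conventional momentum recipe on the Stiefel manifold projects the momentum onto the tangent space and retracts at every step, which is exactly the kind of extra operation the proposed optimizer is designed to avoid. The sketch below shows that standard baseline (not the talk's method); the tangent projection and QR retraction are textbook formulas for the embedded metric.

```python
# Baseline sketch for contrast: conventional momentum on the Stiefel manifold
# St(n, p) projects the momentum back onto the tangent space at each step and
# retracts via QR. The talk's optimizer avoids these extra projection/transport
# operations; this is not that method.
import numpy as np

def tangent_project(X, G):
    """Project an ambient direction G onto the tangent space of St(n, p) at X."""
    XtG = X.T @ G
    return G - X @ (XtG + XtG.T) / 2

def qr_retract(Y):
    """Retract an ambient point back onto the Stiefel manifold via QR."""
    Q, R = np.linalg.qr(Y)
    return Q * np.sign(np.sign(np.diag(R)) + 0.5)  # fix column signs for uniqueness

def momentum_step(X, M, egrad, lr=1e-2, beta=0.9):
    G = tangent_project(X, egrad(X))       # Riemannian gradient at X
    M = tangent_project(X, beta * M) + G   # re-project momentum into the current tangent space
    return qr_retract(X - lr * M), M

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, p = 8, 2
    A = rng.standard_normal((n, n))
    A = A + A.T
    egrad = lambda X: -2 * A @ X                      # Euclidean gradient of -trace(X^T A X)
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # feasible start
    M = np.zeros((n, p))
    for _ in range(500):
        X, M = momentum_step(X, M, egrad)
    print("orthogonality error:", np.linalg.norm(X.T @ X - np.eye(p)))
```

The repeated projection and retraction in this baseline is the per-step overhead that the momentum Stiefel optimizer described in the talk dispenses with.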