Session: Frontiers of Optimization for Machine Learning - Part II
Chair: Vivak Patel
Cluster: Nonlinear Optimization
Talk 1: Online Statistical Inference via Matrix-Free Newton Methods
Speaker: Sen Na
Abstract: Given the ubiquity of streaming data, online algorithms have been widely used for parameter estimation, with second-order methods particularly standing out for their efficiency and robustness. In this talk, we introduce an online sketched Newton method that leverages a randomized sketching technique to perform an approximate Newton step in each iteration, thereby eliminating the computational bottleneck of second-order methods. While existing studies have established the asymptotic normality of sketched Newton methods, a consistent estimator of the limiting covariance matrix remains an open problem. We propose a fully online covariance matrix estimator that is constructed entirely from the Newton iterates and requires no matrix factorization. Compared to covariance estimators for first-order online methods, our estimator for second-order methods is batch-free. We establish the consistency and convergence rate of our estimator; coupled with the asymptotic normality results, this enables online statistical inference for the model parameters based on sketched Newton methods. We also discuss the extension of our estimator to constrained problems, and demonstrate its superior performance on regression problems as well as benchmark problems from the CUTEst set.
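The abstract does not specify the sketching distribution or step rule; as a minimal sketch of the core idea, the following illustrates a generic sketch-and-solve Newton step with a Gaussian sketch, which replaces the full d-dimensional Newton solve by an m x m one (the function name, sketch choice, and demo problem are ours, not the speaker's).

```python
import numpy as np

def sketched_newton_step(grad, hess, m, rng):
    """One sketch-and-solve Newton step (illustrative): restrict the Newton
    system H p = -g to a random m-dimensional subspace p = S y, where S is
    a Gaussian sketching matrix, and solve only the small sketched system."""
    d = grad.shape[0]
    S = rng.standard_normal((d, m)) / np.sqrt(m)        # d x m Gaussian sketch
    y = np.linalg.solve(S.T @ hess @ S, -(S.T @ grad))  # m x m solve only
    return S @ y                                        # approximate Newton direction

# Demo on a strongly convex quadratic f(x) = 0.5 x^T A x - b^T x,
# whose gradient is A x - b and whose Hessian is the constant matrix A.
rng = np.random.default_rng(0)
d = 20
M = rng.standard_normal((d, d))
A = M.T @ M / d + np.eye(d)      # well-conditioned SPD Hessian
b = rng.standard_normal(d)
f = lambda x: 0.5 * x @ A @ x - b @ x

x = np.zeros(d)
for _ in range(200):             # cheap iterations: each solves a 5x5 system
    x = x + sketched_newton_step(A @ x - b, A, m=5, rng=rng)
```

Each iteration strictly decreases f with probability one, and when m = d the sketched step recovers the exact Newton step; the covariance estimation discussed in the talk would be built on top of iterates like these.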
Talk 2: Optimization in Combinatorial and Non-Convex ML: Positive and Negative Results
Speaker: Jean Honorio
Abstract: Several modern machine learning (ML) problems are combinatorial and non-convex, and theoretical guarantees for them are quite limited. My long-term research goal is to uncover the general foundations of ML and optimization that drive empirical success. I aim to develop a set of optimization-theoretic frameworks and tools to bridge these gaps, furthering our understanding of continuous (possibly non-convex) relaxations of combinatorial problems as well as our knowledge of non-convexity. In this talk, I first focus on invex (non-convex) optimization problems, with some motivation from exploratory research on fairness and mixed linear regression. I present a generalization of gradient descent for the family of invex (non-convex) problems, which provably converges to the global minimum in polynomial time. In contrast, for general non-convex problems, I show that any gradient-based algorithm requires an exponential number of gradient evaluations to find the global minimum. Second, when stochastic gradients are biased, how can we obtain convergence to the global minimum of a convex function? I propose a provably convergent algorithm that requires more computational effort as it progresses through the gradient descent iterations; interestingly, more complex algorithms do not converge in this regime. Third, I discuss a difficult combinatorial problem over directed graphs with acyclicity constraints. Interestingly, using the statistical principle of identifiability, one can reduce the search space and derive provably correct sequential optimization algorithms. Finally, I focus on problems with high-order relationships, usually formulated as tensor optimization problems, and propose a convex conic relaxation. To this end, I carefully define the Carathéodory symmetric tensor cone and discuss its benefits in optimization.
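The invexity property mentioned here — every stationary point is a global minimizer — is what lets first-order methods succeed on a non-convex landscape. As a minimal sketch (plain gradient descent, not the speaker's generalized method; the example function is ours): f(x) = log(1 + x^2) is non-convex, since it is concave for |x| > 1, yet its only stationary point x = 0 is the global minimum, so driving the gradient to zero finds it.

```python
def grad_desc(grad, x0, step, iters):
    """Plain gradient descent (illustrative, not the talk's algorithm)."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# f(x) = log(1 + x^2): non-convex, but invex -- its unique stationary
# point x = 0 is the global minimizer.
grad = lambda x: 2.0 * x / (1.0 + x * x)   # f'(x); gradient Lipschitz constant 2
x_hat = grad_desc(grad, x0=3.0, step=0.4, iters=200)
```

With step size below 1/L = 0.5 the iterates descend monotonically and settle at the global minimizer, despite the non-convexity; for general non-convex functions no such guarantee exists, which is the contrast the abstract draws.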
Talk 3: Hessian-Aware Scaling of the Gradient Directions
Speaker: Fred Roosta
Abstract: Gradient descent is the primary workhorse for the optimization of large-scale, nonconvex problems in machine learning. However, its performance depends heavily on the step size, and because the gradient lacks a natural scaling, selecting one requires costly line searches or heuristic guess-and-check schemes. We propose an efficient scaling of gradient descent using a scalar informed by Hessian information. By carefully considering the curvature along the gradient direction, we demonstrate that Hessian-aware scaled gradient directions provide a local unit step size guarantee, even in the nonconvex setting. We extend this result to scenarios where the Hessian and gradient are stochastic. Additionally, we show global convergence of Hessian-aware scaled gradient descent under a significant weakening of the typical Lipschitz gradient smoothness assumption. We validate our results on large-scale machine learning problems and demonstrate that, through alternating scalings, we obtain an algorithm that converges rapidly across a variety of problems.
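The abstract does not give the exact scaling rule or its stochastic safeguards; a natural deterministic sketch of "curvature along the gradient direction" uses the scalar g^T g / (g^T H g), computable from a single Hessian-vector product with no full Hessian (function names and the positive-curvature assumption below are ours).

```python
import numpy as np

def hessian_scaled_gd(grad_f, hvp, x, iters):
    """Gradient descent with a Hessian-aware scalar step size (sketch).

    The step g^T g / (g^T H g) divides out the curvature along the
    gradient direction; only a Hessian-vector product is needed.
    Assumes positive curvature along -g; the talk's nonconvex and
    stochastic safeguards are not reproduced here."""
    for _ in range(iters):
        g = grad_f(x)
        curv = g @ hvp(x, g)          # curvature along the gradient direction
        x = x - (g @ g) / curv * g    # self-scaled gradient step
    return x

# Demo on an ill-conditioned quadratic f(x) = 0.5 x^T A x - b^T x:
# here the scaled step coincides with exact line search along -g.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x = hessian_scaled_gd(lambda x: A @ x - b, lambda x, v: A @ v,
                      np.zeros(2), iters=200)
```

On a quadratic this scaling makes a unit step optimal along the gradient direction, which is the deterministic analogue of the local unit step size guarantee described in the abstract.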