Session: Stochastic Gradient Descent (SGD) and Variants
Chair: Melinda Hagedorn
Talk 1: Optimized convergence of stochastic gradient descent by weighted averaging
Speaker: Melinda Hagedorn
Abstract: Under mild assumptions, stochastic gradient methods asymptotically achieve an optimal rate of convergence if the arithmetic mean of all iterates is returned as an approximate optimal solution. However, in the absence of stochastic noise, the arithmetic mean of all iterates converges to the optimal solution considerably more slowly than the iterates themselves. Moreover, even in the presence of noise, when the stochastic gradient method is terminated after a finite number of steps, the arithmetic mean is not necessarily the best possible approximation to the unknown optimal solution. This paper aims at identifying optimal strategies in a particularly simple case: the minimization of a strongly convex function with i.i.d. noise terms and termination after a finite number of steps. Explicit formulas for the stochastic error and the optimality error are derived in dependence on certain parameters of the SGD method. The aim was to choose the parameters such that both the stochastic error and the optimality error are reduced compared to arithmetic averaging. This aim could not be achieved; however, by allowing a slight increase of the stochastic error, it was possible to select the parameters such that the optimality error is significantly reduced. This reduction of the optimality error has a strong effect on the approximate solution generated by the stochastic gradient method when only a moderate number of iterations is used or when the initial error is large. The numerical examples confirm the theoretical results and suggest that a generalization to non-quadratic objective functions may be possible. This paper was written together with Professor Florian Jarre and has been published in Optimization Methods and Software 39 (2024), no. 4, pp. 699–724.
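A minimal numerical sketch of the setting described above: SGD on a strongly convex quadratic with i.i.d. noise, comparing the last iterate, the arithmetic mean, and a weighted average of the iterates. The linearly increasing weights and the step-size schedule are illustrative assumptions, not the paper's exact parameter choices.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, steps, noise_std = 10, 1000, 0.1
A = np.diag(np.linspace(1.0, 10.0, dim))   # f(x) = 0.5 * x^T A x, minimizer x* = 0
x = 5.0 * np.ones(dim)                      # large initial error
iterates = []
for k in range(steps):
    grad = A @ x + noise_std * rng.standard_normal(dim)  # stochastic gradient
    x = x - grad / (10.0 + k)                            # decaying step size (illustrative)
    iterates.append(x.copy())

X = np.stack(iterates)
w = np.arange(1, steps + 1, dtype=float)    # linearly increasing weights (an assumption)
print("last iterate error :", np.linalg.norm(X[-1]))
print("arithmetic mean err:", np.linalg.norm(X.mean(axis=0)))
print("weighted mean err  :", np.linalg.norm((w[:, None] * X).sum(axis=0) / w.sum()))
```

Since the minimizer is the origin, each printed norm is the corresponding optimality error; down-weighting the early, heavily biased iterates typically reduces the error of the average when the initial error is large.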
Talk 2: Linear Convergence Rate in Convex Setup is Possible! Gradient Descent Method Variants under (L0, L1)-Smoothness
Speaker: Aleksandr Lobanov
Abstract: The gradient descent (GD) method is a fundamental and likely the most popular optimization algorithm in machine learning (ML), with a history tracing back to Cauchy's paper of 1847. In this paper, we provide an improved convergence analysis of gradient descent and its variants under generalized $(L_0, L_1)$-smoothness. In particular, we show that GD has the following convergence behavior in the \textit{convex setup}: as long as $\|\nabla f(x^k)\| \geq \frac{L_0}{L_1}$, the algorithm converges \textit{linearly}; once the iterates approach the solution $x^*$ so that $\|\nabla f(x^k)\| < \frac{L_0}{L_1}$, it has the standard sublinear rate. Moreover, we show that this convergence behavior is shared by its variants using different types of oracle: \textit{Normalized Gradient Descent} and \textit{Clipped Gradient Descent} (where the oracle has access to the full gradient $\nabla f(x)$); \textit{Random Coordinate Descent} (where the oracle has access only to a gradient component $\nabla_{i} f(x)$); and \textit{Random Coordinate Descent with Order Oracle} (where the oracle has access only to the comparison value $\text{sign} [f(y) - f(x)]$). In addition, we analyze the convergence behavior of the GD algorithm in the strongly convex setup. Finally, we validate our theoretical results via numerical experiments. https://arxiv.org/pdf/2412.17050
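A minimal sketch of Clipped Gradient Descent under assumed $(L_0, L_1)$-smoothness, illustrating the two-phase behavior described in the abstract: while the gradient norm is at least $L_0/L_1$ the step behaves like a normalized step (the linear-rate phase), and below that threshold it behaves like plain GD with step $\approx 1/L_0$ (the sublinear phase). The specific step-size rule $\eta_k = 1/(L_0 + L_1\|\nabla f(x^k)\|)$ is a standard choice for this setting, not necessarily the paper's exact constants.

```python
import numpy as np

def clipped_gd(grad, x0, L0, L1, iters=500):
    """GD with the (L0, L1)-smoothness step size eta_k = 1 / (L0 + L1 * ||grad||)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        eta = 1.0 / (L0 + L1 * gnorm)  # ~1/(L1*||g||) in the large-gradient (linear) phase,
        x = x - eta * g                #  ~1/L0 in the small-gradient (sublinear) phase
    return x

# Example objective f(x) = ||x||^4: (L0, L1)-smooth, but not globally L-smooth
grad = lambda x: 4.0 * np.dot(x, x) * x
x_final = clipped_gd(grad, x0=5.0 * np.ones(3), L0=1.0, L1=4.0)
print(np.linalg.norm(grad(x_final)))   # gradient norm after the run
```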
Talk 3: High Probability Guarantees for Random Reshuffling
Speaker: Hengxu Yu
Abstract: We consider the stochastic gradient method with random reshuffling (RR) for tackling smooth nonconvex optimization problems. RR finds broad applications in practice, notably in training neural networks. In this work, we provide high probability first-order and second-order complexity guarantees for this method. First, we establish a high probability first-order sample complexity result for driving the Euclidean norm of the gradient (without taking expectation) below a required accuracy. Our derived complexity matches the best existing in-expectation one up to a logarithmic term, while neither imposing additional assumptions nor changing RR's update rule. We then propose a simple and computable stopping criterion for RR (denoted as RR-sc). This criterion is guaranteed to be triggered after a finite number of iterations, enabling us to prove a high probability first-order complexity guarantee for the last iterate. Second, building on the proposed stopping criterion, we design a perturbed random reshuffling method (p-RR) that involves an additional randomized perturbation procedure near stationary points. We show that p-RR provably escapes strict saddle points and establish a high probability second-order complexity result, without requiring any sub-Gaussian tail-type assumptions on the stochastic gradient errors. The fundamental ingredient in deriving the aforementioned results is a new concentration property for sampling without replacement in RR, which could be of independent interest. Finally, we conduct numerical experiments on neural network training to support our theoretical findings. The full preprint, currently under revision for SIOPT, is available at https://arxiv.org/abs/2311.11841
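A minimal sketch of SGD with random reshuffling on a finite-sum problem: each epoch visits every component gradient exactly once in a fresh random order (sampling without replacement). The full-gradient stopping check below is a simplified stand-in for illustration only; the paper's RR-sc criterion is a different, computable rule developed in the preprint.

```python
import numpy as np

def random_reshuffling(grads, x0, step=0.01, epochs=100, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = len(grads)
    for _ in range(epochs):
        for i in rng.permutation(n):         # one full pass, without replacement
            x = x - step * grads[i](x)
        full_grad = sum(g(x) for g in grads) / n
        if np.linalg.norm(full_grad) < tol:  # simplified stopping check (not RR-sc)
            break
    return x

# Example: least squares split into n component objectives
rng = np.random.default_rng(1)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grads = [lambda x, a=A[i], bi=b[i]: (a @ x - bi) * a for i in range(20)]
print(random_reshuffling(grads, x0=np.zeros(5)))
```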