Sgd Overparametrized
wc_sgd_overparametrized(L, mu, gamma, n; solver=Clarabel.Optimizer, verbose=true)Problem statement
Compute a PEPit worst-case guarantee for wc_sgd_overparametrized.
Consider the finite sum minimization problem
\[F_\star \triangleq \min_x \left\{F(x) \equiv \frac{1}{n} \sum_{i=1}^n f_i(x)\right\},\]
where $f_1, ..., f_n$ are $L$-smooth and $\mu$-strongly convex. In the sequel, we use the notation $\mathbb{E}$ for denoting the expectation over the uniform distribution of the index $i \sim \mathcal{U}\left([|1, n|]\right)$, e.g., $F(x)\equiv\mathbb{E}[f_i(x)]$. In addition, we assume a zero variance at the optimal point (which is denoted by $x_\star$):
\[\mathbb{E}\left[\|\nabla f_i(x_\star)\|^2\right] = \frac{1}{n} \sum_{i=1}^n \|\nabla f_i(x_\star)\|^2 = 0,\]
where the expectation $\mathbb{E}$ is taken over the uniform distribution of the index $i \sim \mathcal{U}\left([|1, n|]\right)$.
This kind of situations happens for example in machine learning in the interpolation regime, that is if there exists a model $x_\star$ such that the loss $\mathcal{L}$ on any observation $(z_i)_{i \in [|1, n|]}$, $\mathcal{L}(x_\star, z_i) = f_i(x_\star)$ is zero.
Performance metric
This code computes a worst-case guarantee for one step of the stochastic gradient descent (SGD) in expectation, for the distance to optimal point. That is, it computes the smallest possible $\tau(L, \mu, \gamma, n)$ such that
\[\mathbb{E}\left[\|x_1 - x_\star\|^2\right] \leqslant \tau(L, \mu, \gamma, n) \|x_0 - x_\star\|^2\]
is valid, where $x_1$ is the output of one step of SGD.
Algorithm
One iteration of SGD is described by:
\[\begin{aligned} \text{Pick random }i & \sim & \mathcal{U}\left([|1, n|]\right), \\ x_{t+1} & = & x_t - \gamma \nabla f_{i}(x_t), \end{aligned}\]
where $\gamma$ is a step-size.
Theoretical guarantee
An empirically tight one-iteration guarantee is provided in the code of PESTO [1]:
\[\mathbb{E}\left[\|x_1 - x_\star\|^2\right] \leqslant \left(\max(1 - \gamma\mu, L\gamma - 1\right)\right)^2 \|x_0-x_\star\|^2.\]
Note that we observe the guarantee does not depend on the number $n$ of functions for this particular setting, thereby implying that the guarantees are also valid for expectation minimization settings (i.e., when $n$ goes to infinity).
References
Empirically tight guarantee provided in code of [1]. Using SDPs for analyzing SGD-type method was proposed in [2, 3].
Arguments
L: smoothness or Lipschitz parameter, as used by the modeled class.mu: strong convexity or monotonicity parameter, as used by the modeled class.gamma: step-size parameter.n: number of iterations.solver: JuMP optimizer constructor used to solve the generated SDP.verbose: print example and solver progress information when true.
Returns
pepit_tau: worst-case valuetheoretical_tau: theoretical value
Julia usage
pepit_tau, theoretical_tau = wc_sgd_overparametrized(1.0, 0.1, 2.3, 5; solver=Clarabel.Optimizer, verbose=true)
## Returns approximately: (pepit_tau, theoretical_tau) = (1.69, 1.69)