Statistical inference for model parameters in stochastic gradient descent

02 Feb, 2024, 3:00-4:30 pm, GHC 8102

Speaker: Selina Carter

Abstract: The main paper is Chen et al. (2021): https://arxiv.org/abs/1610.08637 Usually when we think of gradient descent algorithms, we seek an estimator for the "true parameter" theta that minimizes the population loss function. Often (especially in machine learning applications) we are satisfied with a point estimate with convergence guarantees. But can we use SGD to produce a confidence interval for theta? Yes. The idea is the following: "averaged SGD" (ASGD), also known as Polyak-Ruppert averaging, is known to have desirable asymptotic properties (Polyak-Juditsky 1992): not only does it converge to the population-minimizing parameter under a convex objective function (given some key assumptions), it is also asymptotically normally distributed. However, this result isn't useful for inference unless we can also estimate the asymptotic variance in an online fashion. Chen et al. (2021) propose two methods to do so. I will cover the following background:
(1) GD vs SGD vs ASGD
(2) martingales and the martingale CLT
(3) a sketch of the main proof of the Polyak-Juditsky (1992) asymptotic normality result for ASGD
(4) Chen et al.'s (2021) estimators for conducting inference via ASGD
(5) if we have time, other useful notes on finite-sample inference via ASGD
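To make the idea concrete, here is a minimal sketch of ASGD with a plug-in confidence interval, specialized to least-squares linear regression. This is an illustration under simplifying assumptions (synthetic Gaussian data, a hand-picked step-size schedule eta_t ~ t^{-0.505}), not the authors' implementation; all variable names are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = x @ theta_star + noise.
# theta_star, n, d, and the step-size constants are illustrative choices.
n, d = 50_000, 3
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ theta_star + rng.normal(size=n)

theta = np.zeros(d)       # current SGD iterate
theta_bar = np.zeros(d)   # Polyak-Ruppert running average of the iterates

for t in range(n):
    eta = 0.5 * (t + 1) ** -0.505           # step size eta_t ~ t^{-alpha}, alpha in (1/2, 1)
    grad = (X[t] @ theta - y[t]) * X[t]     # gradient of the single-sample loss 0.5*(x'theta - y)^2
    theta = theta - eta * grad
    theta_bar += (theta - theta_bar) / (t + 1)  # online update of the average

# Plug-in estimate of the asymptotic covariance A^{-1} S A^{-1} / n,
# in the spirit of Chen et al.'s plug-in estimator, here written in batch
# form for least squares (A = E[xx'], S = E[eps^2 xx']).
resid = y - X @ theta_bar
A_hat = X.T @ X / n
S_hat = (X * (resid ** 2)[:, None]).T @ X / n
A_inv = np.linalg.inv(A_hat)
cov = A_inv @ S_hat @ A_inv / n
half_width = 1.96 * np.sqrt(np.diag(cov))   # 95% CI: theta_bar +/- half_width

print(theta_bar, half_width)
```

The averaged iterate theta_bar should land close to theta_star, and the per-coordinate half-widths shrink at the usual root-n rate. Chen et al.'s second method (batch means) avoids forming A and S explicitly, which matters when the Hessian is expensive; this sketch uses the plug-in route because it is the shortest to state for least squares.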