Welcome to the homepage of the **Statistics and Machine Learning Reading Group** at **Carnegie Mellon University**!

We are a group of faculty and students in Statistics and Machine Learning, broadly interested in research at the intersection of these two disciplines.

Unless otherwise notified, our regular meeting is on **Tues 3:30-5pm in GHC-8102** every week (starting 9/4/2018). Please email *alnurali at gmail dot com* if you would like to join our mailing list.

If you would like to present in an upcoming meeting, please sign up here.

Topic choices are flexible. As a guideline, here is a list of interesting papers that we hope to read this semester.

**Tues 1/15, 3pm, Gates 8102**
**Speaker:** Aaditya Ramdas

**Abstract:**
Consider an infinite sequence of independent, uniformly chosen points from [0,1]^d. After looking at each point in the sequence, an overseer is allowed to either keep it or reject it, and this choice may depend on the locations of all previously kept points. However, the overseer must keep at least one of every two consecutive points. We call a sequence generated in this fashion a two-thinning sequence. Here, the purpose of the overseer is to control the discrepancy of the empirical distribution of points, that is, after selecting n points, to reduce the maximal deviation of the number of points inside any axis-parallel hyper-rectangle of volume A from nA. Our main result is an explicit low complexity two-thinning strategy which guarantees discrepancy of O(log^{2d+1}(n)) for all n with high probability [compare with \Theta(\sqrt{n log log n}) without thinning]. The case d=1 of this result answers a question of Benjamini.
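The setup above is easy to experiment with. Below is a toy one-dimensional simulation of a two-thinning sequence; the greedy "keep the point in the emptier bin" rule and the bin count are illustrative choices, not the low-complexity strategy from the paper.

```python
import random

def two_thinning(n_pairs, n_bins=16, seed=0):
    """Toy 1-d two-thinning: from each consecutive pair of uniform points,
    keep the one landing in the currently emptier bin (a simple greedy
    rule for illustration, not the strategy from the paper)."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    kept = []
    for _ in range(n_pairs):
        x, y = rng.random(), rng.random()
        # keep whichever point falls in the less-populated bin
        pick = x if counts[int(x * n_bins)] <= counts[int(y * n_bins)] else y
        counts[int(pick * n_bins)] += 1
        kept.append(pick)
    return kept

def discrepancy(points):
    """Max deviation |#{p <= t} - n*t| over interval endpoints t,
    i.e. the unnormalized 1-d star discrepancy."""
    pts, n = sorted(points), len(points)
    return max(max(abs((i + 1) - n * p), abs(i - n * p))
               for i, p in enumerate(pts))
```

Comparing `discrepancy(two_thinning(n))` against the discrepancy of `n` plain uniform draws gives a feel for how much power a single binary choice per pair buys.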

**Tues 10/16, 3:30pm, Gates 8102**
**Speaker:** Alnur Ali

**Abstract:**
We study the statistical properties of the iterates generated by gradient descent, applied to the fundamental problem of least squares regression. We take a continuous-time view, i.e., consider infinitesimal step sizes in gradient descent, in which case the iterates form a trajectory called gradient flow. In a random matrix theory setup, which allows the number of samples $n$ and features $p$ to diverge in such a way that $p/n \to \gamma \in (0,\infty)$, we derive and analyze an asymptotic risk expression for gradient flow. In particular, we compare the asymptotic risk profile of gradient flow to that of ridge regression. When the feature covariance is spherical, we show that the optimal asymptotic gradient flow risk is between 1 and 1.25 times the optimal asymptotic ridge risk. Further, we derive a calibration between the two risk curves under which the asymptotic gradient flow risk is no more than 2.25 times the asymptotic ridge risk, at all points along the path. We present a number of other results illustrating the connections between gradient flow and $\ell_2$ regularization, and numerical experiments that support our theory.
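For least squares started at zero, gradient flow has a closed form, which makes the comparison with ridge concrete. The sketch below implements both estimators; the pairing $\lambda = 1/t$ used in the test is a natural first calibration for illustration, not necessarily the calibration derived in the paper.

```python
import numpy as np

def gradient_flow(X, y, t):
    """Closed-form gradient flow iterate for least squares started at zero:
    beta(t) = V diag((1 - exp(-t s)) / s) V^T X^T y / n,
    where X^T X / n = V diag(s) V^T."""
    n = X.shape[0]
    s, V = np.linalg.eigh(X.T @ X / n)
    # the limit of (1 - exp(-t s)) / s as s -> 0 is t
    g = np.where(s > 1e-12, (1 - np.exp(-t * s)) / np.maximum(s, 1e-12), t)
    return V @ (g * (V.T @ (X.T @ y / n)))

def ridge(X, y, lam):
    """Ridge estimator (X^T X / n + lam I)^{-1} X^T y / n."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
```

As t grows, gradient flow recovers the least squares solution, mirroring how ridge does as lambda shrinks to zero; for small t, gradient flow is close to ridge with lambda = 1/t.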

**Tues 10/9, 3:30pm, Gates 8102**
**Speaker:** Ryan Tibshirani

**Abstract:**
I'll give a high-level intro to some basics in random matrix theory, primarily surrounding the Marchenko-Pastur law. Then I'll talk about how this and other more recent results in random matrix theory can be leveraged to understand the asymptotics of ridge regression in a very general and interesting way, due to Dobriban and Wager (2018) https://arxiv.org/pdf/1507.03003.pdf. Time permitting, I'll also mention some recent work by Alnur Ali, Zico Kolter, and myself, where we use similar techniques to develop an understanding of gradient flow (continuous-time gradient descent) and its relationship to ridge.
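The Marchenko-Pastur law is easy to see numerically: the eigenvalues of a sample covariance matrix with aspect ratio gamma = p/n concentrate on a fixed interval with an explicit limiting density. The snippet below is a small illustration; the choice gamma = 0.5 and the sample sizes are arbitrary.

```python
import numpy as np

def mp_density(x, gamma):
    """Marchenko-Pastur density for aspect ratio gamma = p/n, 0 < gamma <= 1,
    supported on [(1 - sqrt(gamma))^2, (1 + sqrt(gamma))^2]."""
    a, b = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
    inside = (x > a) & (x < b)
    dens = np.zeros_like(x)
    dens[inside] = (np.sqrt((b - x[inside]) * (x[inside] - a))
                    / (2 * np.pi * gamma * x[inside]))
    return dens

# Empirical eigenvalues of a sample covariance matrix concentrate on [a, b].
rng = np.random.default_rng(0)
n, p = 2000, 1000                      # gamma = p / n = 0.5
X = rng.standard_normal((n, p))
evals = np.linalg.eigvalsh(X.T @ X / n)
```

A histogram of `evals` against `mp_density` on a grid makes the agreement visually striking even at these moderate dimensions.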

**Tues 10/2, 3:30pm, Gates 8102**
**Speaker:** Asaf Weinstein

**Abstract:**
Suppose that you have a copy of a single issue of a medical journal whose policy is to publish 100 findings with p-values<0.05. You take a look at the first article and see a reported p-value of 0.03. What do you make of this information, being aware of the selection effect and remembering that you only have access to the published articles? A reasonable correction would be to estimate the fraction of nulls among reported findings with p-value<0.03. We discuss ideas for estimating this quantity from only the reported (truncated) p-values. Focusing attention on estimation rather than testing, we propose an empirical Bayes estimator for the mean effect size for the published findings. This can be considered an analogue of Efron’s (2012) methods for the case where only the selected are observed.
This is joint work with Amit Meir.
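The selection effect in the abstract is easy to replicate in a toy simulation: publish only p-values below 0.05 and ask what fraction of published findings are nulls. The null fraction and the Beta(0.1, 1) alternative distribution below are arbitrary illustrative choices, not from the paper.

```python
import numpy as np

# Toy simulation of the selection effect: only p-values below 0.05 are
# "published." The null fraction pi0 and the alternative p-value
# distribution Beta(0.1, 1) (which concentrates near zero) are
# illustrative assumptions.
rng = np.random.default_rng(0)
m, pi0 = 100_000, 0.9                  # number of tests, fraction of nulls
is_null = rng.random(m) < pi0
pvals = np.where(is_null, rng.random(m), rng.beta(0.1, 1.0, m))
published = pvals < 0.05

# Among published findings, the null fraction is far from pi0:
# selection sharply changes the mixture, which is exactly why
# naive inference on the published p-values is misleading.
frac_null_published = float(is_null[published].mean())
```

Even though 90% of all tests are nulls here, well under half of the published findings are, which is the kind of quantity the talk's estimator targets from truncated p-values alone.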

**Tues 9/25, 3:30pm, Gates 8102**
**Speaker:** Dylan Foster

**Abstract:**
At first glance, online learning and martingale inequalities may not appear to be intrinsically linked. We will showcase a recently discovered equivalence between existence of algorithms for online learning, martingale inequalities, and special "Burkholder" functions. Using this equivalence as a starting point, we define a notion of a sufficient statistic for online learning and use the Burkholder method---originally used to certify probabilistic martingale inequalities---to develop algorithms that only keep these sufficient statistics in memory. To demonstrate the power of the Burkholder method we introduce new efficient and adaptive algorithms for online learning, including an algorithm for matrix prediction that attains a regret bound corresponding to the variance term found in matrix concentration inequalities.
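For readers newer to the area, the basic object here is regret against the best fixed comparator. The sketch below is the standard online gradient descent baseline on squared losses, shown only to fix the setting the talk builds on; it is not the Burkholder-function construction from the paper.

```python
import math
import random

def ogd_regret(T=2000, seed=0):
    """Online gradient descent on losses l_t(w) = (w - z_t)^2 over [0, 1],
    with step size 1/sqrt(t). Returns cumulative loss minus the loss of
    the best fixed point in hindsight (the regret). This is the standard
    baseline setting, not the Burkholder method."""
    rng = random.Random(seed)
    z = [rng.random() for _ in range(T)]
    w, loss = 0.5, 0.0
    for t, zt in enumerate(z, start=1):
        loss += (w - zt) ** 2
        w -= 2 * (w - zt) / math.sqrt(t)     # gradient step
        w = min(1.0, max(0.0, w))            # project back onto [0, 1]
    best_fixed = sum((sum(z) / T - zt) ** 2 for zt in z)
    return loss - best_fixed
```

The classical analysis bounds this regret by O(sqrt(T)); the talk's equivalence recasts the existence of such guarantees as a martingale inequality certified by a Burkholder function.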

**Tues 9/18, 3:30pm, Gates 8102**
**Speaker:** Steven R. Howard

**Abstract:**
A confidence sequence is a sequence of confidence intervals that is uniformly valid over an unbounded time horizon. In this paper, we develop non-asymptotic confidence sequences under nonparametric conditions that achieve arbitrary precision. Our technique draws a connection between the classical Cramer-Chernoff method, the law of the iterated logarithm (LIL), and the sequential probability ratio test (SPRT)—our confidence sequences extend the first to produce time-uniform concentration bounds, provide tight non-asymptotic characterizations of the second, and generalize the third to nonparametric settings, including sub-Gaussian and Bernstein conditions, self-normalized processes, and matrix martingales. We strengthen and generalize existing constructions of finite-time iterated logarithm (“finite LIL”) bounds. We illustrate the generality of our proof techniques by deriving an empirical-Bernstein finite LIL bound as well as a novel upper LIL bound for the maximum eigenvalue of a sum of random matrices. Finally, we demonstrate the utility of our approach with applications to covariance matrix estimation and to estimation of the sample average treatment effect under the Neyman-Rubin potential outcomes model.
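A minimal concrete example of a time-uniform boundary is Robbins' normal mixture for sub-Gaussian sums, which is the kind of classical object this line of work generalizes. The sketch below implements that boundary; the tuning parameter rho and level alpha are arbitrary choices.

```python
import numpy as np

def mixture_boundary(t, alpha=0.05, rho=1.0):
    """Robbins-style normal-mixture boundary: for a sum S_t of 1-sub-Gaussian
    increments, Ville's inequality applied to the mixture supermartingale
    gives P(exists t >= 1: S_t >= u(t)) <= alpha, with
    u(t) = sqrt((t + 1/rho) * (log(1 + rho*t) + 2*log(1/alpha)))."""
    return np.sqrt((t + 1.0 / rho)
                   * (np.log1p(rho * t) + 2 * np.log(1 / alpha)))

# A (1 - 2*alpha) confidence sequence for the mean of 1-sub-Gaussian data:
# mu lies in [mean_t - u(t)/t, mean_t + u(t)/t] simultaneously for all t.
t = np.arange(1, 1001)
u = mixture_boundary(t)
```

The boundary u(t) grows like sqrt(t log t), so the interval width u(t)/t shrinks to zero while remaining valid at every time simultaneously, unlike a fixed-n Chernoff interval.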

**Tues 9/11, 3:30pm, GHC-8102**
**Speaker:** Jaehyeok Shin

**Abstract:**
From scientific experiments to online A/B testing, the previously observed data often affects how future experiments are performed, which in turn affects which data will be collected. Such adaptivity introduces complex correlations between the data and the collection procedure. In this paper, we prove that when the data collection procedure satisfies natural conditions, sample means of the data have systematic negative biases. As an example, consider an adaptive clinical trial where additional data points are more likely to be tested for treatments that show initial promise. Our surprising result implies that the average observed treatment effects would underestimate the true effects of each treatment. We quantitatively analyze the magnitude and behavior of this negative bias in a variety of settings. We also propose a novel debiasing algorithm based on selective inference techniques. In experiments, our method can effectively reduce bias and estimation error.
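The negative bias shows up in even the simplest adaptive allocation. Below is a toy replication with two arms of true mean zero and a greedy "sample the current leader" rule; the horizon, number of replications, and allocation rule are illustrative choices, not the paper's setup.

```python
import numpy as np

def greedy_bias(reps=4000, T=50, seed=0):
    """Two arms with true mean 0; play each once, then always play the arm
    with the higher current sample mean. Returns the average over reps of
    arm 0's final sample mean. The negative-bias phenomenon described in
    the abstract predicts this average is below the true mean of 0."""
    rng = np.random.default_rng(seed)
    biases = []
    for _ in range(reps):
        sums = np.zeros(2)
        cnts = np.zeros(2)
        for t in range(T):
            if t < 2:
                arm = t                        # play each arm once
            else:
                arm = int(np.argmax(sums / cnts))
            sums[arm] += rng.standard_normal()
            cnts[arm] += 1
        biases.append(sums[0] / cnts[0])       # sample mean minus true mean 0
    return float(np.mean(biases))
```

Intuitively, an arm that starts with an unluckily low draw stops being sampled and its low mean is frozen, while an arm that starts high keeps being sampled and regresses toward the truth, so the average sample mean lands below zero.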

**Tues 9/4, 3:30pm, GHC-8102**
**Speaker:** Aaditya Ramdas

**Abstract:**
This paper develops a class of exponential bounds for the probability that a martingale sequence crosses a time-dependent linear threshold. Our key insight is that it is both natural and fruitful to formulate exponential concentration inequalities in this way. We illustrate this point by recovering and improving many tail bounds for martingales, including classical inequalities (1960-80) by Bernstein, Bennett, Hoeffding, and Freedman; contemporary inequalities (1980-2000) by Shorack and Wellner, Pinelis, Blackwell, van de Geer, and de la Pena; and several modern inequalities (post-2000) by Khan, Tropp, Bercu and Touati, Delyon, and others. In each of these cases, we give the strongest and most general statements to date, quantifying the time-uniform concentration of scalar, matrix, and Banach-space-valued martingales, under a variety of nonparametric assumptions in discrete and continuous time. By choosing the optimal linear bound for a given time, we bridge the gap between existing line-crossing inequalities, the sequential probability ratio test, the Cramer-Chernoff method, self-normalized processes, and other parts of the literature.
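A small Monte Carlo check makes the line-crossing viewpoint concrete. For ±1 increments, exp(lam*S_t - lam^2*t/2) is a supermartingale, and Ville's inequality turns each choice of lam into a linear boundary; picking lam to make the line tightest at a target time t0 is the "optimal linear bound for a given time" idea. The specific parameters below are arbitrary.

```python
import numpy as np

def line_crossing_fraction(alpha=0.05, T=500, t0=250, walks=10000, seed=0):
    """For +-1 increments, exp(lam*S_t - lam^2 t/2) is a supermartingale,
    so Ville's inequality gives
        P(exists t: S_t >= log(1/alpha)/lam + lam*t/2) <= alpha
    for any lam > 0. Choosing lam = sqrt(2 log(1/alpha) / t0) makes the
    line tightest at t = t0. Returns the Monte Carlo fraction of walks
    that ever cross the line, which should be at most about alpha."""
    rng = np.random.default_rng(seed)
    lam = np.sqrt(2 * np.log(1 / alpha) / t0)
    t = np.arange(1, T + 1)
    line = np.log(1 / alpha) / lam + lam * t / 2
    steps = rng.choice([-1.0, 1.0], size=(walks, T))
    S = np.cumsum(steps, axis=1)
    return float(np.mean((S >= line).any(axis=1)))
```

Sweeping t0 traces out different lines whose lower envelope is a curved boundary, which is how these line-crossing bounds connect to the curved time-uniform boundaries elsewhere in the literature.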