Welcome to the homepage of the
Statistics and Machine Learning Reading Group at Carnegie Mellon University!

We are group of faculty and students in Statistics and Machine Learning broadly interested in research at the intersection of these two disciplines.

Unless otherwise notified, our regular weekly meeting this semester is Tues 3-4.30pm in GHC-8102 (starting 1/15/2019). If you would like to present in an upcoming meeting, please signup here. Topics of choice are flexible. As a guideline, the linked document contains a list of interesting papers that we hope to cover this semester. Please email pratik at cmu dot com if you would like join our mailing list.

### Talks:

#### Being Robust (in High Dimensions) Can Be Practical

Tues 1/29, 3pm, Gates 8102
Abstract: Robust estimation is much more challenging in high dimensions than it is in one dimension: Most techniques either lead to intractable optimization problems or estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal, up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and suddenly make high-dimensional robust estimation a realistic possibility.

#### The power of online thinning in reducing discrepancy

Tues 1/15, 3pm, Gates 8102
Abstract: Consider an infinite sequence of independent, uniformly chosen points from [0,1]^d. After looking at each point in the sequence, an overseer is allowed to either keep it or reject it, and this choice may depend on the locations of all previously kept points. However, the overseer must keep at least one of every two consecutive points. We call a sequence generated in this fashion a two-thinning sequence. Here, the purpose of the overseer is to control the discrepancy of the empirical distribution of points, that is, after selecting n points, to reduce the maximal deviation of the number of points inside any axis-parallel hyper-rectangle of volume A from nA. Our main result is an explicit low complexity two-thinning strategy which guarantees discrepancy of O(log^{2d+1}(n)) for all n with high probability [compare with \Theta(\sqrt{n log log n}) without thinning]. The case d=1 of this result answers a question of Benjamini.

#### A Continuous-Time View of Early Stopping for Least Squares Regression

Tues 10/16, 3:30pm, Gates 8102
Speaker: Alnur Ali
Abstract: We study the statistical properties of the iterates generated by gradient descent, applied to the fundamental problem of least squares regression. We take a continuous-time view, i.e., consider infinitesimal step sizes in gradient descent, in which case the iterates form a trajectory called gradient flow. In a random matrix theory setup, which allows the number of samples $n$ and features $p$ to diverge in such a way that $p/n \to \gamma \in (0,\infty)$, we derive and analyze an asymptotic risk expression for gradient flow. In particular, we compare the asymptotic risk profile of gradient flow to that of ridge regression. When the feature covariance is spherical, we show that the optimal asymptotic gradient flow risk is between 1 and 1.25 times the optimal asymptotic ridge risk. Further, we derive a calibration between the two risk curves under which the asymptotic gradient flow risk no more than 2.25 times the asymptotic ridge risk, at all points along the path. We present a number of other results illustrating the connections between gradient flow and $\ell_2$ regularization, and numerical experiments that support our theory.

#### Marchenko-Pastur Asymptotics and the Limiting Risk of Ridge Regression

Tues 10/9, 3:30pm, Gates 8102
Speaker: Ryan Tibshirani
Abstract: I'll give a high-level intro to some basic in random matrix theory, primarily surrounding the Marchenko-Pastur law. Then I'll talk about how this and other more recent results in random matrix theory can be leveraged to understand the asymptotics of ridge regression in a very general and interesting way, due to Dobriban and Wager (2018) https://arxiv.org/pdf/1507.03003.pdf. Time permitting, I'll also mention some recent work by Alnur Ali, Zico Kolter, and myself, where we use similar techniques to develop an understanding of gradient flow (continuous-time gradient descent) and its relationship to ridge.

#### On file-drawer problems and estimation with truncated data

Tues 10/2, 3:30pm, Gates 8102
Speaker: Asaf Weinstein
Abstract: Suppose that you have a copy of a single issue of a medical journal whose policy is to publish 100 findings with p-values<0.05. You take a look at the first article and see a reported p-value of 0.03. What do you make out of this information, being aware of the selection effect and remembering that you only have access to the published articles? A reasonable correction would be to estimate the fraction of nulls among reported findings with p-value<0.03. We discuss ideas for estimating this quantity from only the reported (truncated) p-values. Focusing attention on estimation rather than testing, we propose an empirical Bayes estimator for the mean effect size for the published findings. This can be considered an analogue of Efron’s (2012) methods for the case where only the selected are observed. This is joint work with Amit Meir.

#### Online Learning, Probabilistic Inequalities, and the Burkholder Method

Tues 9/25, 3:30pm, Gates 8102
Speaker: Dylan Foster
Abstract: At first glance, online learning and martingale inequalities may not appear to be intrinically linked. We will showcase a recently discovered equivalence between existence of algorithms for online learning, martingale inequalities, and special "Burkholder" functions. Using this equivalence as a starting point, we define a notion of a sufficient statistic for online learning and use the Burkholder method---originally used to certify probabilistic martingale inequalities---to develop algorithms that only keep these sufficient statistics in memory. To demonstrate the power of the Burkholder method we introduce new efficient and adaptive algorithms for online learning, including an algorithm for matrix prediction that attains a regret bound corresponding to the variance term found in matrix concentration inequalities.

#### Uniform, nonparametric, non-asymptotic confidence sequences

Tues 9/18, 3:30pm, Gates 8102
Speaker: Steven R. Howard
Abstract: A confidence sequence is a sequence of confidence intervals that is uniformly valid over an unbounded time horizon. In this paper, we develop non-asymptotic confidence sequences under nonparametric conditions that achieve arbitrary precision. Our technique draws a connection between the classical Cramer-Chernoff method, the law of the iterated logarithm (LIL), and the sequential probability ratio test (SPRT)—our confidence sequences extend the first to produce time-uniform concentration bounds, provide tight non-asymptotic characterizations of the second, and generalize the third to nonparametric settings, including sub-Gaussian and Bernstein conditions, self-normalized processes, and matrix martin- gales. We strengthen and generalize existing constructions of finite-time iterated logarithm (“finite LIL”) bounds. We illustrate the generality of our proof techniques by deriving an empirical-Bernstein finite LIL bound as well as a novel upper LIL bound for the maximum eigenvalue of a sum of random matrices. Finally, we demonstrate the utility of our approach with applications to covariance matrix estimation and to estimation of sample average treatment effect under the Neyman-Rubin potential outcomes model.

#### Why Adaptively Collected Data Have Negative Bias and How to Correct for It

Sep 11 (Tuesday) at 3:30pm in GHC-8102
Speaker: Jaehyeok Shin
Abstract: From scientific experiments to online A/B testing, the previously observed data often affects how future experiments are performed, which in turn affects which data will be collected. Such adaptivity introduces complex correlations between the data and the collection procedure. In this paper, we prove that when the data collection procedure satisfies natural conditions, then sample means of the data have systematic negative biases. As an example, consider an adaptive clinical trial where additional data points are more likely to be tested for treatments that show initial promise. Our surprising result implies that the average observed treatment effects would underestimate the true effects of each treatment. We quantitatively analyze the magnitude and behavior of this negative bias in a variety of settings. We also propose a novel debiasing algorithm based on selective inference techniques. In experiments, our method can effectively reduce bias and estimation error.

#### Exponential line-crossing inequalities

Sep 4 (Tuesday) at 3:30pm in GHC-8102