Welcome to the homepage of the **Statistics and Machine Learning Research Group** at **Carnegie Mellon University**!

We are a group of faculty and students in Statistics and Machine Learning, broadly interested in research at the intersection of these two disciplines.

Unless otherwise notified, our regular meeting is on **Wednesdays 2-3pm in GHC-8102** every week. Please email *vsadhana AT cs.cmu.edu* if you would like to join our mailing list.

If you would like to present in an upcoming meeting, please sign up here.

Topic choice is flexible. As a guideline, here is a list of interesting papers that we hope to read this semester.

**Abstract:**
A number of statistical estimation problems can be addressed by semidefinite programs (SDP). While SDPs are solvable in polynomial time using interior point methods, in practice generic SDP solvers do not scale well to high-dimensional problems. In order to cope with this problem, Burer and Monteiro proposed a non-convex rank-constrained formulation, which has good performance in practice but is still poorly understood theoretically.

In this paper we study the rank-constrained version of SDPs arising in MaxCut and in synchronization problems. We establish a Grothendieck-type inequality that proves that all the local maxima and dangerous saddle points are within a small multiplicative gap from the global maximum. We use this structural information to prove that SDPs can be solved within a known accuracy, by applying the Riemannian trust-region method to this non-convex problem, while constraining the rank to be of order one. For the MaxCut problem, our inequality implies that any local maximizer of the rank-constrained SDP provides a (1−1/(k−1))×0.878 approximation of the MaxCut, when the rank is fixed to k. We then apply our results to data matrices generated according to the Gaussian ℤ2 synchronization problem, and the two-groups stochastic block model with large bounded degree. We prove that the error achieved by local maximizers undergoes a phase transition at the same threshold as for information-theoretically optimal methods.

Link to paper: https://arxiv.org/abs/1703.08729
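
For intuition, here is a minimal sketch of the rank-constrained (Burer-Monteiro) formulation for MaxCut, using plain projected gradient ascent with random-hyperplane rounding. The paper itself analyzes a Riemannian trust-region method; the function name, step size, and rounding below are our own illustrative choices.

```python
import numpy as np

def burer_monteiro_maxcut(A, k=3, steps=500, lr=0.05, seed=0):
    """Rank-k Burer-Monteiro heuristic for the MaxCut SDP relaxation:
    maximize (1/4) <L, V V^T> over V with unit-norm rows, where
    L = D - A is the graph Laplacian. Projected gradient ascent is used
    here for simplicity (the paper uses Riemannian trust regions);
    the step size may need tuning for a given graph."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A
    V = rng.standard_normal((n, k))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    for _ in range(steps):
        V += lr * (L @ V)                              # ascent step on <L, VV^T>
        V /= np.linalg.norm(V, axis=1, keepdims=True)  # project rows back to sphere
    r = rng.standard_normal(k)                         # random-hyperplane rounding
    x = np.sign(V @ r)
    return x, 0.25 * x @ L @ x                         # cut vector and cut value
```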

**Abstract:**
I will discuss some recent theoretical results on Thompson sampling for multi-armed bandits. The discussion will be based on the following line of work (a toy implementation follows the links):

- http://www.jmlr.org/papers/volume17/14-087/14-087.pdf
- http://djrusso.github.io/docs/Learning_to_Optimize.pdf
- https://arxiv.org/abs/1403.5556
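
As a warm-up for the discussion, here is a minimal Thompson sampling loop for Bernoulli bandits with independent Beta(1, 1) priors. This is the standard textbook instance (names and parameters are illustrative), not code from any of the papers above.

```python
import numpy as np

def thompson_bernoulli(true_means, horizon=10_000, seed=0):
    """Thompson sampling for a Bernoulli multi-armed bandit with
    independent Beta(1, 1) priors on the arm means."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    successes = np.ones(k)   # Beta posterior alpha parameters
    failures = np.ones(k)    # Beta posterior beta parameters
    total_reward = 0.0
    for _ in range(horizon):
        # Sample a mean for each arm from its posterior, play the argmax
        theta = rng.beta(successes, failures)
        arm = int(np.argmax(theta))
        reward = rng.random() < true_means[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    regret = horizon * max(true_means) - total_reward
    return regret

print(thompson_bernoulli([0.3, 0.5, 0.55]))  # regret should be small
```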

**Abstract:**
In this talk, I will review some of the well known and less well known statistical results on neural networks. Specifically, I will present minimax rates and generalization error bounds for shallow neural networks, which appear to break the curse of dimensionality that other nonparametric estimators are known to suffer. In the first part of the talk, I will give a brief review of neural networks and present their universal approximation property as well as their minimax rates of convergence [1]. In the second part of the talk, I will introduce convex neural networks and their connections to RKHS to show how shallow neural networks with the popular rectified linear units (ReLUs) can achieve a sparsity-adaptive generalization error bound [2]. Time permitting, I will also mention some recent work that explains the effect of depth in deep learning.

- Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. *Annals of Statistics*, 27(5), 1564-1599.
- Bach, F. (2014). Breaking the curse of dimensionality with convex neural networks. Research report, INRIA Paris.

**Abstract:**
Cross-validation is one of the most popular model selection methods in statistics and machine learning. Despite its wide applicability, traditional cross-validation methods tend to select overfitting models, unless the ratio between the training and testing sample sizes is much smaller than conventional choices. We argue that such an overfitting tendency of cross-validation is due to the ignorance of the uncertainty in the testing sample. Starting from this observation, we develop a new, statistically principled inference tool based on cross-validation that takes into account the uncertainty in the testing sample. This new method outputs a small set of highly competitive candidate models containing the best one with guaranteed probability. As a consequence, our method can achieve consistent variable selection in a classical linear regression setting, for which existing cross-validation methods require unconventional split ratios. We demonstrate the performance of the proposed method in several simulated and real data examples.

Link: https://arxiv.org/abs/1703.07904
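
The exact procedure is in the paper; to convey the flavor of accounting for test-sample uncertainty, here is a toy sketch that keeps every model whose cross-validated loss is within one standard error of the best. This is the familiar one-standard-error heuristic, used here only to illustrate returning a *set* of competitive models rather than a single winner; it is not the paper's method.

```python
import numpy as np

def cv_candidate_set(cv_errors):
    """Toy illustration: given a (models x folds) matrix of per-fold CV
    losses, return the indices of models whose mean loss is within one
    standard error of the best, acknowledging the uncertainty in the
    held-out evaluation."""
    cv_errors = np.asarray(cv_errors)
    means = cv_errors.mean(axis=1)
    n_folds = cv_errors.shape[1]
    ses = cv_errors.std(axis=1, ddof=1) / np.sqrt(n_folds)
    best = int(np.argmin(means))
    return np.flatnonzero(means <= means[best] + ses[best])
```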

**Abstract:**
Every day, Stanford Hospital (SHC) orders some number of units of platelets from the Stanford Blood Center (SBC). Although SBC collects these units daily from donors, the units are held in testing for the first 2 days, after which they are released and can be used only for the next 3 days before they expire. Currently, the number of units ordered by SHC is based on historical daily averages. While there is rarely a shortage, many units are wasted over the year. Using information that is available about the patients and planned procedures in the hospital, we have constructed a supervised learning model and an ordering scheme that, in backtesting, reduces the number of units wasted by two-thirds without any shortages. This will potentially allow for optimized SBC donor collection and SHC ordering strategies, drastically reducing the cost of wastage without compromising patient care.

**Abstract:**
We show that given an estimate $\widehat{A}$ that is close to a general high-rank positive semidefinite (PSD) matrix $A$ in spectral norm, the simple truncated SVD of $\widehat{A}$ produces a multiplicative approximation of $A$ in Frobenius norm. This observation leads to many interesting results on general high-rank matrix estimation problems, including high-rank matrix completion, high-rank matrix denoising, and low-rank estimation of high-dimensional covariance matrices.

Link: https://arxiv.org/abs/1702.06861
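
The estimator itself is simple to state. Here is a small numpy sketch of rank-k truncation of a symmetric estimate via its eigendecomposition, plus a toy check; notation follows the abstract, and the function name and noise model are our own.

```python
import numpy as np

def truncated_svd(A_hat, k):
    """Rank-k truncation of a symmetric estimate A_hat: keep the k
    largest eigenvalues (equivalent to the truncated SVD for PSD
    matrices)."""
    w, U = np.linalg.eigh(A_hat)
    top = np.argsort(w)[::-1][:k]
    return (U[:, top] * w[top]) @ U[:, top].T

# Toy check: a high-rank PSD matrix observed with symmetric noise.
rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200))
A = B @ B.T / 200                                  # PSD, full rank
E = rng.standard_normal((200, 200))
E = (E + E.T) / np.sqrt(2 * 200)                   # symmetric noise
A_hat = A + 0.5 * E
err = np.linalg.norm(truncated_svd(A_hat, 20) - A) / np.linalg.norm(A)
print(err)  # Frobenius-norm relative error after truncation
```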

Ed will present this recent paper by Victor Chernozhukov et al.

**Abstract:**
Most modern supervised statistical/machine learning (ML) methods are explicitly designed to solve prediction problems very well. Achieving this goal does not imply that these methods automatically deliver good estimators of causal parameters. Examples of such parameters include individual regression coefficients, average treatment effects, average lifts, and demand or supply elasticities. In fact, estimates of such causal parameters obtained via naively plugging ML estimators into estimating equations for such parameters can behave very poorly due to regularization bias. Fortunately, this regularization bias can be removed by solving auxiliary prediction problems via ML tools. Specifically, we can form an orthogonal score for the target low-dimensional parameter by combining auxiliary and main ML predictions. The score is then used to build a de-biased estimator of the target parameter which typically will converge at the fastest possible $1/\sqrt{n}$ rate and be approximately unbiased and normal, and from which valid confidence intervals for these parameters of interest may be constructed. The resulting method thus could be called a "double/de-biased ML" method because it relies on estimating primary and auxiliary predictive models to overcome regularization biases. In order to avoid overfitting, our construction also makes use of K-fold sample splitting, which we call cross-fitting. This allows us to use a very broad set of ML predictive methods in solving the auxiliary and main prediction problems, such as random forests, lasso, ridge, deep neural nets, boosted trees, as well as various hybrids and aggregators of these methods.
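
To make the cross-fitting construction concrete, here is a hedged sketch of double ML for the partially linear model $Y = \theta D + g(X) + \varepsilon$, with random forests standing in for the generic ML learners. This is a standard instance of the method rather than the paper's full generality; the function name and learner choice are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def double_ml_plr(X, D, Y, n_folds=5, seed=0):
    """Double/de-biased ML for the partially linear model
    Y = theta * D + g(X) + eps. Nuisance regressions E[Y|X] and E[D|X]
    are fit out-of-fold (cross-fitting), and theta is estimated by
    residual-on-residual regression (the orthogonal score)."""
    y_res = np.zeros(len(Y))
    d_res = np.zeros(len(D))
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        ml_y = RandomForestRegressor(random_state=seed).fit(X[train], Y[train])
        ml_d = RandomForestRegressor(random_state=seed).fit(X[train], D[train])
        y_res[test] = Y[test] - ml_y.predict(X[test])   # out-of-fold residuals
        d_res[test] = D[test] - ml_d.predict(X[test])
    theta = (d_res @ y_res) / (d_res @ d_res)
    # Sandwich-style standard error for the orthogonal score
    se = np.sqrt(np.mean((y_res - theta * d_res) ** 2 * d_res ** 2)) \
         / (np.mean(d_res ** 2) * np.sqrt(len(Y)))
    return theta, se
```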

**Abstract:**
I'll present tools for statistical inference conditioned on model selection events defined by the generalized lasso regularization path of Tibshirani & Taylor (2011) [1], using recent advances in post-selection inference from Lee et al. (2016) [2] and Tibshirani et al. (2016) [3]. These tools allow for exact hypothesis tests and confidence intervals for linear contrasts of the underlying mean vector, conditioned on any model selection event along the generalized lasso path (assuming Gaussian errors in the observations). Models covered by the generalized lasso include the fused lasso, trend filtering, and the graph fused lasso. In the fused lasso case, the underlying coordinates of the mean are assigned a linear ordering, and our framework allows us to test selectively chosen breakpoints or changepoints in these mean coordinates. This idea extends to changepoint models recovered by segmentation algorithms, to non-Gaussian errors, and to goodness-of-fit tests in place of the targeted tests of linear contrasts described above -- I'll describe these extensions if time permits. This is joint work with my advisors Ryan Tibshirani and Max G'sell [4].

- https://arxiv.org/pdf/1005.1971.pdf
- https://arxiv.org/pdf/1311.6238.pdf
- https://arxiv.org/pdf/1401.3889.pdf
- https://arxiv.org/pdf/1606.03552.pdf

**Abstract:**
I will discuss the problem of estimating structural equation models from high-dimensional Gaussian data with p ≫ n. The main difficulty in establishing statistical guarantees in this setting arises from the nonidentifiability, nonsmoothness, and nonconvexity of the underlying M-estimator (aka score-based estimator). I will discuss how to establish nonasymptotic deviation bounds on the estimation error, sparsity bounds, and model selection consistency for a penalized least squares estimator. The proofs rely on interpreting the graphical model as a recursive linear structural equation model, which reduces the estimation problem to a series of tractable neighbourhood regressions, allowing us to avoid making any assumptions regarding identifiability, irrepresentability, or faithfulness. The techniques employed here provide insight into and can be used more broadly for general nonidentifiable and nonconvex problems.
Reference: Mostly based on this preprint.

**Abstract:**
Machine learning (ML) has become one of the most powerful classes of tools for artificial intelligence, personalized web services, and data science problems across fields. However, the use of ML on sensitive data sets involving medical, financial, and behavioral data is greatly limited due to privacy concerns. In this talk, we consider the problem of statistical learning with privacy constraints. Under Vapnik's general learning setting and the formalism of differential privacy (DP), we establish simple conditions that characterize private learnability, which reveal a mixture of positive and negative insights. We then identify generic methods that reuse existing randomness to effectively solve private learning in practice, and discuss a weaker notion of privacy, on-average KL-privacy, that allows for an orders-of-magnitude more favorable privacy-utility tradeoff while preserving key properties of differential privacy. Moreover, we show that On-Average KL-Privacy is **equivalent** to generalization for a large class of commonly used tools in statistics and machine learning that sample from Gibbs distributions, a class of distributions that arises naturally from the maximum entropy principle. Finally, I will describe a few exciting future directions that use statistics/machine learning tools to advance the state of the art for privacy, and use privacy (and privacy-inspired techniques) to formally address the problem of p-hacking in scientific discovery. (A toy differentially private mechanism is sketched after the references below.)
**References:**

- Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. "Learning with Differential Privacy: Stability, Learnability and the Sufficiency and Necessity of ERM Principle." *Journal of Machine Learning Research* 17.183 (2016): 1-40.
- Yu-Xiang Wang, Stephen E. Fienberg, and Alexander J. Smola. "Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo." *ICML*, 2015.
- Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. "On-Average KL-Privacy and Its Equivalence to Generalization for Max-Entropy Mechanisms." *International Conference on Privacy in Statistical Databases*, Springer, 2016.

**Abstract:**
Phylogenetic tree inference, the task of reconstructing evolutionary histories, has important applications in biology, criminology, and public health. This talk presents a continuous space that models the set of all phylogenetic trees having a fixed set of leaves. This space has a natural metric of non-positive curvature, which permits a valid procedure for averaging or combining several trees whose leaves are identical. The talk also presents the limiting behavior of such averaged trees.

**Abstract:**
We consider the problem of estimating a function defined over $n$ locations on a $d$-dimensional grid (having all side lengths equal to $n^{1/d}$). When the function is constrained to have discrete total variation bounded by $C_n$, we derive the minimax optimal (squared) $\ell_2$ estimation error rate, parametrized by $n$ and $C_n$. Total variation denoising, also known as the fused lasso, is seen to be rate optimal. Several simpler estimators exist, such as Laplacian smoothing and Laplacian eigenmaps. A natural question is: can these simpler estimators perform just as well? We prove that these estimators, and more broadly all estimators given by linear transformations of the input data, are suboptimal over the class of functions with bounded variation. This extends fundamental findings of Donoho and Johnstone [1998] on 1-dimensional total variation spaces to higher dimensions. The implication is that the computationally simpler methods cannot be used for such sophisticated denoising tasks without sacrificing statistical accuracy. We also derive minimax rates for discrete Sobolev spaces over $d$-dimensional grids, which are, in some sense, smaller than the total variation function spaces. Indeed, these are small enough spaces that linear estimators can be optimal, and a few well-known ones are, such as Laplacian smoothing and Laplacian eigenmaps, as we show.

Relevant paper.
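
Laplacian smoothing, one of the linear estimators discussed above, is easy to write down: solve $(I + \lambda L)\hat\theta = y$, where $L$ is the grid-graph Laplacian. A small sparse-matrix sketch (our own illustration, not code from the paper):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def path_laplacian(m):
    """Laplacian of the path graph on m vertices."""
    d = np.full(m, 2.0)
    d[0] = d[-1] = 1.0
    off = -np.ones(m - 1)
    return sp.diags([off, d, off], [-1, 0, 1])

def laplacian_smoothing(y_grid, lam):
    """Laplacian smoothing on an m x m grid: solve (I + lam L) theta = y,
    where L is the 2-d grid Laplacian, built as a Kronecker sum of two
    path-graph Laplacians."""
    m = y_grid.shape[0]
    P, I = path_laplacian(m), sp.identity(m)
    L = sp.kron(P, I) + sp.kron(I, P)
    theta = spsolve((sp.identity(m * m) + lam * L).tocsc(), y_grid.ravel())
    return theta.reshape(m, m)
```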

**Abstract:**
Nearly all estimators in statistical prediction come with an associated tuning parameter, in one way or another. Common practice, given data, is to choose the tuning parameter value that minimizes a constructed estimate of the prediction error of the estimator. Of course, estimating prediction error has a long history in statistics, and many methods have been proposed for this problem; we focus on Stein’s unbiased risk estimator, or SURE (Stein, 1981; Efron, 1986), which forms an unbiased estimate of the prediction error by augmenting the observed training error with an estimate of the degrees of freedom of our estimator.

Parameter tuning via SURE minimization has been advocated by many authors, in a wide variety of problem settings. In general, it is natural to ask: what is the prediction error of the SURE-tuned estimator? The most obvious estimate for this quantity is the value of the SURE criterion at its minimum. However, this is no longer itself unbiased; in fact, we would expect the minimum of SURE to be systematically biased downwards as an estimate of the prediction error of the SURE-tuned estimator. We formally describe and study this bias.

This is based on work that is in progress, with Saharon Rosset.
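
A concrete instance of the phenomenon: SURE for soft-thresholding of a Gaussian mean vector, minimized over a grid of thresholds. The attained minimum is exactly the quantity the talk argues is biased downward as an estimate of the SURE-tuned estimator's risk. A minimal sketch (the simulated data and grid are our own illustration):

```python
import numpy as np

def sure_soft_threshold(y, lam, sigma=1.0):
    """Stein's unbiased risk estimate for soft-thresholding of
    y ~ N(mu, sigma^2 I): SURE = -n sigma^2 + sum_i min(y_i^2, lam^2)
    + 2 sigma^2 * df, where df = #{|y_i| > lam} is the degrees of
    freedom of the soft-threshold estimator."""
    n = len(y)
    df = np.sum(np.abs(y) > lam)
    return -n * sigma**2 + np.sum(np.minimum(y**2, lam**2)) + 2 * sigma**2 * df

# Tune the threshold by minimizing SURE; note the attained minimum is a
# downward-biased estimate of the tuned estimator's prediction error.
rng = np.random.default_rng(0)
mu = np.concatenate([np.full(20, 3.0), np.zeros(180)])
y = mu + rng.standard_normal(200)
grid = np.linspace(0.0, 4.0, 81)
sures = [sure_soft_threshold(y, lam) for lam in grid]
lam_hat = grid[int(np.argmin(sures))]
print(lam_hat, min(sures))
```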

**Abstract:**
A number of machine learning tasks rely on spectral properties of the graph Laplacian associated with the data. I will consider data points obtained as random samples from a measure on a manifold in $R^d$. I will discuss conditions under which the spectrum of the graph Laplacians on a neighborhood graph spanned by the samples converges almost surely to the spectrum of appropriate weighted Laplace-Beltrami operators on the manifold, as the sample size increases and the neighborhood size shrinks to zero. I will present error estimates for the convergence that explicitly depend on the geometry of the manifold, the number of data points available, and the size of the neighborhood used in the graph construction.
This is based on work that is in preparation.
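
A quick numerical illustration of the convergence in question (our own toy example, not from the talk): sample points uniformly from the unit circle, build an $\varepsilon$-neighborhood graph, and compare ratios of the low graph Laplacian eigenvalues with the Laplace-Beltrami spectrum of the circle; ratios sidestep the kernel- and $\varepsilon$-dependent scaling constants.

```python
import numpy as np

def graph_laplacian(X, eps):
    """Unnormalized Laplacian of the eps-neighborhood graph on rows of X."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = (D2 < eps ** 2).astype(float)
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

# Uniform samples on the unit circle; the Laplace-Beltrami spectrum is
# {0, 1, 1, 4, 4, 9, 9, ...}, so eigenvalue ratios are scale-free.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2 * np.pi, 1500)
X = np.column_stack([np.cos(t), np.sin(t)])
w = np.sort(np.linalg.eigvalsh(graph_laplacian(X, eps=0.2)))
print(w[1:8] / w[1])  # should be close to 1, 1, 4, 4, 9, 9, 16
```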

**Abstract:**
The estimation of time-varying networks for functional Magnetic Resonance Imaging (fMRI) data sets is of increasing importance and interest. In this work, we formulate the problem in a high-dimensional time series framework and introduce a data-driven method, namely Network Change Points Detection (NCPD), which detects change points in the network structure of a multivariate time series, with each component of the time series represented by a node in the network. NCPD is applied to various simulated data and a resting-state fMRI data set. This new methodology also allows us to identify common functional states within and across subjects. Finally, NCPD promises to offer a deep insight into the large-scale characterisations and dynamics of the brain. This is joint work with Ivor Cribben (Alberta School of Business).

**Abstract:**
I will continue the topic that I introduced last semester on computationally efficient experiment selection in regression models, with additional results and analysis. I will analyze a sampling-based method and a greedy method, both of which attain provably near-optimal statistical performance in polynomial time while enjoying different finite-sample performance.

**Abstract:**
The availability of high-dimensional data in statistical applications has created an urgent need for methodologies to pursue sparse and/or low rank models. To provide a proper amount of shrinkage, most statistical approaches use a grid search with a model comparison criterion to locate proper values of the regularization parameters. We study cross-validation for multivariate models where relevant features may lie in a low dimensional subspace. By cross-validating candidate projection-selection patterns instead of regularization parameters, we are able to link cross-validation to a class of information criteria. A scale-free rate calibration helps cross-validation achieve non-asymptotic optimality in prediction.

**Abstract:**
I present a theory of point and interval estimation for nonlinear functionals in parametric, semi-, and non-parametric models based on higher-order influence functions. The theory reproduces many previous results, produces new non-root-n results, and opens up the ability to perform optimal non-root-n inference in complex high-dimensional models. I present novel rate-optimal point and interval estimators for various functionals of central importance to statistics and biostatistics in settings in which estimation at the expected root-n rate is not possible, owing to the curse of dimensionality.

**Abstract:**
This paper studies how to capture dependency graph structures from real data that may not be multivariate Gaussian. Starting from marginal loss functions not necessarily derived from probability distributions, we use an additive over-parametrization with shrinkage to incorporate variable dependencies into the criterion. An iterative Gaussian graph learning algorithm that is easy to implement is proposed. Statistical analysis shows that, with the error measured in terms of a proper Bregman divergence, the estimators have a fast rate of convergence. Real-life examples in different settings are given to demonstrate the efficacy of the proposed methodology.

**Abstract:**
I’ll be presenting some of my recent work on statistical matching - the problem of estimating a joint distribution or properties of it when many pairs of random variables are never sampled simultaneously. It is well established that statistical matching encounters problems of non-identifiability unless certain assumptions, such as conditional independence of any pair of variables never sampled together, hold. In this work, we reexamine the conditions under which models can be identified in the statistical matching scenario. We show that latent variable models, such as factor analysis and latent trait models, while violating the typical conditional independence assumptions appealed to in statistical matching, are nonetheless, generically identifiable. Intuitively, generic identifiability establishes identifiability for sets of randomly generated parameters, and we present conditions establishing when latent variable models are generically identifiable and unidentifiable in the statistical matching scenario. I’ll also spend some time describing real neuroscience analyses which motivate the use of statistical matching. This is joint work with Byron Yu and Geoff Gordon that we are currently writing up (so no paper yet), and we would love to get your feedback.

**Abstract:**
Sketching techniques have become popular for scaling up machine
learning algorithms by reducing the sample size or dimensionality of
massive data sets, while still maintaining the statistical power of
big data. We study sketching from an optimization point of view. We
first show that the iterative Hessian sketch is an optimization
process with preconditioning, and develop accelerated iterative
Hessian sketch via the searching the conjugate direction; we then
establish primal-dual connections between the Hessian sketch and dual
random projection, and apply the preconditioned conjugate gradient
approach on the dual problem, which leads to the acclerated iterative
dual random projection methods. Finally to tackle the challenges from
both large sample size and high-dimensionality, we propose the
primal-dual sketch, which iteratively sketches the primal and dual
formulations.

Joint work with Jialei Wang, Jason D. Lee, Mehrdad Mahdavi, and Nati Srebro.
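
A minimal sketch of the iterative Hessian sketch viewed as preconditioned optimization, as in the first point of the abstract: each iteration solves a small sketched subproblem against the exact gradient. Gaussian sketches are used for simplicity, and the function name and parameters are our own.

```python
import numpy as np

def iterative_hessian_sketch(A, b, m, iters=10, seed=0):
    """Iterative Hessian sketch for least squares min ||Ax - b||^2.
    Each iteration draws a fresh Gaussian sketch S (m x n) and solves
    the small subproblem with sketched Hessian A^T S^T S A but exact
    gradient A^T (b - A x), i.e., a preconditioned update."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        SA = S @ A
        g = A.T @ (b - A @ x)              # exact gradient direction
        x = x + np.linalg.solve(SA.T @ SA, g)
    return x

# Sanity check against the exact least-squares solution.
rng = np.random.default_rng(1)
A = rng.standard_normal((5000, 50))
b = A @ rng.standard_normal(50) + 0.1 * rng.standard_normal(5000)
x_ihs = iterative_hessian_sketch(A, b, m=400)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x_ihs - x_ls))  # should be near zero
```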

**Abstract:**
It is commonplace to encounter nonstationary data, of which the underlying generating process may change over time or across domains. The nonstationarity presents both challenges and opportunities for causal discovery. In this paper we propose a principled framework to handle nonstationarity, and develop methods to address three important questions. First, we propose an enhanced constraint-based method to detect variables whose local mechanisms are nonstationary and recover the skeleton of the causal structure over observed variables. Second, we present a way to determine some causal directions by taking advantage of information carried by changing distributions. Third, we develop a method for visualizing the nonstationarity of local mechanisms. Experimental results on various synthetic and real-world datasets are presented to demonstrate the efficacy of our methods.

Relevant paper: https://arxiv.org/abs/1509.08056

**Abstract:**
In many real-world problems, such as online advertising, web search, movie recommendation, and personalized medical treatment, one can only observe outcomes of the actions (be they ads, search results, movies, or treatments) that were actually taken. Calling the algorithm used to generate these actions a "policy", we consider the problem of **off-policy** evaluation: we collect data using one policy and then try to evaluate the performance of a different policy. In other words, we want to answer the "what-if" question: had the other policy been deployed, what would the outcomes have been?
In this talk, I will do the following (a minimal sketch of the estimators involved appears after the slides link):

- Formulate the problem as a minimax estimation problem and state its minimax risk. The result suggests that the simplest possible approach, "Importance Sampling", a.k.a. "Inverse Propensity Scoring", in some sense cannot be improved if no model assumptions are made.
- Illustrate that the state-of-the-art “Doubly Robust” estimator that incorporates model-based approaches is strictly suboptimal.
- Describe a new adaptive estimator that overcomes the issues with “Doubly Robust”.

Relevant slides: http://www.cs.cmu.edu/~yuxiangw/docs/minimax_eval_talk.pdf
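
As promised above, here is a minimal sketch of the two baseline estimators, IPS and doubly robust, for logged bandit data with finitely many actions. The inputs `pi_logging`, `pi_target` (n x K matrices of action probabilities) and `r_hat` (an n x K fitted reward model) are our own illustrative conventions; the talk's adaptive estimator is not reproduced here.

```python
import numpy as np

def ips_estimate(actions, rewards, pi_logging, pi_target):
    """Importance sampling / inverse propensity scoring: reweight logged
    rewards by the ratio of target to logging action probabilities."""
    idx = np.arange(len(actions))
    w = pi_target[idx, actions] / pi_logging[idx, actions]
    return np.mean(w * rewards)

def doubly_robust_estimate(actions, rewards, pi_logging, pi_target, r_hat):
    """Doubly robust estimator: a fitted reward model r_hat serves as a
    control variate, with IPS correcting its residual errors."""
    idx = np.arange(len(actions))
    w = pi_target[idx, actions] / pi_logging[idx, actions]
    baseline = (pi_target * r_hat).sum(axis=1)      # model-based value
    return np.mean(baseline + w * (rewards - r_hat[idx, actions]))
```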

**Abstract:**
Continuous treatments (e.g., doses) arise often in practice, but many available causal effect estimators are limited by either requiring parametric models for the effect curve, or by not allowing doubly robust covariate adjustment. We develop a novel kernel smoothing approach that requires only mild smoothness assumptions on the effect curve, and still allows for misspecification of either the treatment density or outcome regression. We derive asymptotic properties and give a procedure for data-driven bandwidth selection. The methods are illustrated via simulation and in a study of the effect of nurse staffing on hospital readmissions penalties.

Preprint: http://arxiv.org/abs/1507.00747

**Abstract:**
In this talk I will introduce a framework for constructing goodness-of-fit tests in both low- and high-dimensional linear models. The idea involves applying regression methods to the scaled residuals following either an ordinary least squares or lasso fit to the data, and using some proxy for prediction error as the final test statistic. We call this family Residual Prediction (RP) tests. We show that simulation can be used to obtain the critical values for such tests in the low-dimensional setting, and demonstrate that some form of the parametric bootstrap can do the same when the high-dimensional linear model is under consideration. We show that RP tests can be used to test for significance of groups or individual variables as special cases, and here they compare favourably with state-of-the-art methods, but we also argue that they can be designed to test for model misspecifications as diverse as heteroscedasticity and different types of nonlinearity.
This is joint work with Peter Bühlmann.

Preprint: http://www.statslab.cam.ac.uk/~rds37/papers/RPtests

**Abstract:**
Modern learning algorithms are often seen as prediction-only tools, meaning that the interpretability and intuition provided by a more traditional modeling approach are sacrificed in order to achieve superior predictions. In this talk, we argue that this black-box perspective need not always be the case and develop formal statistical inference procedures for predictions generated by supervised learning ensembles. Ensemble methods based on bootstrapping, such as bagging and random forests, usually improve the predictive accuracy of individual trees, but fail to provide a framework in which distributional results can be easily determined. Instead of aggregating full bootstrap samples, we consider predicting by averaging over trees built on subsamples of the training set and demonstrate that the resulting estimator takes the form of a U-statistic. As such, predictions for individual feature vectors are asymptotically normal, allowing for confidence intervals to accompany predictions. In practice, a subset of subsamples is used for computational speed; here our estimators take the form of incomplete U-statistics and equivalent results are derived. We further demonstrate that this setup provides a framework for testing the significance of features. Moreover, the internal estimation method we develop allows us to estimate the variance parameters and perform these inference procedures at no additional computational cost. Demonstrations are provided using data from the eBird project hosted at Cornell University.

Here is the link to the relevant paper: http://jmlr.org/papers/v17/14-168.html
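
The core construction is easy to sketch: average trees grown on subsamples drawn *without replacement*, so the ensemble prediction is an (incomplete) U-statistic. The sketch below shows only the averaging step; valid confidence intervals additionally require the paper's internal variance estimator, which we do not reproduce here, and the function name and defaults are our own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def subsample_forest_predict(X, y, x_test, B=500, subsample=0.1, seed=0):
    """U-statistic-style ensemble: average predictions of B trees grown
    on random subsamples (without replacement) of the training set, the
    construction under which asymptotic normality is derived. Returns
    the ensemble prediction and the naive between-tree spread (which is
    NOT the paper's variance estimator)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    k = max(2, int(subsample * n))
    preds = np.empty(B)
    for b in range(B):
        idx = rng.choice(n, size=k, replace=False)   # subsample, not bootstrap
        tree = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])
        preds[b] = tree.predict(x_test.reshape(1, -1))[0]
    return preds.mean(), preds.std(ddof=1)
```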