Welcome to the homepage of the **Statistics and Machine Learning Working Group** at **Carnegie Mellon University**!

We are group of faculty and students in Statistics and Machine Learning broadly interested in research at the intersection of these two disciplines.

Unless otherwise notified, our regular meeting is on **Tuesdays at noon in GHC-8102** every week. Please email *vsadhana AT cs.cmu.edu* if you would like join our mailing list.

Supported graciously by Microsoft Research.

If you would like to present in an upcoming meeting, please signup here.

Topics of choice are flexible. As a guideline, here is a list of interesting papers that we hope to read this semster.

**Abstract:**
The estimation of time-varying networks for functional Magnetic Resonance Imaging (fMRI) data sets is of increasing importance and interest. In this work, we formulate the problem in a high-dimensional time series framework and introduce a data-driven method, namely Network Change Points Detection (NCPD), which detects change points in the network structure of a multivariate time series, with each component of the time series represented by a node in the network. NCPD is applied to various simulated data and a resting-state fMRI data set. This new methodology also allows us to identify common functional states within and across subjects. Finally, NCPD promises to offer a deep insight into the large-scale characterisations and dynamics of the brain. This is joint work with Ivor Cribben (Alberta School of Business).

**Abstract:**
I will continue the topic that I introduced last semester on computationally efficient experiment selection in regression models, with additional results and analysis. I will analyze a sampling based method and a greedy method, both of which attain provable near-optimal statistical performance in polynomial time while enjoy different finite sample performance.

**Abstract:**
The availability of high-dimensional data in statistical applications has created an urgent need for methodologies to pursue sparse and/or low rank models. To provide a proper amount of shrinkage, most statistical approaches use a grid search with a model comparison criterion to locate proper values of the regularization parameters. We study cross-validation for multivariate models where relevant features may lie in a low dimensional subspace. By cross-validating candidate projection-selection patterns instead of regularization parameters, we are able to link cross-validation to a class of information criteria. A scale-free rate calibration helps cross-validation achieve non-asymptotic optimality in prediction.

**Abstract:**
I present a theory of point and interval estimation for nonlinear functionals in parametric, semi-, and non-parametric models based on higher order influence functions. The theory reproduces many previous results, produces new non-root n results, and opens up the ability to perform optimal non-root n inference in complex high dimensional models. We present novel rate-optimal point and intervals estimators for various functionals of central importance to statistics and biostatistics in settings in which estimation at the expected root n rate is not possible, owing to the curse of dimensionality.

**Abstract:**
This paper studies how to capture dependency graph structures from real data which may not be multivariate Gaussian. Starting from marginal loss functions not necessarily derived from probability distributions, we use an additive over-parametrization with shrinkage to incorporate variable dependencies into the criterion. An iterative Gaussian graph learning algorithm is proposed with ease in implementation. Statistical analysis shows that with the error measured in terms of a proper Bregman divergence, the estimators have fast rate of convergence. Real-life examples in different settings are given to demonstrate the efficacy of the proposed methodology.

**Abstract:**
I’ll be presenting some of my recent work on statistical matching - the problem of estimating a joint distribution or properties of it when many pairs of random variables are never sampled simultaneously. It is well established that statistical matching encounters problems of non-identifiability unless certain assumptions, such as conditional independence of any pair of variables never sampled together, hold. In this work, we reexamine the conditions under which models can be identified in the statistical matching scenario. We show that latent variable models, such as factor analysis and latent trait models, while violating the typical conditional independence assumptions appealed to in statistical matching, are nonetheless, generically identifiable. Intuitively, generic identifiability establishes identifiability for sets of randomly generated parameters, and we present conditions establishing when latent variable models are generically identifiable and unidentifiable in the statistical matching scenario. I’ll also spend some time describing real neuroscience analyses which motivate the use of statistical matching. This is joint work with Byron Yu and Geoff Gordon that we are currently writing up (so no paper yet), and we would love to get your feedback.

**Abstract:**
Sketching techniques have become popular for scaling up machine
learning algorithms by reducing the sample size or dimensionality of
massive data sets, while still maintaining the statistical power of
big data. We study sketching from an optimization point of view. We
first show that the iterative Hessian sketch is an optimization
process with preconditioning, and develop accelerated iterative
Hessian sketch via the searching the conjugate direction; we then
establish primal-dual connections between the Hessian sketch and dual
random projection, and apply the preconditioned conjugate gradient
approach on the dual problem, which leads to the acclerated iterative
dual random projection methods. Finally to tackle the challenges from
both large sample size and high-dimensionality, we propose the
primal-dual sketch, which iteratively sketches the primal and dual
formulations.

Joint work with Jialei Wang, Jason D. Lee, Mehrdad Mahdavi and Nati Srebro

**Abstract:**
It is commonplace to encounter nonstationary data, of which the underlying generating process may change over time or across domains. The nonstationarity presents both challenges and opportunities for causal discovery. In this paper we propose a principled framework to handle nonstationarity, and develop methods to address three important questions. First, we propose an enhanced constraint-based method to detect variables whose local mechanisms are nonstationary and recover the skeleton of the causal structure over observed variables. Second, we present a way to determine some causal directions by taking advantage of information carried by changing distributions. Third, we develop a method for visualizing the nonstationarity of local mechanisms. Experimental results on various synthetic and real-world datasets are presented to demonstrate the efficacy of our methods.

Relevant paper: https://arxiv.org/abs/1509.08056

**Abstract:**
In many real world problems such as online advertisement, web search, movie recommendations, personalized medical treatment. One can only observe outcomes of the actions (be it ads, search results, movies or treatments) that were taken. Let the algorithm used to generate these actions be a “policy”, we consider the problem of **off-policy** evaluation, where we collect data using a policy and then we try to evaluate the performance of a different policy. In other word, this is to answer the “What-If” question: what if the other policy was deployed, what would the outcomes be?
In this talk, I will do the following:

- Formulate the problem as a minimax-estimation problem and states its minimax risk. The result suggests that the simplest possible approach: “Importance Sampling” a.k.a “Inverse Propensity Scoring” in some sense cannot be improved, if no model assumptions are made.
- Illustrate that the state-of-the-art “Doubly Robust” estimator that incorporates model-based approaches is strictly suboptimal.
- Describe a new adaptive estimator that overcomes the issues with “Doubly Robust”.

Relevant slides: http://www.cs.cmu.edu/~yuxiangw/docs/minimax_eval_talk.pdf

**Abstract:**
Continuous treatments (e.g., doses) arise often in practice, but many available causal effect estimators are limited by either requiring parametric models for the effect curve, or by not allowing doubly robust covariate adjustment. We develop a novel kernel smoothing approach that requires only mild smoothness assumptions on the effect curve, and still allows for misspecification of either the treatment density or outcome regression. We derive asymptotic properties and give a procedure for data-driven bandwidth selection. The methods are illustrated via simulation and in a study of the effect of nurse staffing on hospital readmissions penalties.

Preprint: http://arxiv.org/abs/1507.00747

**Abstract:**
In this talk I will introduce a framework for constructing goodness of fit tests in both low and high-dimensional linear models. The idea involves applying regression methods to the scaled residuals following either an ordinary least squares or Lasso fit to the data, and using some proxy for prediction error as the final test statistic. We call this family Residual Prediction (RP) tests. We show that simulation can be used to obtain the critical values for such tests in the low-dimensional setting, and demonstrate that some form of the parametric bootstrap can do the same when the high-dimensional linear model is under consideration. We show that RP tests can be used to test for significance of groups or individual variables as special cases, and here they compare favourably with state of the art methods, but we also argue that they can be designed to test for as diverse model misspecifications as heteroscedasticity and different types of nonlinearity.
This is joint work with Peter Bühlmann.

Preprint: http://www.statslab.cam.ac.uk/~rds37/papers/RPtests

**Abstract:**
Modern learning algorithms are often seen as prediction-only tools, meaning that the interpretability and intuition provided by a more traditional modeling approach are sacrificed in order to achieve superior predictions. In this talk, we argue that this black-box perspective need not always be the case and develop formal statistical inference procedures for predictions generated by supervised learning ensembles. Ensemble methods based on bootstrapping, such as bagging and random forests, usually improve the predictive accuracy of individual trees, but fail to provide a framework in which distributional results can be easily determined. Instead of aggregating full bootstrap samples, we consider predicting by averaging over trees built on subsamples of the training set and demonstrate that the resulting estimator takes the form of a U-statistic. As such, predictions for individual feature vectors are asymptotically normal, allowing for confidence intervals to accompany predictions. In practice, a subset of subsamples is used for computational speed; here our estimators take the form of incomplete U-statistics and equivalent results are derived. We further demonstrate that this setup provides a framework for testing the significance of features. Moreover, the internal estimation method we develop allows us to estimate the variance parameters and perform these inference procedures at no additional computational cost. Demonstrations are provided using data from the ebird project hosted at Cornell University.

Here is the link to the relevant paper: http://jmlr.org/papers/v17/14-168.html