Welcome to the homepage of the **Statistics and Machine Learning Working Group** at **Carnegie Mellon University**!

We are group of faculty and students in Statistics and Machine Learning broadly interested in research at the intersection of these two disciplines.

Unless otherwise notified, our regular meeting is on **Weds 2-3pm in GHC-8102** every week. Please email *vsadhana AT cs.cmu.edu* if you would like join our mailing list.

If you would like to present in an upcoming meeting, please signup here.

Topics of choice are flexible. As a guideline, here is a list of interesting papers that we hope to read this semster.

**Abstract:**
I'll present about tools for statistical inference conditioned on model selection events that are defined by the generalized lasso regularization path in Tibshirani & Taylor (2011) [1], using recent advances in post-selection inference from Lee et al. (2016) [2], Tibshirani et al. (2016) [3]. These tools allow for exact hypothesis tests and confidence intervals for linear contrasts of the underlying mean vector, conditioned on any model selection event along the generalized lasso path (assuming Gaussian errors in the observations). Models covered by the generalized lasso included such as the fused lasso, trend filtering, and the graph fused lasso. In the fused lasso case, the underlying coordinates of the mean are assigned a linear ordering, and our framework allows us to test selectively chosen breakpoints or changepoints in these mean coordinates. This idea extends to changepoint models recovered by segmentation algorithms, and beyond Gaussian errors, and also to goodness-of-fit tests instead of targeted tests of linear contrasts as described above -- I'll describe these if time permits. This is work by me and my advisors Ryan Tibshirani and Max G'sell [4].

- https://arxiv.org/pdf/1005.1971.pdf
- https://arxiv.org/pdf/1311.6238.pdf
- https://arxiv.org/pdf/1401.3889.pdf
- https://arxiv.org/pdf/1606.03552.pdf

**Abstract:**
I will discuss the problem of estimating structural equation models from high-dimensional Gaussian data with p ≫ n. The main difficulty in establishing statistical guarantees in this setting arises from the nonidentifiability, nonsmoothness, and nonconvexity of the underlying M-estimator (aka score-based estimator). I will discuss how to establish nonasymptotic deviation bounds on the estimation error, sparsity bounds, and model selection consistency for a penalized least squares estimator. The proofs rely on interpreting the graphical model as a recursive linear structural equation model, which reduces the estimation problem to a series of tractable neighbourhood regressions, allowing us to avoid making any assumptions regarding identifiability, irrepresentability, or faithfulness. The techniques employed here provide insight into and can be used more broadly for general nonidentifiable and nonconvex problems.
Reference: Mostly based on this preprint.

**Abstract:**
Machine learning (ML) has become one of the most powerful classes of tools for artificial intelligence, personalized web services and data science problems across fields. However, the use of ML on sensitive data sets involving medical, financial and behavioral data are greatly limited due to privacy concern. In this talk, we consider the problem of statistical learning with privacy constraints. Under Vapnik's general learning setting and the formalism of differential privacy (DP), we establish simple conditions that characterizes the private learnability, which reveals a mixture of positive and negative insight. We then identify generic methods that reuse existing randomness to effectively solve private learning in practice; and discuss a weaker notion of privacy — on-avg KL-privacy — that allows for orders-of-magnitude more favorable privacy-utility tradeoff, while preserving key properties of differential privacy. Moreover, we show that On-Average KL-Privacy is **equivalent** to generalization for a large class of commonly-used tools in statistics and machine learning that sample from Gibbs distributions---a class of distributions that arises naturally from the maximum entropy principle. Finally, I will describe a few exciting future directions that use statistics/machine learning tools to advance he state-of-the-art for privacy, and use privacy (and privacy inspired techniques) to formally address the problem of p-hacking in scientific discovery.
**References:**

- Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. "Learning with differential privacy: Stability, learnability and the sufficiency and necessity of ERM principle." *Journal of Machine Learning Research* 17.183 (2016): 1-40. [PDF] jmlr.org
- Yu-Xiang Wang, Stephen E. Fienberg, and Alexander J. Smola. "Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo." *ICML*. 2015. [PDF] jmlr.org
- Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. "On-Average KL-Privacy and Its Equivalence to Generalization for Max-Entropy Mechanisms." *International Conference on Privacy in Statistical Databases*. Springer International Publishing, 2016. [PDF] arxiv.org

**Abstract:**
Phylogenetic trees, which is inferring evolutionary histories, has important applications in biology, criminology and public health. This talk presents a continuous space which models the set of all phylogenetic trees having a fixed set of leaves. This space has a natural metric of non-positive curvature, allowing a valid procedure for averaging or combining several trees whose leaves are identical. This talk also presents limiting behavior of such averaged trees.

**Abstract:**
We consider the problem of estimating a function defined over n locations on a d-dimensional grid
(having all side lengths equal to n^1/d). When the function is constrained to have discrete total variation
bounded by Cn, we derive the minimax optimal (squared) \ell_2 estimation error rate, parametrized by n and
Cn. Total variation denoising, also known as the fused lasso, is seen to be rate optimal. Several simpler
estimators exist, such as Laplacian smoothing and Laplacian eigenmaps. A natural question is: can these
simpler estimators perform just as well? We prove that these estimators, and more broadly all estimators
given by linear transformations of the input data, are suboptimal over the class of functions with bounded
variation. This extends fundamental findings of Donoho and Johnstone [1998] on 1-dimensional total
variation spaces to higher dimensions. The implication is that the computationally simpler methods cannot
be used for such sophisticated denoising tasks, without sacrificing statistical accuracy. We also derive
minimax rates for discrete Sobolev spaces over d-dimensional grids, which are, in some sense, smaller
than the total variation function spaces. Indeed, these are small enough spaces that linear estimators can
be optimal—and a few well-known ones are, such as Laplacian smoothing and Laplacian eigenmaps, as
we show.

Relevant paper.

**Abstract:**
The estimation of time-varying networks for functional Magnetic Resonance Imaging (fMRI) data sets is of increasing importance and interest. In this work, we formulate the problem in a high-dimensional time series framework and introduce a data-driven method, namely Network Change Points Detection (NCPD), which detects change points in the network structure of a multivariate time series, with each component of the time series represented by a node in the network. NCPD is applied to various simulated data and a resting-state fMRI data set. This new methodology also allows us to identify common functional states within and across subjects. Finally, NCPD promises to offer a deep insight into the large-scale characterisations and dynamics of the brain. This is joint work with Ivor Cribben (Alberta School of Business).

**Abstract:**
I will continue the topic that I introduced last semester on computationally efficient experiment selection in regression models, with additional results and analysis. I will analyze a sampling based method and a greedy method, both of which attain provable near-optimal statistical performance in polynomial time while enjoy different finite sample performance.

**Abstract:**
The availability of high-dimensional data in statistical applications has created an urgent need for methodologies to pursue sparse and/or low rank models. To provide a proper amount of shrinkage, most statistical approaches use a grid search with a model comparison criterion to locate proper values of the regularization parameters. We study cross-validation for multivariate models where relevant features may lie in a low dimensional subspace. By cross-validating candidate projection-selection patterns instead of regularization parameters, we are able to link cross-validation to a class of information criteria. A scale-free rate calibration helps cross-validation achieve non-asymptotic optimality in prediction.

**Abstract:**
I present a theory of point and interval estimation for nonlinear functionals in parametric, semi-, and non-parametric models based on higher order influence functions. The theory reproduces many previous results, produces new non-root n results, and opens up the ability to perform optimal non-root n inference in complex high dimensional models. We present novel rate-optimal point and intervals estimators for various functionals of central importance to statistics and biostatistics in settings in which estimation at the expected root n rate is not possible, owing to the curse of dimensionality.

**Abstract:**
This paper studies how to capture dependency graph structures from real data which may not be multivariate Gaussian. Starting from marginal loss functions not necessarily derived from probability distributions, we use an additive over-parametrization with shrinkage to incorporate variable dependencies into the criterion. An iterative Gaussian graph learning algorithm is proposed with ease in implementation. Statistical analysis shows that with the error measured in terms of a proper Bregman divergence, the estimators have fast rate of convergence. Real-life examples in different settings are given to demonstrate the efficacy of the proposed methodology.

**Abstract:**
I’ll be presenting some of my recent work on statistical matching - the problem of estimating a joint distribution or properties of it when many pairs of random variables are never sampled simultaneously. It is well established that statistical matching encounters problems of non-identifiability unless certain assumptions, such as conditional independence of any pair of variables never sampled together, hold. In this work, we reexamine the conditions under which models can be identified in the statistical matching scenario. We show that latent variable models, such as factor analysis and latent trait models, while violating the typical conditional independence assumptions appealed to in statistical matching, are nonetheless, generically identifiable. Intuitively, generic identifiability establishes identifiability for sets of randomly generated parameters, and we present conditions establishing when latent variable models are generically identifiable and unidentifiable in the statistical matching scenario. I’ll also spend some time describing real neuroscience analyses which motivate the use of statistical matching. This is joint work with Byron Yu and Geoff Gordon that we are currently writing up (so no paper yet), and we would love to get your feedback.

**Abstract:**
Sketching techniques have become popular for scaling up machine
learning algorithms by reducing the sample size or dimensionality of
massive data sets, while still maintaining the statistical power of
big data. We study sketching from an optimization point of view. We
first show that the iterative Hessian sketch is an optimization
process with preconditioning, and develop accelerated iterative
Hessian sketch via the searching the conjugate direction; we then
establish primal-dual connections between the Hessian sketch and dual
random projection, and apply the preconditioned conjugate gradient
approach on the dual problem, which leads to the acclerated iterative
dual random projection methods. Finally to tackle the challenges from
both large sample size and high-dimensionality, we propose the
primal-dual sketch, which iteratively sketches the primal and dual
formulations.

Joint work with Jialei Wang, Jason D. Lee, Mehrdad Mahdavi and Nati Srebro

**Abstract:**
It is commonplace to encounter nonstationary data, of which the underlying generating process may change over time or across domains. The nonstationarity presents both challenges and opportunities for causal discovery. In this paper we propose a principled framework to handle nonstationarity, and develop methods to address three important questions. First, we propose an enhanced constraint-based method to detect variables whose local mechanisms are nonstationary and recover the skeleton of the causal structure over observed variables. Second, we present a way to determine some causal directions by taking advantage of information carried by changing distributions. Third, we develop a method for visualizing the nonstationarity of local mechanisms. Experimental results on various synthetic and real-world datasets are presented to demonstrate the efficacy of our methods.

Relevant paper: https://arxiv.org/abs/1509.08056

**Abstract:**
In many real world problems such as online advertisement, web search, movie recommendations, personalized medical treatment. One can only observe outcomes of the actions (be it ads, search results, movies or treatments) that were taken. Let the algorithm used to generate these actions be a “policy”, we consider the problem of **off-policy** evaluation, where we collect data using a policy and then we try to evaluate the performance of a different policy. In other word, this is to answer the “What-If” question: what if the other policy was deployed, what would the outcomes be?
In this talk, I will do the following:

- Formulate the problem as a minimax-estimation problem and states its minimax risk. The result suggests that the simplest possible approach: “Importance Sampling” a.k.a “Inverse Propensity Scoring” in some sense cannot be improved, if no model assumptions are made.
- Illustrate that the state-of-the-art “Doubly Robust” estimator that incorporates model-based approaches is strictly suboptimal.
- Describe a new adaptive estimator that overcomes the issues with “Doubly Robust”.

Relevant slides: http://www.cs.cmu.edu/~yuxiangw/docs/minimax_eval_talk.pdf

**Abstract:**
Continuous treatments (e.g., doses) arise often in practice, but many available causal effect estimators are limited by either requiring parametric models for the effect curve, or by not allowing doubly robust covariate adjustment. We develop a novel kernel smoothing approach that requires only mild smoothness assumptions on the effect curve, and still allows for misspecification of either the treatment density or outcome regression. We derive asymptotic properties and give a procedure for data-driven bandwidth selection. The methods are illustrated via simulation and in a study of the effect of nurse staffing on hospital readmissions penalties.

Preprint: http://arxiv.org/abs/1507.00747

**Abstract:**
In this talk I will introduce a framework for constructing goodness of fit tests in both low and high-dimensional linear models. The idea involves applying regression methods to the scaled residuals following either an ordinary least squares or Lasso fit to the data, and using some proxy for prediction error as the final test statistic. We call this family Residual Prediction (RP) tests. We show that simulation can be used to obtain the critical values for such tests in the low-dimensional setting, and demonstrate that some form of the parametric bootstrap can do the same when the high-dimensional linear model is under consideration. We show that RP tests can be used to test for significance of groups or individual variables as special cases, and here they compare favourably with state of the art methods, but we also argue that they can be designed to test for as diverse model misspecifications as heteroscedasticity and different types of nonlinearity.
This is joint work with Peter Bühlmann.

Preprint: http://www.statslab.cam.ac.uk/~rds37/papers/RPtests

**Abstract:**
Modern learning algorithms are often seen as prediction-only tools, meaning that the interpretability and intuition provided by a more traditional modeling approach are sacrificed in order to achieve superior predictions. In this talk, we argue that this black-box perspective need not always be the case and develop formal statistical inference procedures for predictions generated by supervised learning ensembles. Ensemble methods based on bootstrapping, such as bagging and random forests, usually improve the predictive accuracy of individual trees, but fail to provide a framework in which distributional results can be easily determined. Instead of aggregating full bootstrap samples, we consider predicting by averaging over trees built on subsamples of the training set and demonstrate that the resulting estimator takes the form of a U-statistic. As such, predictions for individual feature vectors are asymptotically normal, allowing for confidence intervals to accompany predictions. In practice, a subset of subsamples is used for computational speed; here our estimators take the form of incomplete U-statistics and equivalent results are derived. We further demonstrate that this setup provides a framework for testing the significance of features. Moreover, the internal estimation method we develop allows us to estimate the variance parameters and perform these inference procedures at no additional computational cost. Demonstrations are provided using data from the ebird project hosted at Cornell University.

Here is the link to the relevant paper: http://jmlr.org/papers/v17/14-168.html