Research Projects

Our research interests span a broad range of topics; some of the main ones are highlighted below.

Active and Compressive Learning

Active and compressive queries are powerful tools for minimizing sample complexity in many problems. Active queries refer to feedback-driven data collection, and compressive queries refer to collecting data in a compressed representation. In this project, we focus on how to leverage these types of queries to efficiently learn structure in data in the form of clusters, graphs, subspaces and manifolds.

In the active query setting, this extends previous work, which has mostly focused on "supervised" active learning, where the only choice is whether or not to observe the label of a data point. In the unsupervised context, we instead choose whether or not to observe a data point, or a particular feature of a data point, given that the data has some (unknown) structure. The questions we are investigating include: How many active pairwise similarity measurements are needed for active clustering? How many actively chosen entries of a matrix or tensor are needed for completion and prototype selection? How many active samples are needed for learning a graphical model? And in these problems, what other benefits (better computation, storage, a larger model class, ...) do active queries provide?

In the compressive query setting, we focus on active compressive queries, where the subset of data points or features that are compressed can be adapted based on feedback from prior measurements. Another theme is to use a different compression operator for each data point, which can reduce the number of queries beyond what is possible with a fixed compression operator, provided there are enough data points.
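As a toy illustration of compressive queries, the sketch below observes a sparse signal only through a small number of random linear measurements and then recovers it with a standard sparse-recovery routine. The Gaussian measurement matrix, the sparsity level, and the use of orthogonal matching pursuit are illustrative assumptions for this sketch, not the specific schemes studied in this project.

    # Toy sketch: compressive queries of a sparse signal via random linear measurements.
    # All dimensions and the recovery method are illustrative choices.
    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.default_rng(0)
    d, m, k = 100, 30, 5                    # ambient dimension, measurements, sparsity
    x = np.zeros(d)
    x[rng.choice(d, k, replace=False)] = rng.standard_normal(k)   # sparse signal

    A = rng.standard_normal((m, d)) / np.sqrt(m)   # compression operator
    y = A @ x                                      # compressive measurements (m << d)

    # Recover x from the compressed observations by sparse regression of y on A.
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False).fit(A, y)
    print("recovery error:", np.linalg.norm(omp.coef_ - x))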

This research has been supported in part by NSF CAREER award IIS-1247658, grants IIS-1252412, IIS-1116458, AFOSR Young Investigator award FA9550-14-1-0285 and grant FA9550-10-1-0382.


Some relevant papers:


Adaptive Nonparametric Estimation

"Trend filtering" refers to a family of estimators for locally adaptive nonparametric regression, defined by the minimization of a penalized least squares criterion. The solutions of this problem take the form of piecewise polynomial functions, with adaptively chosen knot points. Trend filtering balances the strengths of other common piecewise polynomial estimators: smoothing splines are highly computationally efficient but are not minimax optimal (for estimating functions whose derivatives are of bounded variation); locally adaptive regression splines are minimax optimal but are relatively inefficient in terms of computation; trend filtering is both minimax optimal and computationally comparable to smoothing splines.

Our current work focuses on extending the trend filtering framework to nonparametric estimation in high dimensions and over graphs.


Some relevant papers:


Shard Hunting

Shards are low-dimensional structures in the ambient space that represent high-density regions. Common examples of shards include density level sets, density local modes, filaments and density ridges. Detecting shards from data is important in many scientific areas, including astronomy, biology and neuroscience. Our work focuses on (i) deriving the asymptotic behavior of shard estimators, (ii) establishing valid procedures for statistical inference in shard hunting, and (iii) visualizing shards in high dimensions.
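As a small illustration of one simple kind of shard, the sketch below locates density local modes with mean shift; scikit-learn's MeanShift, the bandwidth, and the simulated two-cluster data are illustrative stand-ins, not the estimators analyzed in our work.

    # Toy sketch: estimate density local modes (one simple kind of "shard") via mean shift.
    import numpy as np
    from sklearn.cluster import MeanShift

    rng = np.random.default_rng(0)
    # Two Gaussian bumps, so the underlying density has two local modes.
    X = np.vstack([rng.normal(loc, 0.3, size=(150, 2)) for loc in ([0, 0], [3, 3])])

    ms = MeanShift(bandwidth=0.8).fit(X)   # bandwidth chosen only for illustration
    print("estimated modes:\n", ms.cluster_centers_)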


Some relevant papers:


Sparse Principal Component Analysis

Sparse principal component analysis (PCA) is a tool for simultaneous dimension reduction and variable selection in high-dimensional multivariate data. In addition to finding a few directions in the high-dimensional space that explain most of the variability in the observed data, sparse PCA requires that these directions involve only a few of the variables. The theoretical and methodological questions we are trying to answer include: (i) How does the assumption of sparsity help in accurately estimating the principal components? (ii) How can the sparsity structure be incorporated efficiently into algorithms? (iii) How can we tell whether the sparsity assumption is appropriate for a given PCA problem? (iv) How can sparse PCA algorithms be used in other relevant problems, such as high-dimensional clustering?
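A minimal sketch of the basic idea, using scikit-learn's SparsePCA on simulated data, is below; the data matrix, the number of components, and the penalty alpha are illustrative placeholders rather than settings from any of our studies.

    # Minimal sketch: sparse PCA finds a few directions, each involving few variables.
    import numpy as np
    from sklearn.decomposition import SparsePCA

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 50))        # 100 samples, 50 variables (toy data)

    spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
    scores = spca.fit_transform(X)            # low-dimensional representation
    loadings = spca.components_               # sparse directions: many exact zeros

    print("nonzero loadings per component:", (loadings != 0).sum(axis=1))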

Some relevant papers:


Statistical Testing in High Dimensions

Two-sample testing deals with deciding whether two random variables have the same distribution, given samples from both. This problem has applications in biology and neuroscience, and one is often interested in the nonparametric setting, in which no parametric assumptions are made about the underlying distributions. We are trying to understand the behavior of various two-sample tests in the high-dimensional setting, as well as other interesting directions, including connecting two-sample testing with classification and developing provable sequential testing procedures.
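For concreteness, the sketch below runs a generic nonparametric two-sample test: a permutation test built on a Gaussian-kernel maximum mean discrepancy (MMD) statistic. The kernel, bandwidth, and permutation scheme are standard illustrative choices, not the specific tests analyzed in our papers.

    # Toy sketch: permutation two-sample test with a (biased) Gaussian-kernel MMD statistic.
    import numpy as np

    def mmd2(X, Y, sigma=1.0):
        """Biased squared MMD estimate with a Gaussian kernel of bandwidth sigma."""
        Z = np.vstack([X, Y])
        sq = np.sum(Z**2, axis=1)
        K = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / (2 * sigma**2))
        n = len(X)
        Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
        return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

    def permutation_pvalue(X, Y, n_perm=500, sigma=1.0, seed=0):
        """P-value from randomly re-splitting the pooled sample and recomputing MMD."""
        rng = np.random.default_rng(seed)
        observed = mmd2(X, Y, sigma)
        Z = np.vstack([X, Y])
        n, count = len(X), 0
        for _ in range(n_perm):
            idx = rng.permutation(len(Z))
            if mmd2(Z[idx[:n]], Z[idx[n:]], sigma) >= observed:
                count += 1
        return (count + 1) / (n_perm + 1)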

Some relevant papers:

Another direction we are pursuing is hypothesis testing after model selection. In regression, for example, many basic selection and estimation procedures, such as forward stepwise regression and the lasso, are well understood in terms of their estimation capabilities, but not in terms of their inferential properties. Our work has focused on formulating test statistics for forward stepwise and the lasso, for the purpose of post-model-selection hypothesis testing, whose null distribution is known either exactly (in finite samples) or asymptotically under a normal error model. Currently we are broadening this perspective to allow for non-normal or possibly even heteroskedastic errors, by calling upon tools such as the bootstrap and sample splitting.
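As one simple illustration of the sample-splitting idea, the sketch below selects variables with the lasso on one half of the data and then computes ordinary t-tests for the selected variables on the other half; the simulated design, the cross-validated lasso, and the use of scikit-learn and statsmodels are illustrative choices rather than the procedures developed in our papers.

    # Toy sketch of inference after selection via sample splitting.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(0)
    n, p = 200, 30
    X = rng.standard_normal((n, p))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.standard_normal(n)   # two true signals

    half = n // 2
    lasso = LassoCV(cv=5).fit(X[:half], y[:half])     # selection on the first half
    selected = np.flatnonzero(lasso.coef_ != 0)

    # Inference on the second half: standard normal-theory p-values apply here
    # because the selected model does not depend on this half of the data.
    ols = sm.OLS(y[half:], sm.add_constant(X[half:, selected])).fit()
    print(ols.summary())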

Some relevant papers:


Topological Data Analysis

Topological Data Analysis (TDA) is a collection of methods for extracting useful features from data. Clustering is a simple example of TDA. More sophisticated forms of TDA include manifold estimation, ridge hunting and persistent homology. The latter is concerned with multiscale methods for finding holes (low density regions) in point clouds.
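To give a rough sense of the multiscale idea behind persistent homology, the sketch below tracks the number of connected components (0-dimensional features) of a neighborhood graph as the scale parameter grows; a real analysis would use a dedicated TDA library and also track higher-dimensional features such as loops, so this pure scipy version is only a toy illustration.

    # Toy sketch: count connected components of a neighborhood graph across scales.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    rng = np.random.default_rng(0)
    # Points sampled near a circle: one connected component, one "hole" in the middle.
    theta = rng.uniform(0, 2 * np.pi, 100)
    X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((100, 2))

    D = squareform(pdist(X))
    for eps in [0.05, 0.1, 0.2, 0.5]:
        graph = csr_matrix(D <= eps)               # connect points within distance eps
        n_comp, _ = connected_components(graph, directed=False)
        print(f"scale {eps}: {n_comp} connected component(s)")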

Please visit http://www.stat.cmu.edu/topstat/ for more information.