Welcome to the homepage of the
Statistics and Machine Learning Research Group at Carnegie Mellon University!

We are a group of faculty and students in Statistics and Machine Learning, broadly interested in research at the intersection of these two disciplines.

Unless otherwise notified, we meet every Thursday, 3-4pm, in NSH-3305. Please email ynwang dot yining at gmail dot com if you would like to join our mailing list.


If you would like to present in an upcoming meeting, please sign up here.

Topic choice is flexible. As a guideline, here is a list of interesting papers that we hope to read this semester.


Talks:


Asymptotics of objective functionals in semi-supervised learning

Oct 26 (Thursday) at 3pm in NSH-3305
Speaker: Simon Wilson

Abstract We consider a regression problem of semi-supervised learning: given real-valued labels on a small subset of the data, recover the function on the whole data set while taking into account the information provided by a large number of unlabeled data points. Objective functionals modeling this regression problem involve terms rewarding the regularity of the function estimate while enforcing agreement with the labels provided. We will discuss regularizations motivated by the p-Laplace equation. We will discuss and prove which of these functionals make sense when the number of data points goes to infinity. The talk is based on joint work with Matthew Thorpe (arXiv:1707.06213).
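
To make the objective concrete, here is a minimal sketch (not code from the talk) of graph-based semi-supervised regression in the p = 2 case, where the estimate minimizes the graph Dirichlet energy f^T L f subject to agreeing with the given labels; all data, parameters, and variable names are illustrative.

```python
import numpy as np

# Minimal sketch (not from the talk): semi-supervised regression on a point
# cloud with the p = 2 graph-Laplacian regularizer and hard label constraints.
rng = np.random.default_rng(0)

n, n_labeled = 300, 10
X = rng.uniform(0, 1, size=(n, 2))            # labeled + unlabeled points
f_true = np.sin(2 * np.pi * X[:, 0])          # hypothetical ground-truth function
labeled = np.sort(rng.choice(n, n_labeled, replace=False))
y = f_true[labeled] + 0.1 * rng.standard_normal(n_labeled)

# Weighted similarity graph on all n points (Gaussian kernel, bandwidth eps).
eps = 0.15
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-D2 / eps ** 2)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W                # unnormalized graph Laplacian

# Minimize the Dirichlet energy f^T L f subject to f = y on the labeled set:
# solve the Laplacian system for f on the unlabeled block.
unlab = np.ones(n, dtype=bool)
unlab[labeled] = False
f_hat = np.empty(n)
f_hat[labeled] = y
f_hat[unlab] = np.linalg.solve(L[np.ix_(unlab, unlab)],
                               -L[np.ix_(unlab, ~unlab)] @ y)

print("RMSE on unlabeled points:",
      np.sqrt(np.mean((f_hat[unlab] - f_true[unlab]) ** 2)))
```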

Random closed sets and their expectation

Oct 19 (Thursday) at 3pm in NSH-3305
Speaker: Jaehyeok Shin

Abstract In this talk, I will review some basic concepts of random closed set theory. Specifically, I will give a brief review of two intuitive definitions of the expectation of a random closed set – the Vorob’ev expectation and the ODF expectation. The former minimizes the expected measure of the symmetric difference between the random set and its expectation. I will discuss how one might use this property to construct a representative, non-random predictive region from observed random predictive regions. The latter has some attractive properties for shape and boundary estimation problems, including inclusion relations, convexity preservation, and equivariance with respect to rigid motions. References:
  1. Molchanov, Ilya. Theory of Random Sets. Springer Science & Business Media, 2006.
  2. Jankowski, Hanna K., and Larissa I. Stanberry. "Expectations of random sets and their boundaries using oriented distance functions." Journal of Mathematical Imaging and Vision 36.3 (2010): 291-303.
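
As a toy numerical illustration of the Vorob'ev construction described in the abstract above (not code from the talk), the sketch below computes the empirical coverage function of simulated random discs and thresholds it at the level whose upper level set best matches the mean area of the sampled sets.

```python
import numpy as np

# Minimal sketch (illustrative): empirical Vorob'ev expectation of random sets
# represented as binary masks on a grid.
rng = np.random.default_rng(1)

grid = np.linspace(-2, 2, 200)
xx, yy = np.meshgrid(grid, grid)

# Sample random discs with noisy centers and radii (a toy random closed set).
masks = []
for _ in range(100):
    cx, cy = 0.2 * rng.standard_normal(2)
    r = 1.0 + 0.1 * rng.standard_normal()
    masks.append(((xx - cx) ** 2 + (yy - cy) ** 2) <= r ** 2)
masks = np.array(masks)

coverage = masks.mean(axis=0)          # empirical coverage function p(x)
mean_area = masks.mean()               # mean fraction of the grid covered

# Choose the threshold whose upper level set has (fractional) area closest to
# the mean area; that level set is the empirical Vorob'ev expectation.
levels = np.linspace(0, 1, 101)
areas = np.array([(coverage >= t).mean() for t in levels])
t_star = levels[np.argmin(np.abs(areas - mean_area))]
vorobev = coverage >= t_star

print(f"threshold = {t_star:.2f}, area match: {vorobev.mean():.4f} vs {mean_area:.4f}")
```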

A new "permutation-based" look at noisy non-negative matrix completion

Oct 12 (Thursday) at 3pm in NSH-3305
Speaker: Nihar Shah

Abstract Noisy non-negative matrix completion involves reconstructing a structured matrix whose entries are partially observed in noise. Standard approaches to this problem are based on assuming that the underlying matrix has low (non-negative) rank. We first describe how this classical non-negative rank model enforces restrictions that may be quite undesirable in practice. We propose a richer model based on what we term the "permutation-rank" of a matrix, and show how these restrictions can be avoided by using this richer model. Second, we establish the minimax rates of estimation under the new permutation-based model, and prove that, surprisingly, the minimax rates are equivalent, up to logarithmic factors, to those for estimation under the usual low-rank model. We also analyze a computationally efficient singular-value-thresholding algorithm, known to be optimal for the low-rank setting, and show that it simultaneously yields a consistent estimator for the low permutation-rank setting.
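
For intuition about the singular-value-thresholding estimator mentioned at the end of the abstract, here is a minimal sketch on a synthetic low-rank, partially observed matrix; the threshold choice is ad hoc and not the tuning analyzed in the work.

```python
import numpy as np

# Minimal sketch (illustrative): one-step singular-value thresholding for a
# partially observed noisy non-negative low-rank matrix.
rng = np.random.default_rng(2)

n, d, r, p_obs, sigma = 200, 150, 3, 0.3, 0.1
M = rng.uniform(0, 1, (n, r)) @ rng.uniform(0, 1, (r, d))   # low-rank, non-negative
mask = rng.uniform(size=(n, d)) < p_obs                     # observed entries
Y = np.where(mask, M + sigma * rng.standard_normal((n, d)), 0.0)

# Rescale the zero-filled observations by 1/p so the matrix is unbiased for M,
# then soft-threshold its singular values.
U, s, Vt = np.linalg.svd(Y / p_obs, full_matrices=False)
tau = 2.5 * np.sqrt(max(n, d))          # ad hoc threshold on the singular values
M_hat = (U * np.maximum(s - tau, 0.0)) @ Vt

err = np.linalg.norm(M_hat - M) / np.linalg.norm(M)
print(f"relative Frobenius error: {err:.3f}")
```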

Philosophy of Science, Principled Statistical Inference, and Data Science

Oct 5 (Thursday) at 3pm in NSH-3305
Speaker: Todd Kuffner

Abstract Statistical reasoning and statistical inference have strong historical connections with philosophy of science. In this talk, the new paradigm of data-driven science is examined through comparison with principled statistical approaches. I will review the merits and shortcomings of principled statistical inference. The talk will feature a case study of post-selection inference, recent progress regarding inference for black box algorithms, and a survey of future challenges.

Asymptotics of objective functionals in semi-supervised learning

Sep 28 (Thursday) at 3pm in NSH-3305
Speaker: Dejan Slepcev

Abstract We consider a regression problem of semi-supervised learning: given real-valued labels on a small subset of the data, recover the function on the whole data set while taking into account the information provided by a large number of unlabeled data points. Objective functionals modeling this regression problem involve terms rewarding the regularity of the function estimate while enforcing agreement with the labels provided. We will discuss regularizations motivated by the p-Laplace equation. We will discuss and prove which of these functionals make sense when the number of data points goes to infinity. The talk is based on joint work with Matthew Thorpe (arXiv:1707.06213).

Locating the minimum of a function from adaptive queries

Sep 21 (Thursday) at 3pm in NSH-3305
Speaker: Yining Wang

Abstract I will discuss the question of locating the minimum of an unknown function from noisy adaptive queries. I will discuss why the problem does not make much sense for general nonparametric families such as Hölder classes, and why it becomes interesting under certain shape constraints such as convexity. I will review some results in both the machine learning and the statistics literature on convex regression and zeroth-order optimization, and mention open questions.
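
As a toy illustration of the noisy zeroth-order setting (not a result from the talk), the sketch below locates the minimizer of a one-dimensional convex function from noisy queries by trisection with repeated sampling; all constants are arbitrary, and the achievable accuracy is ultimately limited by the query noise.

```python
import numpy as np

# Minimal sketch (illustrative): trisection search for the minimizer of a 1-D
# convex function, with repeated noisy queries averaged to compare points.
rng = np.random.default_rng(3)

def noisy_query(x, sigma=0.1):
    return (x - 0.3) ** 2 + sigma * rng.standard_normal()   # unknown convex f

lo, hi = 0.0, 1.0
queries_per_point = 200          # average repeated queries so comparisons are reliable
for _ in range(25):              # each round shrinks the interval by a factor of 2/3
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    f1 = np.mean([noisy_query(m1) for _ in range(queries_per_point)])
    f2 = np.mean([noisy_query(m2) for _ in range(queries_per_point)])
    if f1 < f2:                  # by convexity, the minimizer likely lies left of m2
        hi = m2
    else:
        lo = m1

print("estimated minimizer:", (lo + hi) / 2)
```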

Property testing in high dimensional Ising models

Sep 14 (Thursday) at 3pm in NSH-3305
Speaker: Matey Neykov

Abstract We will discuss the information-theoretic limitations of graph property testing in zero-field Ising models. Instead of learning the entire graph structure, sometimes testing a basic graph property such as connectivity, cycle presence, or maximum clique size is a more relevant and attainable objective. Since property testing is more fundamental than graph recovery, any necessary conditions for property testing imply corresponding conditions for graph recovery, while custom property tests can be statistically and/or computationally more efficient than graph-recovery-based algorithms. Understanding the statistical complexity of property testing requires distinguishing between ferromagnetic (i.e., positive interactions only) and general Ising models. Using combinatorial constructs such as graph packing and strong monotonicity, we characterize how target properties affect the corresponding minimax upper and lower bounds within the realm of ferromagnets. On the other hand, by studying the detection of an antiferromagnetic (i.e., negative interactions only) Curie-Weiss model buried in Rademacher noise, we show that property testing is strictly more challenging for general Ising models. We will also briefly discuss two types of correlation-based tests: computationally efficient screening for ferromagnets, and "score-type" tests for general models.
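
To convey the flavor of a correlation-based screen (not the exact tests from the talk), the sketch below tests the "no edges" null by comparing the largest pairwise sample correlation against a Monte Carlo null quantile; the alternative here is a toy positively coupled pair of spins rather than a true Ising sample.

```python
import numpy as np

# Minimal sketch (illustrative): a max-correlation screen for "any edge present"
# against independent Rademacher spins, calibrated by Monte Carlo under the null.
rng = np.random.default_rng(4)

def max_abs_corr(S):
    C = np.corrcoef(S, rowvar=False)
    np.fill_diagonal(C, 0.0)
    return np.abs(C).max()

n, d = 500, 20

# Null threshold: simulate independent +/-1 spins and take the 95% quantile.
null_stats = [max_abs_corr(rng.choice([-1, 1], size=(n, d))) for _ in range(200)]
threshold = np.quantile(null_stats, 0.95)

# Toy positively coupled alternative: spin 1 copies spin 0 most of the time
# (generated directly here rather than by Gibbs sampling, just for illustration).
S = rng.choice([-1, 1], size=(n, d)).astype(float)
flip = rng.uniform(size=n) < 0.8
S[flip, 1] = S[flip, 0]

print("null threshold:", round(threshold, 3),
      " observed statistic:", round(max_abs_corr(S), 3))
```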

Depth-based nonparametric tests for homogeneity of functional data

Sep 7 (Thursday) at 3pm in NSH-3305
Speaker: Gery Geenens

Abstract In this work we study some tests for homogeneity between two independent samples of functional data. The null hypothesis of "homogeneity" here means that the latent stochastic processes which generated the two samples have the same distribution. Most instances of functional data are so complex that it seems natural to opt for nonparametric procedures in this setting. Making use of recent developments on functional depths, we adapt some Kolmogorov-Smirnov- and Cramér-von Mises-type criteria to the functional context. Exact p-values for the test can be obtained via permutations or, when the samples are too large for that to be practical, via an easily implemented bootstrap algorithm. Some real data examples are analyzed.
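
As a rough illustration of the depth-plus-permutation idea (not the exact procedure from the talk), the sketch below assigns each curve a simple integrated pointwise depth with respect to one sample and compares the two groups' depth values with a permutation-calibrated Kolmogorov-Smirnov statistic; the depth used here is a crude stand-in for the functional depths discussed in the talk.

```python
import numpy as np

# Minimal sketch (illustrative): depth-based two-sample homogeneity test for
# curves with a permutation-calibrated Kolmogorov-Smirnov statistic.
rng = np.random.default_rng(5)

t = np.linspace(0, 1, 50)
n1, n2 = 40, 40
sample1 = np.sin(2 * np.pi * t) + 0.3 * rng.standard_normal((n1, t.size))
sample2 = np.sin(2 * np.pi * t) + 0.3 * rng.standard_normal((n2, t.size)) + 0.2  # shifted

def depth_wrt(curves, reference):
    # integrated pointwise (halfspace-like) depth of each curve w.r.t. reference
    ranks = (reference[None, :, :] <= curves[:, None, :]).mean(axis=1)
    return (1.0 - np.abs(2.0 * ranks - 1.0)).mean(axis=1)

def ks_stat(a, b):
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / a.size
    Fb = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return np.abs(Fa - Fb).max()

def test_stat(pooled, is_first):
    ref = pooled[is_first]
    return ks_stat(depth_wrt(pooled[is_first], ref), depth_wrt(pooled[~is_first], ref))

pooled = np.vstack([sample1, sample2])
labels = np.zeros(n1 + n2, dtype=bool)
labels[:n1] = True
observed = test_stat(pooled, labels)

# Permutation p-value: reshuffle the group labels and recompute the statistic.
perm = [test_stat(pooled, rng.permutation(labels)) for _ in range(500)]
p_value = (1 + sum(s >= observed for s in perm)) / (1 + len(perm))
print(f"KS statistic: {observed:.3f}, permutation p-value: {p_value:.3f}")
```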

Best Subset Selection vs Lasso

Authors: Trevor Hastie, Rob Tibshirani and Ryan Tibshirani

Aug 31 (Thursday) at 3pm in NSH-3305
Speaker: Ryan Tibshirani

Abstract: In exciting new work, Bertsimas et al. (2016) showed that the classical best subset selection problem in regression modeling can be formulated as a mixed integer optimization (MIO) problem. Using recent advances in MIO algorithms, they demonstrated that best subset selection can now be solved at much larger problem sizes than was previously thought possible in the statistics community. They presented empirical comparisons of best subset selection with other popular variable selection procedures, in particular the lasso and forward stepwise selection. Surprisingly (to us), their simulations suggested that best subset selection consistently outperformed both methods in terms of prediction accuracy. Here we present an expanded set of simulations to shed more light on these comparisons. The summary is roughly as follows: (a) neither best subset selection nor the lasso uniformly dominates the other, with best subset selection generally performing better in high signal-to-noise ratio (SNR) regimes and the lasso better in low SNR regimes; (b) best subset selection and forward stepwise perform quite similarly throughout; (c) the relaxed lasso (actually, a simplified version of the original relaxed estimator defined in Meinshausen, 2007) is the overall winner, performing just about as well as the lasso in low SNR scenarios and as well as best subset selection in high SNR scenarios.

  1. https://arxiv.org/pdf/1707.08692.pdf (Hastie et al. discussion)
  2. https://arxiv.org/pdf/1507.03133.pdf (Bertsimas et al. original paper)
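
For a hands-on sense of the lasso versus relaxed-lasso comparison described above (not the paper's simulation code), here is a minimal sketch using the simplified relaxed lasso, i.e. a least-squares refit on the lasso's selected support, evaluated on one synthetic sparse regression instance.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Minimal sketch (illustrative): lasso vs. a simplified relaxed lasso
# (OLS refit on the lasso support) on synthetic sparse data.
rng = np.random.default_rng(6)

n, p, s, snr = 100, 50, 5, 1.0
beta = np.zeros(p); beta[:s] = 1.0
X = rng.standard_normal((n, p))
sigma = np.sqrt(beta @ beta / snr)                 # noise level set by the target SNR
y = X @ beta + sigma * rng.standard_normal(n)
X_test = rng.standard_normal((2000, p))
mu_test = X_test @ beta

def test_risk(coef, intercept=0.0):
    return np.mean((X_test @ coef + intercept - mu_test) ** 2)

for alpha in [0.1, 0.2, 0.4, 0.8]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    support = np.flatnonzero(lasso.coef_)
    relaxed_coef = np.zeros(p)
    if support.size:                               # OLS refit on the active set
        ols = LinearRegression().fit(X[:, support], y)
        relaxed_coef[support] = ols.coef_
        relaxed_int = ols.intercept_
    else:
        relaxed_int = y.mean()
    print(f"alpha={alpha:.2f}  lasso risk={test_risk(lasso.coef_, lasso.intercept_):.3f}"
          f"  relaxed risk={test_risk(relaxed_coef, relaxed_int):.3f}")
```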