Data blurring: sample splitting a single sample

24 Feb 2022, 4:00 PM, NSH 3305

Speaker: James Leiner

Abstract: Suppose we observe a random vector X from some distribution P in a known family with unknown parameters. We ask the following question: when is it possible to split X into two parts f(X) and g(X) such that neither part is sufficient to reconstruct X by itself, but both together can recover X fully, and the joint distribution of (f(X), g(X)) is tractable? As one example, if X = (X1, …, Xn) and P is a product distribution, then for any m < n we can split the sample to define f(X) = (X1, …, Xm) and g(X) = (Xm+1, …, Xn). Rasines and Young (2021) offer an alternative route to accomplishing this task through randomization of X with additive Gaussian noise, which enables post-selection inference in finite samples for Gaussian-distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data blurring, as an alternative to data splitting, data carving, and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.

Paper: https://arxiv.org/abs/2112.11079
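
The abstract only gestures at the additive-noise construction, so here is a minimal numpy sketch of the Gaussian case (the helper name `blur`, the tuning parameter `tau`, and the noise scale are our notation for illustration, not the paper's API): an observation X ~ N(mu, sigma^2) is split using fresh noise Z ~ N(0, sigma^2) into f(X) = X + tau*Z and g(X) = X - Z/tau, which are independent, each carry only partial information about mu, and together reconstruct X exactly.

```python
import numpy as np

def blur(x, sigma, tau, rng):
    """Split x ~ N(mu, sigma^2) into two independent pieces.

    Z ~ N(0, sigma^2) is external randomness; tau > 0 trades off how
    much information each piece carries about the unknown mean mu.
    Returns (f, g) with f ~ N(mu, (1 + tau^2) sigma^2) and
    g ~ N(mu, (1 + 1/tau^2) sigma^2), independent of each other.
    """
    z = rng.normal(0.0, sigma, size=np.shape(x))
    f = x + tau * z
    g = x - z / tau
    return f, g

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=100_000)
f, g = blur(x, sigma=1.0, tau=0.5, rng=rng)

# Neither piece alone determines x, but a convex combination of the
# two recovers it exactly: (f + tau^2 * g) / (1 + tau^2) = x.
assert np.allclose((f + 0.25 * g) / 1.25, x)

# Cov(f, g) = sigma^2 - sigma^2 = 0, and (f, g) is jointly Gaussian,
# so the pieces are independent; the empirical correlation is ~ 0.
print("empirical corr(f, g):", np.corrcoef(f, g)[0, 1])
```

In a post-selection workflow of the kind the abstract mentions, f(X) would be used to select a model (e.g., knots for trend filtering) and g(X) to carry out inference on the selected model, in place of splitting distinct samples.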