Speaker: Tim Coleman
Abstract: Ensemble methods are popular models for making predictions from many observations on predictors that may interact in complex ways. The regression function fit by an ensemble method is an average of many nonparametric base models and is thus difficult to analyze. We propose a permutation test approach to feature significance in random forests (which extends easily to other bagged learners) that exploits the averaging nature of ensemble methods. In particular, we permute the individual trees of two random forests: one trained with a reduced dataset and one trained with the full dataset. This method can be used to analyze model predictions at large numbers of test points at minimal extra cost. Moreover, the testing framework works with many existing implementations of random forests. We prove the asymptotic validity of the hypothesis test by exploring the connection between exchangeable random variables and ensemble methods. Further, we establish asymptotic normality for a wide variety of random forest metrics. By relying on convergence of the permutation distribution to the null distribution, we avoid the difficulty of estimating the variance of random forests, which has inhibited the practical implementation of many distributional random forest results. Numerical results demonstrate that the test maintains Type I error validity and attains good power in practical random forest implementations.
Related paper: https://arxiv.org/pdf/1304.5939.pdf
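Illustration: the tree-permutation idea described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the speaker's implementation. It assumes the per-tree predictions of both forests at a common set of test points are already available (for example, extracted from any existing random forest library), and it assumes the test statistic is the difference in test-set mean squared error between the reduced and full forests. The function names forest_mse and tree_permutation_test are hypothetical.

```python
import random


def forest_mse(tree_preds, y):
    """MSE of a forest whose prediction is the average of its trees.

    tree_preds: list of trees, each a list of predictions at the test points.
    y: list of observed responses at the same test points.
    """
    n_trees = len(tree_preds)
    total = 0.0
    for j, target in enumerate(y):
        avg = sum(tree[j] for tree in tree_preds) / n_trees
        total += (avg - target) ** 2
    return total / len(y)


def tree_permutation_test(full_trees, reduced_trees, y, n_perm=1000, seed=0):
    """Permutation test that shuffles individual trees between two forests.

    Assumed statistic: MSE(reduced forest) - MSE(full forest). Large values
    suggest the information missing from the reduced dataset matters.
    Returns (observed statistic, one-sided permutation p-value).
    """
    rng = random.Random(seed)
    observed = forest_mse(reduced_trees, y) - forest_mse(full_trees, y)

    pooled = list(full_trees) + list(reduced_trees)
    n_full = len(full_trees)
    exceed = 0
    for _ in range(n_perm):
        # Reassign trees to the two forests at random; no retraining needed.
        rng.shuffle(pooled)
        stat = forest_mse(pooled[n_full:], y) - forest_mse(pooled[:n_full], y)
        if stat >= observed:
            exceed += 1

    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + exceed) / (1 + n_perm)
    return observed, p_value
```

Because each permutation only re-averages predictions that were already computed, the cost of the test grows with the number of trees and test points rather than with the cost of refitting forests, which is the "minimal extra cost" property the abstract refers to.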