Speaker: Kirthevasan Kandasamy

**Abstract:**
We study a variant of the classical stochastic K-armed bandit problem where observing the outcome of each arm is expensive, but cheap approximations to this outcome are available. For example, when finding the optimal control policy of a robot, its expensive behaviour can be approximated by cheap computer simulations. Similarly, when minimising cross validation error to tune hyper-parameters of machine learning algorithms, cheaper training routines involving subsets of the data or early stopping can be used to approximate the cross validation curve of the entire dataset.
We formalise this task as a multi-fidelity bandit problem, where at each time step the forecaster may choose to play an arm at one of M fidelities. A play at fidelity M, i.e. the desired outcome, expends λ(M) of a resource whereas a play at fidelity m<M, i.e. an approximation, expends cost λ(m). We will be particularly interested in settings where λ(M)>> λ(m) for all m. We develop MF-UCB, a novel upper confidence bound procedure for this setting and prove that it naturally adapts to the sequence of available approximations. For instance, in the above robot control example, MF-UCB would use the cheap computer simulations to quickly eliminate suboptimal policies, while reserving real world trials for a small set of promising candidates. We derive lower bounds for this problem and show that MF-UCB is optimal on a wide range of settings.