Contextual bandit and off-policy evaluation: minimax bounds and new algorithm

Speaker: Yu-Xiang Wang

Abstract: n many real world problems such as online advertisement, web search, movie recommendations, personalized medical treatment. One can only observe outcomes of the actions (be it ads, search results, movies or treatments) that were taken. Let the algorithm used to generate these actions be a “policy”, we consider the problem of **off-policy** evaluation, where we collect data using a policy and then we try to evaluate the performance of a different policy. In other word, this is to answer the “What-If” question: what if the other policy was deployed, what would the outcomes be? In this talk, I will do the following:

Formulate the problem as a minimax-estimation problem and states its minimax risk. The result suggests that the simplest possible approach: “Importance Sampling” a.k.a “Inverse Propensity Scoring” in some sense cannot be improved, if no model assumptions are made.
Illustrate that the state-of-the-art “Doubly Robust” estimator that incorporates model-based approaches is strictly suboptimal.
Describe a new adaptive estimator that overcomes the issues with “Doubly Robust”.

Relevant slides: http://www.cs.cmu.edu/~yuxiangw/docs/minimax_eval_talk.pdf