Active Learning in Bandits and MDPs
A $K$-armed bandit problem involves designing a sampling strategy to identify the distribution (i.e., arm) with the largest mean. A related problem, called active learning in bandits considers the objective of learning the means of all the $K$ distributions uniformly well in terms of squared error ( (Antos et al., 2008)1 and (Carpentier et al., 2011)2. However, in many applications, it is required to learn the entire distribution in terms of some distance metric $D$, and not just the mean. For instance, consider the task of constructing an accurate model of an MDP given an exploration budget. This can be framed as the problem of learning the transition probability vectors corresponding to all state-action pairs uniformly well in $\ell_1$ distance. The techniques developed in prior work are not applicable to this problem.
Learning distributions with bandit feedback
In (Shekhar et al., 2019)3, we proposed a general sampling scheme for learning multiple distributions uniformly well in terms of several commonly used distance metrics such as $\ell_2^2$, total variation, $f$-divergence, and separation distance. Our main contributions were:
We began by studying an abstract version of this tracking problem, and proposed and analyzed a general optimistic tracking algorithm.
Next, we instantiated this algorithm for the specific problem instances arising in the case of the four distance metrics mentioned above, and obtained high probability bounds on the regret.
We showed that the allocation performance of our algorithm cannot be improved by deriving matching lower bounds on the allocation performance on any reasonable algorithm.
Finally, we showed through empirical evaluation that our proposed scheme works better than uniform sampling, greedy sampling and forced exploration sampling baselines.
Learning MDP models
An obstacle in applying the above ideas to learn MDP models, i.e., the $S \times A$ conditional distributions, is that we cannot observe arbitrary state-action trasitions in an MDP. This can be addressed by designing policies which spend an appropriate proportion of the time in different state-action pairs, as proposed in (Tarbouriech & Lazaric, 2019)4. In (Tarbouriech et al. 2020)5, we took this approach and proposed an algorithm and derived sample complexity results of estimating the model of a finite MDP with $\epsilon$ accuracy. Next, we also proposed a heuristic exploration strategy, based on weighted maximum entropy, which outperforms some baselines in experiments.