Lower Bound for Active Learning in Bandits via Le Cam’s Method

Shubhanshu Shekhar

Summary: We introduce Le Cam's method for obtaining minimax lower bounds for statistical estimation problems, which proceeds by relating the probability of error of a binary hypothesis testing problem to the total variation distance between the two distributions. Then, as a novel application, we use this technique to derive a lower bound on the excess risk for the problem of active learning in bandits. This establishes the near-optimality of existing algorithms due to Antos et al. (2008) and Carpentier et al. (2011).


Consider the following setting:

Suppose we are given a sample $X \sim P$ taking values in some set $\mathcal{X}$, where $P$ belongs to a class of distributions $\mathcal{P}$, and let $\theta(P)$ denote a parameter of interest taking values in a space $\mathcal{D}$ equipped with a metric $d$. Furthermore, suppose $\mathcal{P}_1$ and $\mathcal{P}_2$ are two subsets of $\mathcal{P}$ that are $2\delta$-separated, in the sense that $d\left(\theta(P_1), \theta(P_2)\right) \geq 2\delta$ for all $P_1 \in \mathcal{P}_1$ and $P_2 \in \mathcal{P}_2$. Finally, let $\hat{\theta}: \mathcal{X} \mapsto \mathcal{D}$ denote an estimator of $\theta(P)$. Then for any estimator $\hat{\theta}$ we can obtain the following lower bound on the maximum expected error:

Lemma 1 (Le Cam). With the definitions introduced above, we have
$$\sup_{P \in \mathcal{P}} \mathbb{E}_P \left[ d \left( \hat{\theta}, \, \theta(P) \right) \right] \geq \frac{\delta}{2} \, \max_{P_i \in \mathcal{P}_i} \left(1 - d_{TV}\left(P_1, P_2 \right) \right).$$

In the above display, $d_{TV}$ denotes the total variation distance between two distributions, defined as $d_{TV}(P, Q) = \sup_{E} \left( P(E) - Q(E) \right)$, where the supremum is over all measurable sets $E$. The result above implies that the minimax lower bound for a particular estimation problem is large if there exist two distributions $P_1$ and $P_2$ which (i) are 'well-separated' in terms of the $d$-metric, and (ii) are statistically 'close' in terms of $d_{TV}(\cdot, \cdot)$. Due to their opposing nature, obtaining the best lower bound requires finding the right balance between these two requirements.

The statement of the above result and its proof are based on the statement and proof of Lemma 1 in Yu, B. (1997), 'Assouad, Fano, and Le Cam', in Festschrift for Lucien Le Cam.
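To get a feel for this trade-off, here is a minimal numerical sketch (my own toy example, not part of the original argument): we estimate the mean of a Bernoulli distribution from $n$ i.i.d. samples, take the two hypotheses to be $\text{Ber}(1/2)^{\otimes n}$ and $\text{Ber}(1/2 + \epsilon)^{\otimes n}$ (so the means are $2\delta = \epsilon$ apart), and evaluate the right-hand side of Lemma 1 as $\epsilon$ varies.

```python
# Toy numerical illustration of Lemma 1 (not from the original note): estimating the
# mean of a Bernoulli distribution from n i.i.d. samples. The two hypotheses are
# P_1 = Ber(1/2)^n and P_2 = Ber(1/2 + eps)^n, whose means are 2*delta = eps apart.

from math import comb

def tv_product_bernoulli(p, q, n):
    """Exact TV distance between Ber(p)^n and Ber(q)^n.

    Every sequence with k ones has probability p^k (1-p)^(n-k) under Ber(p)^n,
    so the total variation distance reduces to a sum over the n + 1 possible counts.
    """
    return 0.5 * sum(
        comb(n, k) * abs(p ** k * (1 - p) ** (n - k) - q ** k * (1 - q) ** (n - k))
        for k in range(n + 1)
    )

n = 100
for eps in [0.01, 0.05, 0.10, 0.20, 0.40]:
    delta = eps / 2                               # separation between the two means is 2*delta
    tv = tv_product_bernoulli(0.5, 0.5 + eps, n)
    lower_bound = (delta / 2) * (1 - tv)          # right-hand side of Lemma 1
    print(f"eps = {eps:.2f}   d_TV = {tv:.3f}   Le Cam bound = {lower_bound:.5f}")
```

Small $\epsilon$ makes the two distributions nearly indistinguishable but the separation tiny, while large $\epsilon$ drives the total variation distance toward one; the bound is maximized at an intermediate $\epsilon$ of order $1/\sqrt{n}$.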

Proof of the Lemma

First, we select arbitrary $P_i \in \mathcal{P}_i$ for $i = 1, 2$ and lower-bound the supremum over all $P \in \mathcal{P}$ with a simple average over these two distributions.

$$\sup_{P \in \mathcal{P}}\mathbb{E}_P \left[ d \left( \hat{\theta}, \, \theta(P) \right) \right] \geq \frac{1}{2} \left( \mathbb{E}_{P_1} \left[ d \left( \hat{\theta}, \, \theta(P_1) \right) \right] + \mathbb{E}_{P_2} \left[ d \left( \hat{\theta}, \, \theta(P_2) \right) \right]\right)$$

Next, with the notation $\theta_i = \theta(P_i)$, we observe the following (this is the only point in the proof where we use the fact that $d$ satisfies the triangle inequality): for any $\theta \in \mathcal{D}$ such that $d(\theta, \theta_1) < \delta$, we must have $d(\theta, \theta_2) > \delta$. Similarly, if $d(\theta, \theta_2) < \delta$ then $d(\theta, \theta_1) > \delta$. Both these results follow from the fact that $d$ satisfies the triangle inequality and that $d(\theta_1, \theta_2) \geq 2\delta$. Now, we define the set $E = \left\{ x \in \mathcal{X} \,:\, d\left( \hat{\theta}(x), \theta_1 \right) < d\left( \hat{\theta}(x), \theta_2 \right) \right\}$. Clearly, if $X \in E$, then $d(\hat{\theta}(X), \theta_2) \geq \delta$, and if $X \in E^c$ then $d(\hat{\theta}(X), \theta_1) \geq \delta$. Together, these results imply that $d(\hat{\theta}, \theta_1) \geq \delta 1_{E^c}$ and $d(\hat{\theta}, \theta_2) \geq \delta 1_{E}$, where $1_E$ denotes the indicator function associated with a set $E$. Thus we have
$$\begin{aligned} \sup_{P \in \mathcal{P}}\mathbb{E}_P \left[ d \left( \hat{\theta}, \, \theta(P) \right) \right] & \geq \frac{1}{2} \left( \mathbb{E}_{P_1} \left[ d \left( \hat{\theta}, \, \theta_1 \right) \right] + \mathbb{E}_{P_2} \left[ d \left( \hat{\theta}, \, \theta_2 \right) \right]\right) \\ & \geq \frac{1}{2} \left( \mathbb{E}_{P_1}[\delta 1_{E^c} ] + \mathbb{E}_{P_2}[\delta 1_{E}]\right) \ = \ \frac{\delta}{2} \left( P_1(E^c) + P_2(E) \right) \\ & \geq \frac{\delta}{2} \left( 1 - \left(P_1(E) - P_2(E) \right) \right) \\ & \geq \frac{\delta}{2} \left( 1 - \sup_{E \subset \mathcal{X}}\left(P_1(E) - P_2(E) \right) \right) = \frac{\delta}{2} \left( 1 - d_{TV}(P_1, P_2) \right). \end{aligned}$$

Finally, the result follows by noting that the distributions $P_i \in \mathcal{P}_i$ were chosen arbitrarily, and hence we can take a supremum over all such $P_i$.


Application to Active Learning in Bandits

We first describe the problem of active learning in $K$-armed bandits. A multi-armed bandit (MAB) problem (with $K$ arms) consists of $K$ distributions $(P_1, \ldots, P_K)$ which can be individually sampled by an agent. In the problem of active learning in bandits, given a total sampling budget of $n$, the goal of the agent is to allocate these $n$ samples among the $K$ distributions in order to learn their means uniformly well. More specifically, suppose the agent allocates $T_i \geq 1$ samples to distribution $P_i$, with $\sum_{i=1}^K T_i = n$, and constructs the empirical estimate $\hat{\mu}_i(T_i) = \frac{1}{T_i}\sum_{j=1}^{T_i} X_{i,j}$ of the mean $\mu_i$ of $P_i$, where $X_{i,1}, \ldots, X_{i,T_i}$ are i.i.d. samples from $P_i$. Then the goal is to find the allocation $(T_1, \ldots, T_K)$ which solves

$$\min_{T_1, \ldots, T_K:\, \sum_{i=1}^K T_i = n}\ \max_{1 \leq i \leq K}\ \mathbb{E}\left[ \left|\hat{\mu}_i(T_i) - \mu_i\right|^2\right] = \min_{T_1, \ldots, T_K:\, \sum_{i=1}^K T_i = n}\ \underbrace{\left( \max_{1 \leq i \leq K}\ \frac{\sigma_i^2}{T_i} \right)}_{\stackrel{\text{def}}{=}\, \mathcal{L}(T_1, \ldots, T_K)},$$
where $\sigma_i^2$ denotes the variance of $P_i$. Allowing $T_i$ to take real values, the optimal allocation for this problem is given by $T_i^* = \frac{\sigma_i^2 n}{\Sigma}$, where $\Sigma = \sum_{i=1}^K \sigma_i^2$. Clearly, the optimal allocation $(T_1^*, \ldots, T_K^*)$ depends on the variances of the distributions $(P_i)_{i=1}^K$, which are unknown to the agent; the task is to design an adaptive sampling strategy which appropriately addresses this explore-exploit dilemma and ends up with an allocation $(T_1, \ldots, T_K)$ that is close to the optimal one.
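As a concrete illustration of the oracle allocation (a small sketch with made-up variances, not a description of any of the adaptive algorithms discussed below), the following snippet computes $T_i^* = \sigma_i^2 n / \Sigma$ and compares its loss $\mathcal{L}(T^*) = \Sigma/n$ with that of the uniform allocation $T_i = n/K$:

```python
import numpy as np

# Oracle allocation T_i* = sigma_i^2 * n / Sigma for the loss L(T) = max_i sigma_i^2 / T_i,
# compared with the uniform allocation T_i = n / K. The variances are made up for illustration.

sigma_sq = np.array([0.25, 0.10, 0.01])   # arm variances (assumed known by the oracle)
n = 1000                                   # total sampling budget
K = len(sigma_sq)

def loss(T):
    """L(T_1, ..., T_K) = max_i sigma_i^2 / T_i."""
    return np.max(sigma_sq / T)

T_opt = sigma_sq * n / sigma_sq.sum()      # optimal (real-valued) allocation
T_unif = np.full(K, n / K)                 # uniform allocation for comparison

print("optimal allocation:", np.round(T_opt, 1), " loss =", loss(T_opt))   # equals Sigma / n
print("uniform allocation:", T_unif, " loss =", loss(T_unif))              # driven by the highest-variance arm
```

The optimal allocation equalizes $\sigma_i^2 / T_i$ across arms, which is why $\mathcal{L}(T^*) = \Sigma/n$, whereas the uniform allocation is dominated by the highest-variance arm.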

Lower bound construction

Consider two Bernoulli distributions $U \sim \text{Ber}(u)$ and $V \sim \text{Ber}(v)$ with $1/2 < v < u < 1$, and let $\mathcal{M} = (U, V)$ and $\mathcal{N} = (V, U)$ be two two-armed bandit problems. Suppose an allocation scheme $\mathcal{A}$ is applied to one of these two problems and results in an allocation $(T_1, T_2)$. Then we have the following result, which informally says that for $u, v$ close enough, no algorithm can perform well on both problems.

Proposition 2. We have the following:
$$\inf_{\mathcal{A}}\ \max_{\mathcal{M}, \mathcal{N}}\ \max_{i =1, 2}\ \mathbb{E}\left[ |T_i - T_i^*| \right] = \Omega (\sqrt{n}).$$

Proof of Proposition 2. Together with the MAB instance (i.e., either $\mathcal{M}$ or $\mathcal{N}$), the allocation scheme $\mathcal{A}$ induces a probability distribution on the space of action-observation sequences $(a_1, X_1, a_2, X_2, \ldots, a_n, X_n)$, where each action $a_t \in \{1, 2\}$ and each observation $X_t$ lies in $\{0, 1\}$ since the distributions $U$ and $V$ are Bernoulli. We will denote the two resulting probability distributions by $P_1$ and $P_2$, corresponding to the MABs $\mathcal{M}$ and $\mathcal{N}$ respectively.

We also introduce the notation $T_U^* = \frac{\sigma_U^2 n}{\sigma_U^2 + \sigma_V^2}$ and $T_V^* = \frac{\sigma_V^2 n}{\sigma_U^2 + \sigma_V^2}$, where $\sigma_U^2$ and $\sigma_V^2$ denote the variances of $U$ and $V$. Then under the MAB $\mathcal{M}$ the optimal allocation is $(T_1^*, T_2^*) = (T_U^*, T_V^*)$, while it is flipped for the MAB $\mathcal{N}$. Since we have assumed that $1/2 < v < u$ and $x \mapsto x(1-x)$ is decreasing on $[1/2, 1]$, we must have $\sigma_U^2 < \sigma_V^2$, which implies that $T_U^* < n/2 < T_V^*$.

To apply Lemma 1 to this problem, we introduce the following notations, keeping the algorithm $\mathcal{A}$ fixed.
* We choose $\mathcal{P}$ to be the set $\{P_1, P_2\}$, where $P_i$ for $i \in \{1,2\}$ were defined earlier. Since there are only two elements in $\mathcal{P}$, we trivially have $\mathcal{P}_i = \{P_i\}$ for $i \in \{1,2\}$.
* The parameter of interest is the optimal allocation, i.e., $\theta(P_1) = (T_U^*, T_V^*)$ and $\theta(P_2) = (T_V^*, T_U^*)$, while the 'estimator' is the allocation returned by $\mathcal{A}$, i.e., $\hat{\theta} = (T_1, T_2)$.
* The distance $d$ is the maximum coordinate-wise deviation, $d\left((a_1, a_2), (b_1, b_2)\right) = \max_{i=1,2} |a_i - b_i|$, so that $2\delta = d\left(\theta(P_1), \theta(P_2)\right) = T_V^* - T_U^*$.

Within this setting, a direct application of Lemma 1 gives us
$$\max_{\mathcal{M}, \mathcal{N}}\ \max_{i=1,2}\ \mathbb{E}[|T_i^* - T_i|] \geq \frac{\delta}{2} \left( 1 - d_{TV}(P_1, P_2) \right). \qquad (\star)$$

Next, we need to obtain a lower bound on $\delta$ and an upper bound on $d_{TV}(P_1, P_2)$.

Lower bound on $\delta$: First, we note that $\sigma_U^2 = u(1-u)$ and $\sigma_V^2 = v(1-v)$, and that $2\delta = n\, \frac{\sigma_V^2 - \sigma_U^2}{\sigma_V^2 + \sigma_U^2}$. With the notation $v = u - \epsilon$ and the fact that $1/2 < v < u < 1$, we can show that $\delta \geq n (u - 1/2)\epsilon$. By choosing $u = 3/4$, we get $\delta \geq \frac{n\epsilon}{4}$.
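For completeness, one way to arrive at this bound is the following calculation (using $v = u - \epsilon$ and $x(1-x) \leq 1/4$ for $x \in [0,1]$):
$$\begin{aligned} \sigma_V^2 - \sigma_U^2 &= v(1-v) - u(1-u) = (u-v)(u+v-1) = \epsilon\,(2u - 1 - \epsilon), \\ \sigma_U^2 + \sigma_V^2 &\leq \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2}, \end{aligned}$$
so that $2\delta = n\, \frac{\sigma_V^2 - \sigma_U^2}{\sigma_V^2 + \sigma_U^2} \geq 2n\epsilon\,(2u - 1 - \epsilon) \geq 2n\epsilon\,(u - \tfrac{1}{2})$, where the last step uses $\epsilon \leq u - \tfrac{1}{2}$, which holds because $v = u - \epsilon > \tfrac{1}{2}$.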

Upper bound on $d_{TV}(P_1, P_2)$: To bound $d_{TV}(P_1, P_2)$, we proceed in the following steps:
$$\begin{aligned} d_{TV}(P_1, P_2) & \stackrel{(i)}{\leq} \sqrt{\frac{D_{KL}(P_1, P_2)}{2}} \stackrel{(ii)}{=} \sqrt{ \frac{ \mathbb{E}[T_1]\, d_{KL}(u,v) + \mathbb{E}[T_2]\, d_{KL}(v,u)}{2} } \\ & \stackrel{(iii)}{\leq} 4(u-v) \sqrt{ \frac{ \mathbb{E}[T_1] + \mathbb{E}[T_2] }{6} } = \epsilon \sqrt{\frac{8n}{3} }. \end{aligned}$$
In the above display,
* (i) follows from an application of Pinsker's inequality,
* (ii) follows from the decomposition lemma for the KL-divergence in bandits (see Eq. (5) here), with the expectations taken under $P_1$, and
* (iii) uses the bound on the KL-divergence between Bernoulli distributions, $d_{KL}(u, v) \leq \frac{(u-v)^2}{v(1-v)}$; this step is spelled out below.
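Step (iii) can be spelled out as follows: since $u = 3/4$ and $1/2 < v < u$, both $v(1-v)$ and $u(1-u)$ are at least $3/16$, and hence
$$d_{KL}(u, v) \leq \frac{(u-v)^2}{v(1-v)} \leq \frac{16\,(u-v)^2}{3}, \qquad d_{KL}(v, u) \leq \frac{(u-v)^2}{u(1-u)} = \frac{16\,(u-v)^2}{3},$$
and therefore, using $\mathbb{E}[T_1] + \mathbb{E}[T_2] = n$,
$$\sqrt{\frac{\mathbb{E}[T_1]\, d_{KL}(u,v) + \mathbb{E}[T_2]\, d_{KL}(v,u)}{2}} \leq \sqrt{\frac{16\,(u-v)^2}{3} \cdot \frac{n}{2}} = 4(u-v)\sqrt{\frac{n}{6}} = \epsilon\sqrt{\frac{8n}{3}}.$$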

Thus, plugging these two inequalities back into the inequality $(\star)$ and choosing $\epsilon = \sqrt{ \frac{3}{32n} }$ gives us
$$\max_{\mathcal{M}, \mathcal{N}}\ \max_{i=1,2}\ \mathbb{E}[|T_i^* - T_i|] \geq \frac{n \epsilon}{8} \left(1 - \epsilon \sqrt{\frac{8n}{3}} \right) = \frac{1}{64} \sqrt{ \frac{3n}{2} } = \Omega ( \sqrt{n} ).$$

$\blacksquare$

Note that this $\Omega(\sqrt{n})$ lower bound on the deviation of $(T_i)_{i=1}^K$ from the optimal allocation $(T_i^*)_{i=1}^K$ complements the corresponding $\mathcal{O}\left( \sqrt{n \log n} \right)$ upper bound derived in Theorem 1 of Antos, A., Grover, V., and Szepesvári, C. (2008), 'Active learning in multi-armed bandits', ALT, for their GAFS-MAX algorithm. A similar upper bound was also obtained by Carpentier, A., Lazaric, A., Ghavamzadeh, M., Munos, R., and Auer, P. (2011), 'Upper-confidence-bound algorithms for active learning in multi-armed bandits', ALT, for their UCB-type algorithm. Thus our lower bound result demonstrates the near-optimality of these two existing algorithms.

As a corollary of the above proposition, we can obtain a lower bound on the excess loss of any allocation scheme.

Corollary. The minimax excess risk for active learning in 2-armed bandit problems satisfies the following:
$$\inf_{\mathcal{A}}\ \max_{\mathcal{M}, \mathcal{N}}\ \mathbb{E}\left[ \mathcal{L}(T_1, T_2) - \mathcal{L}(T_1^*, T_2^*) \right] = \Omega \left( n^{-3/2} \right).$$

Informally, the proof of the corollary uses the following idea: since the objective function $\mathcal{L}$ at the optimal allocation equals $\sigma_i^2/T_i^*$ (for every arm $i$), the increase in loss due to a suboptimal allocation $T_i$ is roughly of the order of $\left(\sigma_i^2/(T_i^*)^2\right) |T_i - T_i^*| = \left(\Sigma^2/(\sigma_i^2 n^2)\right)|T_i^* - T_i|$. The result then follows by using the $\Omega(\sqrt{n})$ lower bound on $\max_{i=1,2}\ |T_i - T_i^*|$ from the previous proposition.
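In slightly more detail, here is a sketch of that calculation (for a two-armed problem and for the arm $i$ that is under-sampled, i.e., $T_i < T_i^*$):
$$\mathcal{L}(T_1, T_2) - \mathcal{L}(T_1^*, T_2^*) \geq \frac{\sigma_i^2}{T_i} - \frac{\sigma_i^2}{T_i^*} = \frac{\sigma_i^2\,(T_i^* - T_i)}{T_i\, T_i^*} \geq \frac{\sigma_i^2}{(T_i^*)^2}\,(T_i^* - T_i) = \frac{\Sigma^2}{\sigma_i^2\, n^2}\,(T_i^* - T_i).$$
Since $T_1 + T_2 = T_1^* + T_2^* = n$, the two deviations have equal magnitude, $|T_1 - T_1^*| = |T_2 - T_2^*|$, so the under-sampled arm satisfies $T_i^* - T_i = \Omega(\sqrt{n})$ by Proposition 2; combined with $T_i^* = \Theta(n)$, this gives the claimed $\Omega(n^{-3/2})$ excess risk.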