Summary: We introduce Le Cam's method for obtaining minimax lower bounds for statistical estimation problems, which proceeds by relating the probability of error of a binary hypothesis testing problem to the total-variation distance between the two distributions. Then, as a novel application, we use this technique to derive a lower bound on the excess risk for the problem of active learning in bandits. This establishes the near-optimality of existing algorithms due to Antos et al. (2008) and Carpentier et al. (2011).
Consider the following setting:
* $\mathcal{P}$ denotes a class of probability distributions,
* $(\mathcal{D}, d)$ is a metric space (note that we only require $d$ to be non-negative, symmetric and to satisfy the triangle inequality for the lemma to work; in fact, as noted in (Yu, 1997), even the requirement of the triangle inequality can be waived, and the following 'weak' triangle inequality suffices: for some $A \in (0,1)$ and for any $x, y, z \in \mathcal{D}$, we have $d(x,y) + d(y,z) \ge A\, d(x,z)$),
* $\theta : \mathcal{P} \to \mathcal{D}$ represents some $\mathcal{D}$-valued functional,
* $\mathcal{D}_1, \mathcal{D}_2$ represent two disjoint, $2\delta$-separated subsets of $\mathcal{D}$, for some $\delta > 0$, and
* $\mathcal{P}_i = \theta^{-1}(\mathcal{D}_i)$ for $i = 1, 2$ are non-empty disjoint subsets of $\mathcal{P}$.
Suppose we are given a sample $X \sim P$ taking values in some set $\mathcal{X}$ for $P \in \mathcal{P}$, and let $\hat{\theta} : \mathcal{X} \to \mathcal{D}$ denote an estimator of $\theta(P)$. Then for any estimator $\hat{\theta}$ we can obtain the following lower bound on the maximum expected error:
Lemma 1 (Le Cam). With the definitions introduced above, we have
$$\sup_{P \in \mathcal{P}} \mathbb{E}_P\big[d(\hat{\theta}, \theta(P))\big] \;\ge\; \frac{\delta}{2} \sup_{P_1 \in \mathcal{P}_1,\, P_2 \in \mathcal{P}_2} \big(1 - d_{TV}(P_1, P_2)\big).$$
In the above display, $d_{TV}$ denotes the total variation distance between two distributions, defined as $d_{TV}(P, Q) = \sup_{E}\big(P(E) - Q(E)\big)$, where the supremum is over all measurable sets $E$. The result above implies that the minimax lower bound for a particular estimation problem is large if there exist two distributions $P_1$ and $P_2$ which (i) are 'well-separated' in terms of the metric $d$, and (ii) are statistically 'close' in terms of $d_{TV}(\cdot, \cdot)$. Due to their opposing nature, obtaining the best lower bound requires finding the right balance between these two requirements.
The statement of the above result and its proof are based on the statement and proof of Lemma 1 in (Yu, 1997): Yu, B. (1997). 'Assouad, Fano, and Le Cam.' In Festschrift for Lucien Le Cam.
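To see this trade-off concretely, here is a minimal numerical sketch (an illustration of ours, not part of the original argument): for estimating the mean of a Bernoulli distribution from $n$ i.i.d. samples with the metric $d(x, y) = |x - y|$, we take the two-point family $P_1 = \mathrm{Ber}(1/2)^{\otimes n}$ and $P_2 = \mathrm{Ber}(1/2 + 2\delta)^{\otimes n}$ (so that their means are $2\delta$-separated) and evaluate the resulting lower bound for a few values of $\delta$.

```python
from itertools import product

def tv_product_bernoulli(p, q, n):
    """Exact TV distance between Ber(p)^n and Ber(q)^n, by enumerating all 2^n outcomes."""
    tv = 0.0
    for x in product([0, 1], repeat=n):
        k = sum(x)
        tv += abs(p**k * (1 - p)**(n - k) - q**k * (1 - q)**(n - k))
    return 0.5 * tv

n = 10
for delta in [0.01, 0.05, 0.1, 0.15, 0.2, 0.25]:
    p1, p2 = 0.5, 0.5 + 2 * delta          # the two means are 2*delta apart
    tv = tv_product_bernoulli(p1, p2, n)
    bound = 0.5 * delta * (1 - tv)         # Le Cam bound on sup_P E|mu_hat - mu(P)|
    print(f"delta={delta:.2f}  TV={tv:.3f}  lower bound={bound:.4f}")
```

For small $\delta$ the two distributions are nearly indistinguishable but the separation term is tiny, while for large $\delta$ the total variation approaches one; the bound is maximized somewhere in between.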
Proof of the Lemma
First, we select arbitrary $P_i \in \mathcal{P}_i$ for $i = 1, 2$ and lower-bound the supremum over all $P \in \mathcal{P}$ by a simple average over these two distributions:
$$\sup_{P \in \mathcal{P}} \mathbb{E}_P\big[d(\hat{\theta}, \theta(P))\big] \;\ge\; \frac{1}{2}\Big(\mathbb{E}_{P_1}\big[d(\hat{\theta}, \theta(P_1))\big] + \mathbb{E}_{P_2}\big[d(\hat{\theta}, \theta(P_2))\big]\Big)$$
Next, with the notation $\theta_i = \theta(P_i)$, we observe the following (it is only at this point that we use the fact that $d$ satisfies the triangle inequality): for any $\theta \in \mathcal{D}$ such that $d(\theta, \theta_1) < \delta$, we must have $d(\theta, \theta_2) > \delta$. Similarly, if $d(\theta, \theta_2) < \delta$ then $d(\theta, \theta_1) > \delta$. Both these results follow from the fact that $d$ satisfies the triangle inequality and that $d(\theta_1, \theta_2) \ge 2\delta$. Now, we define the set $E = \{x \in \mathcal{X} : d(\hat{\theta}(x), \theta_1) < d(\hat{\theta}(x), \theta_2)\}$. Clearly, if $X \in E$ then $d(\hat{\theta}(X), \theta_2) \ge \delta$, and if $X \in E^c$ then $d(\hat{\theta}(X), \theta_1) \ge \delta$. Together, these results imply that $d(\hat{\theta}, \theta_1) \ge \delta\, \mathbf{1}_{E^c}$ and $d(\hat{\theta}, \theta_2) \ge \delta\, \mathbf{1}_{E}$, where $\mathbf{1}_{E}$ denotes the indicator function associated with a set $E$. Thus we have
$$\begin{aligned} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[d(\hat{\theta}, \theta(P))\big] &\ge \frac{1}{2}\Big(\mathbb{E}_{P_1}\big[d(\hat{\theta}, \theta(P_1))\big] + \mathbb{E}_{P_2}\big[d(\hat{\theta}, \theta(P_2))\big]\Big) \\ &\ge \frac{1}{2}\Big(\mathbb{E}_{P_1}\big[\delta\, \mathbf{1}_{E^c}\big] + \mathbb{E}_{P_2}\big[\delta\, \mathbf{1}_{E}\big]\Big) = \frac{\delta}{2}\Big(P_1(E^c) + P_2(E)\Big) \\ &= \frac{\delta}{2}\Big(1 - \big(P_1(E) - P_2(E)\big)\Big) \\ &\ge \frac{\delta}{2}\Big(1 - \sup_{E' \subset \mathcal{X}}\big(P_1(E') - P_2(E')\big)\Big) = \frac{\delta}{2}\big(1 - d_{TV}(P_1, P_2)\big) \end{aligned}$$
Finally, the result follows by noting that the distributions $P_i \in \mathcal{P}_i$ were chosen arbitrarily, and hence we can take a supremum over all such $P_i$.
Application to Active Learning in Bandits
We first describe the problem of active learning in $K$-armed bandits. A multi-armed bandit (MAB) problem (with $K$ arms) consists of $K$ distributions $(P_1, \ldots, P_K)$ which can be individually sampled by an agent. In the problem of active learning in bandits, given a total sampling budget of $n$, the goal of the agent is to allocate these $n$ samples among the $K$ distributions in order to learn their means uniformly well. More specifically, suppose the agent allocates $T_i \ge 1$ samples to distribution $P_i$, with $\sum_{i=1}^{K} T_i = n$, and constructs the empirical estimate of the mean of $P_i$, $\hat{\mu}_i(T_i) = \frac{1}{T_i}\sum_{j=1}^{T_i} X_{i,j}$. Then the goal is to find the allocation $(T_1, \ldots, T_K)$ which solves
$$\min_{T_1, \ldots, T_K:\ \sum_{i=1}^{K} T_i = n}\ \max_{1 \le i \le K}\ \mathbb{E}\big[\lvert \hat{\mu}_i(T_i) - \mu_i \rvert^2\big] \;=\; \min_{T_1, \ldots, T_K:\ \sum_{i=1}^{K} T_i = n}\ \underbrace{\Big(\max_{1 \le i \le K} \frac{\sigma_i^2}{T_i}\Big)}_{\overset{\text{def}}{=}\; L(T_1, \ldots, T_K)}$$
Allowing $T_i$ to take real values, the optimal allocation for this problem is given by $T_i^* = \frac{\sigma_i^2}{\Sigma}\, n$, where $\Sigma = \sum_{i=1}^{K} \sigma_i^2$. Clearly, the optimal allocation $(T_1^*, \ldots, T_K^*)$ depends on the variances of the distributions $(P_i)_{i=1}^{K}$, which are unknown to the agent, and the task is to design an adaptive sampling strategy which appropriately addresses this explore-exploit dilemma and ends up with an allocation $(T_1, \ldots, T_K)$ that is close to the optimal one.
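As a quick illustration of the objective and the oracle allocation (a minimal sketch of ours, with made-up variances), the following snippet computes $T_i^* = n\sigma_i^2/\Sigma$ and compares the loss $L(T^*) = \Sigma/n$ with that of the naive uniform allocation:

```python
import numpy as np

def oracle_allocation(sigma_sq, n):
    """Optimal real-valued allocation: T_i* = n * sigma_i^2 / sum_j sigma_j^2."""
    sigma_sq = np.asarray(sigma_sq, dtype=float)
    return n * sigma_sq / sigma_sq.sum()

def loss(sigma_sq, T):
    """Objective L(T_1, ..., T_K) = max_i sigma_i^2 / T_i."""
    return np.max(np.asarray(sigma_sq, dtype=float) / np.asarray(T, dtype=float))

sigma_sq = [0.25, 0.10, 0.01]                  # hypothetical arm variances (unknown to the agent)
n = 1000
T_star = oracle_allocation(sigma_sq, n)
T_unif = np.full(len(sigma_sq), n / len(sigma_sq))
print("oracle allocation:", T_star)            # samples proportional to the variances
print("L(oracle)  =", loss(sigma_sq, T_star))   # equals (sum_i sigma_i^2) / n
print("L(uniform) =", loss(sigma_sq, T_unif))   # equals K * max_i sigma_i^2 / n
```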
Lower bound construction
Consider two Bernoulli distributions $U \sim \mathrm{Ber}(u)$ and $V \sim \mathrm{Ber}(v)$ with $1/2 < v < u < 1$, and let $M = (U, V)$ and $N = (V, U)$ be two two-armed bandit problems. Suppose an allocation scheme $\mathcal{A}$ is applied to one of these two problems and results in an allocation $(T_1, T_2)$. Then we have the following result, which informally says that for $u, v$ close enough, no algorithm can perform well on both problems.
Proposition 2. We have the following: $$\inf_{\mathcal{A}}\ \max_{M, N}\ \max_{i = 1, 2}\ \mathbb{E}\big[\lvert T_i - T_i^* \rvert\big] = \Omega(\sqrt{n}).$$
Proof of Proposition 2. Together with the MAB instance (i.e., either $M$ or $N$), the allocation scheme $\mathcal{A}$ induces a probability distribution on the space of action-observation sequences $(a_1, X_1, a_2, X_2, \ldots, a_n, X_n)$, where each action $a_t \in \{1, 2\}$ and the observations $X_t$ lie in $\{0, 1\}$ since the distributions $U$ and $V$ are Bernoulli. We will denote the two resulting probability distributions by $P_1$ and $P_2$, corresponding to the MABs $M$ and $N$ respectively.
We also introduce the notations $T_U^* = \frac{\sigma_U^2}{\sigma_U^2 + \sigma_V^2}\, n$ and $T_V^* = \frac{\sigma_V^2}{\sigma_U^2 + \sigma_V^2}\, n$. Then under the MAB $M$ the optimal allocations are $(T_1^*, T_2^*) = (T_U^*, T_V^*)$, while they are flipped for the MAB $N$. Since we have assumed that $1/2 < v < u < 1$ and $x \mapsto x(1 - x)$ is decreasing on $(1/2, 1)$, we must have $\sigma_U^2 < \sigma_V^2$, which implies that $T_U^* < n/2 < T_V^*$.
To apply Lemma 1 to this problem, we introduce the following notation, keeping the algorithm $\mathcal{A}$ fixed.
* We choose $\mathcal{P}$ to be the set $\{P_1, P_2\}$, where $P_i$ for $i \in \{1, 2\}$ were defined earlier. Since there are only two elements in $\mathcal{P}$, we trivially have $\mathcal{P}_i = \{P_i\}$ for $i \in \{1, 2\}$.
* We define $\theta : \mathcal{P} \to \mathbb{R}$ as the mapping from $P \in \mathcal{P}$ to the corresponding optimal allocation of the first arm, $T_1^*$. That is, $\theta(P_1) = T_U^*$ and $\theta(P_2) = T_V^*$. The metric $d$ is chosen as $d(t_1, t_2) = |t_1 - t_2|$.
* The estimate $\hat{\theta}$ is the allocation value $T_1$ resulting from the scheme $\mathcal{A}$. Note that since $T_1 + T_2 = n = T_1^* + T_2^*$, we have $|T_1^* - T_1| = |T_2^* - T_2|$. Therefore, we always have $\mathbb{E}[|T_1 - T_1^*|] = \max_{i=1,2} \mathbb{E}[|T_i - T_i^*|]$.
* Finally, we introduce the notation $\delta = |T_U^* - T_V^*| / 2$.
Within this setting, a direct application of Lemma 1 gives us $$\max_{M, N}\ \max_{i=1,2}\ \mathbb{E}\big[\lvert T_i^* - T_i \rvert\big] \;\ge\; \frac{\delta}{2}\,\big(1 - d_{TV}(P_1, P_2)\big). \qquad (\star)$$
Next, we need to obtain a lower bound on $\delta$ and an upper bound on $d_{TV}(P_1, P_2)$.
Lower bound on $\delta$: First we note that $\sigma_U^2 = u(1 - u)$, $\sigma_V^2 = v(1 - v)$, and $2\delta = n\, \frac{\sigma_V^2 - \sigma_U^2}{\sigma_V^2 + \sigma_U^2}$. Writing $v = u - \epsilon$, we have $\sigma_V^2 - \sigma_U^2 = \epsilon(2u - 1 - \epsilon)$ and, from $1/2 < v < u < 1$, also $\sigma_U^2 + \sigma_V^2 \le 1/2$; together these show that $\delta \ge n(u - 1/2)\epsilon$ whenever $\epsilon \le u - 1/2$. By choosing $u = 3/4$, we get $\delta \ge \frac{n\epsilon}{4}$.
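A quick numerical sanity check of this bound (our own, with a hypothetical budget $n$ and $u = 3/4$):

```python
n, u = 1000, 0.75
for eps in [0.25, 0.1, 0.01, 0.001]:
    v = u - eps
    var_u, var_v = u * (1 - u), v * (1 - v)
    delta = 0.5 * n * (var_v - var_u) / (var_v + var_u)   # delta = |T_U* - T_V*| / 2
    print(f"eps={eps:<6}  delta={delta:9.3f}  n*eps/4={n * eps / 4:9.3f}")
```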
Upper bound on $d_{TV}(P_1, P_2)$: To bound $d_{TV}(P_1, P_2)$ we proceed in the following steps:
$$d_{TV}(P_1, P_2) \overset{(i)}{\le} \sqrt{\frac{D_{KL}(P_1, P_2)}{2}} \overset{(ii)}{=} \sqrt{\frac{\mathbb{E}[T_1]\, d_{KL}(u, v) + \mathbb{E}[T_2]\, d_{KL}(v, u)}{2}} \overset{(iii)}{\le} 4(u - v)\sqrt{\frac{\mathbb{E}[T_1] + \mathbb{E}[T_2]}{6}} = \epsilon\, \sqrt{\frac{8n}{3}}$$
In the above display,
* (i) follows from an application of Pinsker's inequality,
* (ii) follows from the decomposition lemma for the KL-divergence for bandits (see Eq. (5) here), and
* (iii) uses the bound on the KL-divergence between Bernoulli random variables, $d_{KL}(u, v) \le \frac{(u - v)^2}{v(1 - v)}$, together with the fact that $v(1 - v)$ and $u(1 - u)$ are both at least $3/16$ when $u = 3/4$ and $1/2 < v < u$.
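The following snippet (ours; the budget $n$ and the split of pulls are hypothetical) numerically verifies the Bernoulli KL bound used in step (iii), and that the resulting Pinsker bound stays below $\epsilon\sqrt{8n/3}$ for any split of the $n$ pulls between the two arms:

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence d_KL(Ber(p) || Ber(q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

n, u = 1000, 0.75
eps = np.sqrt(3 / (32 * n))
v = u - eps

# Step (iii): the Bernoulli KL divergence is bounded by (u - v)^2 / (v(1 - v)) (and with u, v swapped).
assert kl_bernoulli(u, v) <= (u - v) ** 2 / (v * (1 - v))
assert kl_bernoulli(v, u) <= (v - u) ** 2 / (u * (1 - u))

# Steps (i)-(iii) combined: whatever the split E[T_1] + E[T_2] = n, the Pinsker bound on d_TV
# is at most eps * sqrt(8 n / 3).
for t1 in [0, n // 4, n // 2, n]:
    t2 = n - t1
    pinsker = np.sqrt((t1 * kl_bernoulli(u, v) + t2 * kl_bernoulli(v, u)) / 2)
    assert pinsker <= eps * np.sqrt(8 * n / 3)
    print(f"E[T1]={t1:4d}  Pinsker bound = {pinsker:.4f}  <=  eps*sqrt(8n/3) = {eps * np.sqrt(8 * n / 3):.4f}")
```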
Thus, plugging these two inequalities back into the inequality $(\star)$ and choosing $\epsilon = \sqrt{\frac{3}{32 n}}$ gives us
$$\max_{M, N}\ \max_{i=1,2}\ \mathbb{E}\big[\lvert T_i^* - T_i \rvert\big] \;\ge\; \frac{n \epsilon}{8}\left(1 - \epsilon\sqrt{\frac{8n}{3}}\right) = \frac{1}{64}\sqrt{\frac{3n}{2}} = \Omega(\sqrt{n})$$
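The choice of $\epsilon$ above maximizes the right-hand side; a short numerical check (ours, for a hypothetical $n$):

```python
import numpy as np

n = 1000
c = np.sqrt(8 * n / 3)                       # the d_TV upper bound is eps * c
eps_grid = np.linspace(1e-5, 1 / c, 2000)    # only eps with 1 - eps * c >= 0 are useful
bound = (n * eps_grid / 8) * (1 - eps_grid * c)

print("best eps on grid:", eps_grid[np.argmax(bound)])
print("claimed eps     :", np.sqrt(3 / (32 * n)))
print("max bound       :", bound.max())
print("claimed value   :", np.sqrt(3 * n / 2) / 64)
```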
■
Note that this $\Omega(\sqrt{n})$ lower bound on the deviation of $(T_i)_{i=1}^{K}$ from the optimal allocation $(T_i^*)_{i=1}^{K}$ complements the corresponding $O(\sqrt{n \log n})$ upper bound derived in Theorem 1 of Antos et al. (2008) (Antos, A., Grover, V., and Szepesvári, C. 'Active learning in multi-armed bandits.' ALT, 2008) for their UCB-type algorithm. Thus our lower bound result demonstrates the near-optimality of the existing algorithms of Antos et al. (2008) and Carpentier et al. (2011).
As a corollary of the above proposition, we can obtain a lower bound on the excess loss of any allocation scheme.
Corollary. The minimax excess risk for active learning in the case of 2-armed bandit problems satisfies the following: $$\inf_{\mathcal{A}}\ \max_{M, N}\ \mathbb{E}\big[L(T_1, T_2) - L(T_1^*, T_2^*)\big] = \Omega\big(n^{-3/2}\big).$$
Informally, the proof of the corollary uses the following idea: at the optimal allocation, the objective function $L$ equals $\sigma_i^2 / T_i^* = \Sigma / n$ for every arm $i$, so the deviation from this value caused by a suboptimal allocation $T_i$ is roughly of the order of $\big(\sigma_i^2 / (T_i^*)^2\big)\, |T_i - T_i^*| = \big(\Sigma^2 / (\sigma_i^2 n^2)\big)\, |T_i^* - T_i|$. The result then follows by using the $\Omega(\sqrt{n})$ lower bound on $\max_{i=1,2} |T_i - T_i^*|$ from the previous proposition.
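To make the sensitivity argument concrete, here is a small numerical check (ours, with hypothetical variances and budget) comparing the actual excess loss of a perturbed allocation against the first-order estimate $\big(\Sigma^2 / (\sigma_i^2 n^2)\big)\,|T_i^* - T_i|$:

```python
import numpy as np

def loss(var, T):
    """L(T_1, T_2) = max_i sigma_i^2 / T_i."""
    return np.max(np.asarray(var) / np.asarray(T))

var = np.array([0.1875, 0.2275])            # hypothetical (sigma_U^2, sigma_V^2)
n = 10_000
Sigma = var.sum()
T_star = n * var / Sigma                    # oracle allocation

dev = 50.0                                  # a deviation |T_1 - T_1^*| of order sqrt(n)
T = T_star + np.array([dev, -dev])          # arm 2 is now under-allocated
excess = loss(var, T) - loss(var, T_star)
first_order = Sigma ** 2 / (var[1] * n ** 2) * dev
print(f"excess loss        = {excess:.3e}")
print(f"first-order approx = {first_order:.3e}")
```

With a deviation of order $\sqrt{n}$ and a sensitivity of order $n^{-2}$, the excess loss is indeed of order $n^{-3/2}$, matching the corollary.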