Regret bound for mixture method

Shubhanshu Shekhar

Summary: Derivation of an upper bound on the regret of the mixture method (KT scheme) for individual sequence prediction over finite alphabets.


Suppose $\mathcal{X} = \{1, \ldots, m\}$ for some integer $m \geq 2$, and let $\Theta = \Delta_m$ denote the $(m-1)$-dimensional probability simplex. We will use $p_\theta$ to denote the distribution on $\mathcal{X}$ parametrized by $\theta \in \Theta$.

Consider the following game.


For $t = 1, 2, \ldots, n$:

- the player selects a conditional distribution $q(\cdot \mid x^{t-1})$ on $\mathcal{X}$;
- the next symbol $x_t \in \mathcal{X}$ is revealed;
- the player incurs the log-loss $-\log q(x_t \mid x^{t-1})$.


The goal of the player is to select a distribution $q_n$ whose cumulative loss is small when compared to the best distribution (with hindsight) among the family $\{p_{\theta}^n: \theta \in \Theta\}$. More formally, the goal is to sequentially select a distribution $q_n$ on $\mathcal{X}^n$ with small worst-case regret:
$$ r_n(q_n) = \max_{x^n \in \mathcal{X}^n} \; \max_{\theta \in \Theta} \; \log \left( \frac{p_{\theta}^n(x^n)}{q_n(x^n)} \right) = \max_{x^n \in \mathcal{X}^n} \; \log \left( \frac{p_{\hat{\theta}}^n(x^n)}{q_n(x^n)} \right), $$

where $\hat{\theta}$ is the maximum likelihood estimator and $x^n = (x_1, \ldots, x_n)$.

The optimum worst-case regret, or the minimax regret, can then be defined as
$$ r_n = \min_{q_n} \; \max_{x^n \in \mathcal{X}^n} \; \log \left( \frac{p_{\hat{\theta}}^n(x^n)}{q_n(x^n)} \right). $$

Mixture Method

In this note, we will analyze the regret of the mixture distribution that assigns probability $\int_{\Theta} p_{\theta}^n(x^n)\, w(\theta)\, d\theta$ to $x^n$, where $w$ is a prior distribution on $\Theta$. In particular, we will consider the mixture distribution with the $\texttt{Dirichlet}(1/2, \ldots, 1/2)$ prior, and denote it by $m_J(\cdot)$. Here the subscript indicates that $w(\theta)$ is the Jeffreys prior in this case.

With this choice of prior, the mixture distribution as well as the conditional distributions can be written in closed form. For any $t \geq 1$ and $x^t \in \mathcal{X}^t$, let $T_{i,t} = \sum_{j=1}^t 1_{\{x_j=i\}}$ denote the number of occurrences of $i$ in $x^t$. Then we can write $m_J(x^n)$ as follows:
$$ m_J(x^n) = \frac{D_m \left( T_{1,n} + 1/2, \ldots, T_{m,n} + 1/2 \right)}{D_m\left(1/2, \ldots, 1/2 \right)}, \qquad \text{where} \quad D_m(\alpha_1, \ldots, \alpha_m) = \frac{\prod_{i=1}^m \Gamma(\alpha_i)}{\Gamma\left(\sum_{i'=1}^m \alpha_{i'}\right)}, $$
and $\Gamma(\cdot)$ is the Gamma function. This gives the following closed-form expression for the conditional distribution:
$$ m_{J}(x_{t+1} = i \mid x^{t}) = \frac{T_{i,t} + 1/2}{t + m/2}, $$
which can be used to sequentially implement $m_J(x^n)$ using the identity $m_J(x^n) = \prod_{t=1}^n m_J(x_t \mid x^{t-1})$.
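
The sequential implementation can be illustrated with a short Python sketch (not from the original note; the $0$-indexed alphabet and the test sequence are arbitrary choices). It computes $\log m_J(x^n)$ both from the closed-form ratio of $D_m$ terms and as a product of the conditionals above; the two values agree up to floating-point error.

```python
# Minimal sketch of the Dirichlet(1/2, ..., 1/2) mixture m_J over the alphabet
# {0, ..., m-1}, computed two ways:
# (i)  the closed-form ratio of D_m terms, via log-Gamma functions;
# (ii) the sequential product of conditionals (T_{i,t} + 1/2) / (t + m/2).
import math

def log_mixture_closed_form(xs, m):
    counts = [0] * m
    for x in xs:
        counts[x] += 1
    def log_D(alphas):  # log D_m(alpha_1, ..., alpha_m)
        return sum(math.lgamma(a) for a in alphas) - math.lgamma(sum(alphas))
    return log_D([c + 0.5 for c in counts]) - log_D([0.5] * m)

def log_mixture_sequential(xs, m):
    counts = [0] * m
    log_prob = 0.0
    for t, x in enumerate(xs):
        log_prob += math.log((counts[x] + 0.5) / (t + m / 2))
        counts[x] += 1
    return log_prob

xs = [0, 2, 1, 1, 0, 2, 2, 1, 0, 0]    # an arbitrary sequence over m = 3 symbols
print(log_mixture_closed_form(xs, 3))   # the two printed values agree
print(log_mixture_sequential(xs, 3))
```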

We can now state the main result, which gives a bound on the regret of $m_J$ for any sequence $x^n$.

Theorem: Suppose $C_m = \log (D_m(1/2, \ldots, 1/2))$ and $T_{\min} = \min_{i} T_{i,n}$. Then, for any $x^n \in \mathcal{X}^n$, we have the following bound:
$$ \log \left( \frac{p_{\hat{\theta}}^n(x^n)}{m_J(x^n)} \right) \leq \frac{m-1}{2} \log \left( \frac{n}{2\pi} \right) + C_m + \frac{m}{4 T_{\min} + 2} + o(1). $$

In particular, the above bound implies that if the relative frequencies of the symbols in $x^n$ lie strictly inside the simplex (i.e., $\lim_{n \to \infty} T_{\min} = \infty$), then the regret incurred by $m_J$ is $\frac{m-1}{2}\log(n/2\pi) + C_m + o(1)$. As proved in Theorem 2 of Xie and Barron (2000), this matches the minimax regret $r_n$ asymptotically.
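
For example, in the binary case $m = 2$ the constant can be evaluated explicitly:
$$ C_2 = \log\left(\frac{\Gamma(1/2)^2}{\Gamma(1)}\right) = \log \pi, $$
so for such sequences the regret of $m_J$ is $\frac{1}{2}\log(n/2\pi) + \log \pi + o(1) = \frac{1}{2}\log(\pi n/2) + o(1)$, recovering the familiar $\frac{1}{2}\log n + O(1)$ regret of the binary KT scheme.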

Before proceeding to the proof, we first discuss the implication of the above result for the problem of betting in horse races. Here, horse races model betting games in which, in every round, exactly one of $m$ possible outcomes occurs (i.e., only one out of $m$ horses wins the race).

Connections to betting

Consider $n$ rounds of horse races, each with $m$ horses numbered $\{1, \ldots, m\}$. Let $X^n = (X_1, \ldots, X_n)$ denote the indices of the winning horses in these $n$ rounds. Suppose a gambler begins with an initial wealth of $1$ dollar, and bets a $q(i \mid x^{t-1})$ fraction of his wealth on horse $i$ to win race $t$, at odds of $O(i \mid x^{t-1})$-for-$1$. (Odds of $a$-for-$1$ on horse $i$ mean the following: the gambler pays $1$ dollar before the race to bet on horse $i$; if horse $i$ wins, he gets back $a$ dollars, while if horse $i$ loses, he receives nothing.) Then his total wealth after $n$ races is
$$ S_n(q, X^n) = \prod_{t=1}^n q(X_t \mid X^{t-1})\, O(X_t \mid X^{t-1}) = q(X^n)\, O(X^n). $$

Now, if the winning indices $X_1, \ldots, X_n$ were generated i.i.d. according to a distribution $p_{\theta}$, then the betting strategy that maximizes the expected growth rate of the wealth process is $q^* = p_{\theta}$; this is known as Kelly betting, or the proportional betting strategy. To see this, note that
$$ \begin{aligned} q^* & \in \operatorname{argmax}_{q} \; \mathbb{E}_{p_{\theta}} \left[ \log \left( O(X)\, q(X) \right) \right] \\ & = \operatorname{argmax}_{q} \; \mathbb{E}_{p_{\theta}} \left[ \log\left( \frac{q(X)}{p_{\theta}(X)} \right) + \log \left( p_{\theta}(X)\, O(X) \right) \right] \\ & = \operatorname{argmax}_{q} \; \left\{ \mathbb{E}_{p_{\theta}} \left[ \log \left( p_{\theta}(X)\, O(X) \right) \right] - D_{KL}(p_{\theta} \,\|\, q) \right\} \\ & = \operatorname{argmin}_{q} \; D_{KL}(p_{\theta} \,\|\, q), \end{aligned} $$
which implies that $q^* = p_{\theta}$ due to the non-negativity of the KL divergence.
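
The conclusion $q^* = p_\theta$ can also be checked numerically. The sketch below (illustrative only; the win probabilities and odds are made-up values) grid-searches over betting distributions $q$ for $m = 3$ horses and confirms that the expected log-growth $\mathbb{E}_{p_\theta}[\log(q(X) O(X))]$ is maximized at $q = p_\theta$, regardless of the odds.

```python
# Numerical check that proportional (Kelly) betting maximizes expected log-growth.
# The probabilities p and the odds below are made-up illustrative values.
import math

p = [0.5, 0.3, 0.2]        # assumed race-winning probabilities
odds = [2.0, 3.0, 6.0]     # assumed odds (a-for-1), arbitrary

def expected_log_growth(q):
    return sum(pi * math.log(qi * oi) for pi, qi, oi in zip(p, q, odds))

# Search over a grid of distributions q on 3 outcomes (step 0.01).
best_q, best_val = None, -math.inf
step = 0.01
for a in range(1, 100):
    for b in range(1, 100 - a):
        q = [a * step, b * step, 1.0 - (a + b) * step]
        val = expected_log_growth(q)
        if val > best_val:
            best_q, best_val = q, val

print("grid maximizer:", best_q)                      # approximately [0.5, 0.3, 0.2]
print("E[log growth] at q = p:", expected_log_growth(p))
print("best value on the grid:", best_val)
```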

Now, suppose we don't make any probabilistic assumptions on the process generating $X^n$. We can still define the best constant betting strategy (in hindsight) for a sequence $x^n$ as the one which maximizes the wealth:
$$ \max_{\theta}\, p_{\theta}^n(x^n)\, O(x^n) = p_{\hat{\theta}}^n(x^n)\, O(x^n). $$

If the gambler follows some strategy $q_n$ for betting, then the ratio of the wealth gained by the two policies (i.e., $p_{\hat{\theta}}^n$ and $q_n$) is given by
$$ \frac{S_n(p_{\hat{\theta}}^n, x^n)}{S_n(q_n, x^n)} = \frac{p_{\hat{\theta}}^n(x^n)}{q_n(x^n)}, $$
independent of the odds offered. Hence, the problem of finding the distribution $q_n$ that minimizes this ratio for the worst-case sequence $x^n \in \mathcal{X}^n$ reduces to the sequential prediction problem considered above. In particular, the regret bound for $m_J$ stated earlier implies the following:

$$ \log \left( \frac{S_n(p_{\hat{\theta}}^n, x^n)}{S_n(m_J, x^n)} \right) \leq \frac{m-1}{2} \log(n/2\pi) + C_m + o(1), $$
assuming that the relative frequencies of $x^n$ lie in the interior of the simplex. This further gives us the following bound:
$$ S_n(m_J, x^n) \geq S_n \left( p_{\hat{\theta}}^n, x^n \right) \left( \frac{2\pi}{n} \right)^{(m-1)/2} e^{-C_m - o(1)}. $$
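
As an illustration of the odds cancelling in the wealth ratio, here is a small sketch (random winning sequence and made-up odds, chosen only for demonstration) that computes the wealth of the mixture bettor $m_J$ and of the best constant bettor in hindsight for two independent draws of the odds; the log wealth ratio is the same in both cases, up to floating-point error.

```python
# Illustrative sketch: the ratio of the hindsight bettor's wealth to the mixture
# bettor's wealth does not depend on the odds offered, since O(x^n) cancels.
import math
import random

def log_wealths(xs, odds, m):
    """Return (log S_n for the KT mixture bettor, log S_n for the hindsight bettor)."""
    n = len(xs)
    totals = [0] * m
    for x in xs:
        totals[x] += 1
    counts = [0] * m
    log_mix, log_hind = 0.0, 0.0
    for t, x in enumerate(xs):
        log_mix += math.log((counts[x] + 0.5) / (t + m / 2) * odds[t][x])
        log_hind += math.log(totals[x] / n * odds[t][x])
        counts[x] += 1
    return log_mix, log_hind

random.seed(0)
m, n = 3, 100
xs = [random.randrange(m) for _ in range(n)]           # winning horses
for trial in range(2):                                  # two independent draws of the odds
    odds = [[random.uniform(1.5, 5.0) for _ in range(m)] for _ in range(n)]
    log_mix, log_hind = log_wealths(xs, odds, m)
    print(f"log wealth ratio (trial {trial}): {log_hind - log_mix:.6f}")
```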

Proof of the Regret Bound

Throughout this section, we will use $T_i$ instead of $T_{i,n}$. To prove the result, we first note the following:

$$ \log \left( \frac{p_{\hat{\theta}}^n(x^n)}{m_J(x^n)} \right) = \log\left( p_{\hat{\theta}}^n(x^n) \right) + C_m - \log \left( D_m\left(T_1+\tfrac{1}{2}, \ldots, T_m+\tfrac{1}{2}\right) \right). \qquad (1) $$

Since $\hat{\theta} = \left( T_1/n, \ldots, T_m/n \right)$, we have
$$ \log(p_{\hat{\theta}}^n(x^n)) = \sum_{i=1}^m T_i \log(T_i/n) = -n \log n + \sum_{i} T_i \log(T_i). \qquad (2) $$

For the remaining term, we use Stirling's approximation $\Gamma(x) = x^{x-1/2} e^{-x} \sqrt{2\pi}\, e^{s_x}$ with $s_x \in (0, 1/(12x))$. Plugging this into the expression for $A = D_m(T_1+1/2, \ldots, T_m+1/2)$, we get

$$ A = \frac{\prod_{i=1}^m \Gamma(T_i + 1/2)}{\Gamma(n+m/2)} = \frac{\prod_{i} \left( (T_i+1/2)^{T_i}\, e^{-(T_i+1/2)}\, \sqrt{2\pi}\, e^{s_i} \right)}{(n+m/2)^{n+(m-1)/2}\, e^{-(n+m/2)}\, \sqrt{2\pi}\, e^{s_n}}, $$
where $s_i \in (0, 1/(12(T_i+1/2)))$ and $s_n \in (0, 1/(12(n+m/2)))$. Since $\sum_i (T_i + 1/2) = n + m/2$, the exponential factors cancel, and simplifying further we get
$$ A = \frac{\left( \prod_{i} (T_i+1/2)^{T_i} \right) (2\pi)^{(m-1)/2}\, e^{\sum_i s_i - s_n}}{(n+m/2)^{n+(m-1)/2}}, $$
which implies

$$ \begin{aligned} \log(1/A) &= \left( n + \frac{m-1}{2} \right) \log(n+m/2) - \frac{m-1}{2} \log(2\pi) \\ &\qquad + \left(s_n - \sum_i s_i\right) - \sum_i T_i \log(T_i + 1/2). \end{aligned} \qquad (3) $$

Plugging equations (2) and (3) into (1), we get the following:

$$ \log \left( \frac{p_{\hat{\theta}}^n(x^n)}{m_J(x^n)} \right) = \frac{m-1}{2} \log(n/2\pi) + C_m + \texttt{Rem}_n, $$

where the remainder term is defined as

$$ \begin{aligned} \texttt{Rem}_n &= -\sum_{i=1}^{m} T_i \log \left( 1 + \frac{1}{2T_i} \right) + n \log\left(1+\frac{m}{2n}\right) \\ & \quad + \frac{m-1}{2} \log \left( 1 + \frac{m}{2n} \right) + \left(s_n - \sum_{i=1}^m s_i\right). \end{aligned} \qquad (4) $$

To complete the proof, we will obtain upper bounds on the terms on the RHS of (4). First, we observe that
$$ s_n - \sum_{i=1}^m s_i < s_n < \frac{1}{12(n+m/2)} < \frac{1}{12n + 6}. $$

Next, we use the following inequality, valid for any $x > 0$,
$$ \frac{1}{2} - \frac{1}{4(x+1/2)} \;\leq\; x \log \left(1 + \frac{1}{2x} \right) \;\leq\; \frac{1}{2}, $$
to bound the remaining terms in (4) as follows:

$$ \begin{aligned} n \log\left( 1 + \frac{m}{2n} \right) & \leq \frac{m}{2}, \\ \frac{m-1}{2} \log\left( 1 + \frac{m}{2n} \right) & \leq \frac{m(m-1)}{4n} \leq \frac{m^2}{4n}, \\ -\sum_{i=1}^{m} T_i \log \left( 1 + \frac{1}{2T_i}\right) & \leq -\frac{m}{2} + \frac{m}{4(T_{\min} + 1/2)}. \end{aligned} $$

Combining the inequalities in the above display, we get
$$ \begin{aligned} \texttt{Rem}_n &\leq \frac{1}{12n + 6} + \frac{m^2}{4n} + \frac{m}{4T_{\min} + 2} \\ & = \frac{m}{4T_{\min} + 2} + o(1). \end{aligned} $$
This completes the proof.
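
As a final sanity check (not part of the proof), the bound can be verified numerically. The sketch below, with arbitrarily chosen $m$, $n$, and random sequences, computes the exact regret $\log(p_{\hat{\theta}}^n(x^n)/m_J(x^n))$ via log-Gamma functions and compares it with the main terms of the bound.

```python
# Numerical check of the regret bound on random sequences (illustrative sketch).
# The printed regret should not exceed the printed bound by more than the o(1)
# slack terms, which here are at most m^2/(4n) + 1/(12n + 6).
import math
import random

def regret_and_bound(xs, m):
    n = len(xs)
    counts = [0] * m
    for x in xs:
        counts[x] += 1
    log_p_hat = sum(c * math.log(c / n) for c in counts if c > 0)
    C_m = m * math.lgamma(0.5) - math.lgamma(m / 2)    # log D_m(1/2, ..., 1/2)
    log_m_J = sum(math.lgamma(c + 0.5) for c in counts) - math.lgamma(n + m / 2) - C_m
    T_min = min(counts)
    bound = (m - 1) / 2 * math.log(n / (2 * math.pi)) + C_m + m / (4 * T_min + 2)
    return log_p_hat - log_m_J, bound

random.seed(1)
m, n = 4, 5000
for _ in range(5):
    xs = [random.randrange(m) for _ in range(n)]
    regret, bound = regret_and_bound(xs, m)
    print(f"regret = {regret:.4f}, bound (without o(1) terms) = {bound:.4f}")
```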