Summary: Derivation of an upper bound on the regret of the mixture method (the Krichevsky–Trofimov (KT) scheme) for individual-sequence prediction over finite alphabets.
Suppose $\mathcal{X}=\{1,\dots,m\}$ for some integer $m\ge 2$, and let $\Theta=\Delta_m$ denote the $(m-1)$-dimensional probability simplex. For any $\theta\in\Theta$, we write $p_\theta$ for the corresponding distribution on $\mathcal{X}$.
Consider the following game.
For $t=1,2,\dots,n$:
- Player predicts a distribution $q^n(\cdot \mid x^{t-1})$ on $\mathcal{X}$.
- Nature reveals an arbitrary element $x_t \in \mathcal{X}$.
- Player incurs a loss of $-\log q^n(x_t \mid x^{t-1})$.
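As a concrete illustration, here is a minimal Python sketch of this protocol (the function names and the uniform strategy are mine, chosen for illustration only):

```python
import math

def play_game(predict, xs, m):
    """Run the sequential prediction game: `predict` maps the observed
    history to a distribution over {0, ..., m-1}; returns cumulative log loss."""
    total_loss = 0.0
    for t in range(len(xs)):
        q = predict(xs[:t])                # player predicts before seeing x_t
        total_loss += -math.log(q[xs[t]])  # loss is -log q(x_t | x^{t-1})
    return total_loss

# A uniform predictor pays exactly log(m) per round, whatever Nature plays.
uniform = lambda history: [1.0 / 3] * 3
loss = play_game(uniform, [0, 2, 1, 1], m=3)
```

Note that Nature may choose `xs` adversarially; the strategies analyzed below aim to do well against every sequence, not just random ones.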
The goal of the player is to select a distribution $q^n$ whose cumulative loss is small when compared to the best distribution (in hindsight) among the family $\{p_\theta^n : \theta\in\Theta\}$. More formally, the goal is to sequentially select a distribution $q^n$ on $\mathcal{X}^n$ with small worst-case regret:
$$r_n(q^n) = \max_{x^n\in\mathcal{X}^n}\,\max_{\theta\in\Theta}\, \log\frac{p_\theta^n(x^n)}{q^n(x^n)} = \max_{x^n\in\mathcal{X}^n}\, \log\frac{p_{\hat\theta}^n(x^n)}{q^n(x^n)},$$
where $\hat\theta$ is the maximum likelihood estimator and $x^n=(x_1,\dots,x_n)$.
The optimum worst-case regret, or the minimax regret, can then be defined as
$$r_n = \min_{q^n}\,\max_{x^n\in\mathcal{X}^n}\, \log\frac{p_{\hat\theta}^n(x^n)}{q^n(x^n)}.$$
Mixture Method
In this note, we will analyze the regret of the mixture distribution that assigns probability $\int_\Theta p_\theta^n(x^n)\, w(\theta)\,d\theta$ to $x^n$, for some prior distribution $w$ on $\Theta$. In particular, we will consider the mixture with the $\mathrm{Dirichlet}(1/2,\dots,1/2)$ prior, and denote it by $m_J(\cdot)$. Here the subscript indicates that $w(\theta)$ is the Jeffreys prior in this case.
With this choice of prior, the mixture distribution as well as its conditional distributions can be written in closed form. For any $t\ge 1$ and $x^t\in\mathcal{X}^t$, let $T_{i,t}=\sum_{j=1}^{t}\mathbf{1}\{x_j=i\}$ denote the number of occurrences of $i$ in $x^t$. Then we can write $m_J(x^n)$ as follows:
$$m_J(x^n) = \frac{D_m(T_{1,n}+1/2,\dots,T_{m,n}+1/2)}{D_m(1/2,\dots,1/2)}, \quad\text{where}\quad D_m(\alpha_1,\dots,\alpha_m) = \frac{\prod_{i=1}^m \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i'=1}^m \alpha_{i'}\right)},$$
and $\Gamma(\cdot)$ is the Gamma function. This gives the following closed-form expression for the conditional distribution:
$$m_J(x_{t+1}=i \mid x^t) = \frac{T_{i,t}+1/2}{t+m/2},$$
which can be used to implement $m_J(x^n)$ sequentially via the identity $m_J(x^n)=\prod_{t=1}^n m_J(x_t\mid x^{t-1})$.
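To make the sequential implementation concrete, here is a short Python sketch (function names are mine) that computes $\log m_J(x^n)$ both ways, via the conditionals and via the closed form, using `math.lgamma` for $\log\Gamma$:

```python
import math

def log_dm(alphas):
    # log D_m(a_1,...,a_m) = sum_i log Gamma(a_i) - log Gamma(sum_i a_i)
    return sum(math.lgamma(a) for a in alphas) - math.lgamma(sum(alphas))

def log_mj_closed_form(xs, m):
    """log m_J(x^n) = log D_m(T_1+1/2,...,T_m+1/2) - log D_m(1/2,...,1/2)."""
    counts = [0] * m
    for x in xs:
        counts[x] += 1
    return log_dm([t + 0.5 for t in counts]) - log_dm([0.5] * m)

def log_mj_sequential(xs, m):
    """Accumulate log m_J(x_t | x^{t-1}) = log((T_{x_t,t-1}+1/2)/(t-1+m/2))."""
    counts = [0] * m
    total = 0.0
    for t, x in enumerate(xs):  # t symbols observed so far
        total += math.log((counts[x] + 0.5) / (t + m / 2))
        counts[x] += 1
    return total
```

The two computations agree because the telescoping product of the conditionals reproduces the ratio of $D_m$ terms exactly.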
We can now state the main result, which gives us a bound on the regret of mJ for any sequence xn.
Theorem: Suppose $C_m = \log D_m(1/2,\dots,1/2)$ and $T_{\min} = \min_i T_{i,n}$. Then, for any $x^n\in\mathcal{X}^n$, we have the following bound:
$$\log\frac{p_{\hat\theta}^n(x^n)}{m_J(x^n)} \le \frac{m-1}{2}\log\frac{n}{2\pi} + C_m + \frac{m}{4T_{\min}+2} + o(1).$$
In particular, the above bound implies that if the relative frequencies of the symbols in $x^n$ stay strictly inside the simplex (so that $\lim_{n\to\infty} T_{\min}=\infty$), then the regret incurred by $m_J$ is $\frac{m-1}{2}\log\frac{n}{2\pi} + C_m + o(1)$. As proved in Theorem 2 of Xie and Barron (2000), this equals the minimax regret $r_n$ asymptotically.
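As a quick numerical sanity check of these asymptotics (a sketch; helper names are mine), one can compute the exact regret of $m_J$ on a balanced binary sequence and compare it against $\frac{m-1}{2}\log\frac{n}{2\pi} + C_m$:

```python
import math

def log_dm(alphas):
    return sum(math.lgamma(a) for a in alphas) - math.lgamma(sum(alphas))

def kt_regret(counts):
    """log(p_hat(x^n) / m_J(x^n)), computed from the count vector of x^n."""
    n, m = sum(counts), len(counts)
    log_p_hat = sum(t * math.log(t / n) for t in counts if t > 0)
    log_mj = log_dm([t + 0.5 for t in counts]) - log_dm([0.5] * m)
    return log_p_hat - log_mj

# Balanced binary sequence: T_1 = T_2 = n/2, so T_min grows with n.
n, m = 1000, 2
regret = kt_regret([n // 2, n // 2])
# C_2 = log D_2(1/2, 1/2) = log(Gamma(1/2)^2 / Gamma(1)) = log(pi)
asymptote = (m - 1) / 2 * math.log(n / (2 * math.pi)) + math.log(math.pi)
```

Already at $n=1000$ the exact regret and the asymptotic expression agree to several decimal places.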
Before proceeding to the proof, we first discuss the implications of the above result for the problem of betting in horse races. Here, horse races model betting games in which, in every round, exactly one of $m$ possible outcomes occurs (i.e., exactly one out of $m$ horses wins the race).
Connections to betting
Consider $n$ rounds of horse races, each with $m$ horses numbered $1,\dots,m$. Let $X^n=(X_1,\dots,X_n)$ denote the indices of the winning horses in these $n$ rounds. Suppose a gambler begins with an initial wealth of 1 dollar, and bets a proportion $q(i\mid x^{t-1})$ of his wealth on horse $i$ to win race $t$, at odds of $O(i\mid x^{t-1})$-for-1. (Odds of $a$-for-1 on horse $i$ mean the following: the gambler pays 1 dollar before the race to bet on horse $i$; if horse $i$ wins, he gets back $a$ dollars, while if horse $i$ loses, he receives nothing.) Then his total wealth after $n$ races is
$$S_n(q, X^n) = \prod_{t=1}^{n} q(X_t\mid X^{t-1})\, O(X_t\mid X^{t-1}) = q(X^n)\, O(X^n).$$
Now, if the winning indices $X_1,\dots,X_n$ were generated i.i.d. according to a distribution $p_\theta$, then the betting strategy that maximizes the expected growth rate of the wealth process is $q^* = p_\theta$. This is known as the Kelly (or proportional) betting strategy. Indeed,
$$q^* \in \operatorname*{arg\,max}_q\; \mathbb{E}_{p_\theta}\!\left[\log\big(q(X)\,O(X)\big)\right] = \operatorname*{arg\,max}_q\; \mathbb{E}_{p_\theta}\!\left[\log\frac{q(X)}{p_\theta(X)} + \log\big(p_\theta(X)\,O(X)\big)\right] = \operatorname*{arg\,max}_q\; \Big\{\mathbb{E}_{p_\theta}\!\left[\log\big(p_\theta(X)\,O(X)\big)\right] - D_{\mathrm{KL}}(p_\theta \,\|\, q)\Big\} = \operatorname*{arg\,min}_q\; D_{\mathrm{KL}}(p_\theta \,\|\, q),$$
where the last step uses that the expectation term does not depend on $q$; this implies $q^*=p_\theta$ due to the non-negativity of the KL divergence.
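The optimality of proportional betting can also be checked numerically; the sketch below (the probabilities and odds are illustrative values of my choosing) compares the expected log growth rate of $q = p_\theta$ against some alternative bets:

```python
import math

def expected_growth(p, q, odds):
    # E_p[log(q(X) O(X))]: expected log wealth multiplier per race
    return sum(pi * math.log(qi * oi) for pi, qi, oi in zip(p, q, odds))

p = [0.6, 0.4]       # (illustrative) true win probabilities
odds = [2.0, 2.0]    # 2-for-1 odds on each horse
kelly_rate = expected_growth(p, p, odds)  # Kelly: bet q = p
```

Any $q \neq p$ loses expected growth at rate exactly $D_{\mathrm{KL}}(p\,\|\,q)$ relative to Kelly, regardless of the odds.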
Now, suppose we make no probabilistic assumptions on the process generating $X^n$. We can still define the best constant betting strategy (in hindsight) for a sequence $x^n$ as the one which maximizes the wealth:
$$\max_{\theta}\; p_\theta^n(x^n)\, O(x^n) = p_{\hat\theta}^n(x^n)\, O(x^n).$$
If the gambler follows some strategy $q^n$ for betting, then the ratio of the wealth gained by the two policies (i.e., $p_{\hat\theta}^n$ and $q^n$) is given by
$$\frac{S_n(p_{\hat\theta}^n, x^n)}{S_n(q^n, x^n)} = \frac{p_{\hat\theta}^n(x^n)}{q^n(x^n)},$$
independent of the odds offered. Hence, the problem of finding the distribution $q^n$ that minimizes this ratio in the worst case (i.e., for the worst sequence $x^n\in\mathcal{X}^n$) reduces to the sequential prediction problem considered above. In particular, the regret bound for $m_J$ stated earlier implies the following:
$$\log\frac{S_n(p_{\hat\theta}^n, x^n)}{S_n(m_J, x^n)} \le \frac{m-1}{2}\log\frac{n}{2\pi} + C_m + o(1),$$
assuming that the relative frequencies of $x^n$ lie in the interior of the simplex. This further gives us the following bound:
$$S_n(m_J, x^n) \ge S_n(p_{\hat\theta}^n, x^n)\left(\frac{2\pi}{n}\right)^{(m-1)/2} e^{-C_m - o(1)}.$$
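The claim that the wealth ratio does not depend on the odds can be verified directly; in the sketch below (names are mine), the gap $\log S_n(p_{\hat\theta}^n, x^n) - \log S_n(m_J, x^n)$ comes out identical under two different odds profiles, because $\log O(x^n)$ is common to both strategies:

```python
import math

def log_dm(alphas):
    return sum(math.lgamma(a) for a in alphas) - math.lgamma(sum(alphas))

def log_wealth(log_q, xs, odds):
    # log S_n(q, x^n) = log q(x^n) + sum_t log O(x_t)
    return log_q + sum(math.log(odds[x]) for x in xs)

xs, m = [0, 1, 1, 2, 0, 1], 3
n = len(xs)
counts = [xs.count(i) for i in range(m)]
log_p_hat = sum(t * math.log(t / n) for t in counts if t > 0)
log_mj = log_dm([t + 0.5 for t in counts]) - log_dm([0.5] * m)

gaps = []
for odds in ([3.0, 3.0, 3.0], [5.0, 1.5, 4.0]):  # two arbitrary odds profiles
    gaps.append(log_wealth(log_p_hat, xs, odds) - log_wealth(log_mj, xs, odds))
```

In both cases the gap equals $\log\big(p_{\hat\theta}^n(x^n)/m_J(x^n)\big)$, i.e., exactly the prediction regret.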
Proof of the Regret Bound
Throughout this section, we will write $T_i$ instead of $T_{i,n}$. To prove the result, we first note the following:
Since $\hat\theta = (T_1/n, \dots, T_m/n)$, we have
$$\log p_{\hat\theta}^n(x^n) = \sum_{i=1}^m T_i \log(T_i/n) = -n\log n + \sum_i T_i \log T_i. \tag{2}$$
For the remaining term, we use Stirling's approximation, $\Gamma(x) = x^{x-1/2}\, e^{-x}\, \sqrt{2\pi}\, e^{s_x}$ with $s_x\in(0, 1/(12x))$. Plugging this into the expression for $A = D_m(T_1+1/2,\dots,T_m+1/2)$, we get
$$A = \frac{\prod_{i=1}^m \Gamma(T_i+1/2)}{\Gamma(n+m/2)} = \frac{\prod_{i=1}^m \left( (T_i+1/2)^{T_i}\, e^{-T_i-1/2}\, \sqrt{2\pi}\, e^{s_i} \right)}{(n+m/2)^{n+(m-1)/2}\, e^{-n-m/2}\, \sqrt{2\pi}\, e^{s_n}},$$
where $s_i \in (0, 1/(12(T_i+1/2)))$ and $s_n \in (0, 1/(12(n+m/2)))$. Since the factors $e^{-n-m/2}$ in the numerator and denominator cancel, simplifying further we get
$$A = \frac{\left(\prod_{i=1}^m (T_i+1/2)^{T_i}\right)(2\pi)^{(m-1)/2}}{(n+m/2)^{n+(m-1)/2}}\; e^{\sum_i s_i - s_n},$$
which implies