# Lecture 5: Hypothesis Testing and Le Cam’s Two-point Method

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

In previous lectures we have seen the ideas of model reduction and its application in establishing the asymptotic lower bounds. However, in many scenarios we would like to find good non-asymptotic bounds, and as we show in the examples of Lecture 3, finding such reductions may be quite difficult. In this and the next few lectures, we will introduce the testing idea which transforms the target problem (usually the estimation problem) into a proper testing problem. Despite the conceptual simplicity, the testing problem may take various forms, some of which may be difficult to analyze. This will be the central topic of the next few lectures.

1. $f$-divergences and Inequalities

1.1. $f$-divergences

First we look at the possibly simplest testing problem: if $P$ and $Q$ are two probability measures on the same probability space, and a random sample $X$ is drawn from either $P$ or $Q$. A (binary) testing problem aims to find some test $\Psi(X)\in \{0,1\}$ which with high probability outputs $0$ when $X\sim P$, and outputs $1$ when $X\sim Q$. To quantify the performance of a certain test $\Psi$, we define the total error probability as follows:

$\text{err}(\Psi) = P(\Psi(X) \neq 0) + Q(\Psi(X)\neq 1).$

Intuitively, when $P$ and $Q$ are close to each other, we will expect that the total error probability for any test is large for they are hard to distinguish. Conversely, when $P$ and $Q$ are very far from each other (e.g., with a disjoint support), then there exists some test which can well distinguish them. The next lemma gives the expression of the minimum $\text{err}(\Psi)$.

Lemma 1 (Le Cam’s first lemma) Minimizing all measurable tests $\Psi$ gives

$\min_{\Psi} \text{\rm err}(\Psi) = 1 - \|P-Q\|_{\text{\rm TV}}.$

Recall that $\|P-Q\|_{\text{\rm TV}} = \frac{1}{2}\int |dP-dQ|$ is the total variation distance between $P$ and $Q$

Proof: Let $A = \{X: \Psi(X) = 0\}$, then

$\text{err}(\Psi) = 1 - P(A) + Q(A) \ge 1 - \sup_B |P(B) - Q(B)| = 1 - \|P-Q\|_{\text{\rm TV}},$

with equality if and only if $A = \{x: P(dx) \ge Q(dx) \}$. $\Box$

By Lemma 1, the total variation distance between $P$ and $Q$ exactly quantifies the minimum total error probability of distinguishing between two probability distributions. Similar to Lemma 1, if we impose different weights on the error probabilities, we will arrive at some divergences of the form $c\cdot \int |dP - w\cdot dQ|$, where $c>0$ is some constant, and $w>0$ is the weight on $Q(\Psi(X)\neq 1)$. Recall from convex analysis that for any convex function $f$ with compact support on ${\mathbb R}$, its second-order derivative $f''$ always exists in the sense of a positive Radon measure, and

$f(x) = a + bx + \frac{1}{2}\int |x-t| f''(dt) \ \ \ \ \ (1)$

for some $a,b\in {\mathbb R}$.

Remark 1 The decomposition (1) of convex functions has several applications, e.g., in the local time analysis of continuous martingales.

Motivated by (1), we can actually take a convex combination of the above divergences and arrive at the general class of $f$-divergences.

Definition 2 ($f$-divergence) For a convex function $f$ on $[0,\infty)$ with $f(1)=0$, the $f$-divergence between probability measures $P$ and $Q$ on $(\Omega,\mathcal{F})$ is defined as

$D_f(P, Q) = \int f\left(\frac{dP^\parallel}{dQ}\right) dQ + f^\star(0) (1- P^\parallel(\Omega) ),$

where $P^\parallel$ is the absolute continuous part of $P$ w.r.t. $Q$, and $f^\star(0) := \lim_{t\rightarrow\infty} f(t)/t$

In most practical scenarios, we have $P\ll Q$, and therefore $D_f(P\|Q) = \mathop{\mathbb E}_Q f(dP/dQ)$. Some well-known examples of $f$-divergences are as follows.

Example 1

1. $f(x) = |x-1|$: total variation distance $\|P-Q\|_{\text{\rm TV}}$

2. $f(x)=(\sqrt{x}-1)^2$: Hellinger distance $H^2(P,Q) = \int (\sqrt{dP}-\sqrt{dQ})^2$

3. $f(x)=-\log x$: Kullback–Leibler (KL) divergence $D_{\text{\rm KL}}(P\|Q) = \int dP\log\frac{dP}{dQ}$

4. $f(x) = x^2-1$: $\chi^2$-divergence $\chi^2(P,Q) = \int\frac{(dP-dQ)^2}{dQ}$

The following properties also hold for any $f$-divergences.

Lemma 3 $D_f(P,Q)\ge 0$ and is joint convex in $(P,Q)$. In particular, $D_f(\mathsf{K}P, \mathsf{K}Q)\le D_f(P,Q)$ for any transition kernel $\mathsf{K}$

Remark 2 The last inequality is known as the data processing inequality.

Proof: The non-negativity follows from convexity and Jensen’s inequality. The joint convexity is due to that of the perspective transform $(x,y)\mapsto yf(x/y)$ for convex $f$. Let $Y$ be the output of $X$ through kernel $\mathsf{K}$, then the last inequality follows from the joint convexity and

$\mathop{\mathbb E}_{Q_{X|Y}}\left[ \frac{dP_{XY}}{dQ_{XY}} \right] = \frac{dP_Y}{dQ_Y}.$

We leave the details to the readers. $\Box$

1.2. Joint range

Lemma 1 involves the computation of the total variation distance, which may be hard to evaluate exactly for most statistical models, especially when there is a tensor power structure. As a comparison, Hellinger, KL and $\chi^2$-divergences tensorize better:

$\begin{array}{rcl} H^2(\otimes_i P_i, \otimes_i Q_i) &=& 2 - 2\prod_i \left(1-\frac{1}{2}H^2(P_i, Q_i) \right),\\ D_{\text{KL}}(\otimes_i P_i\| \otimes_i Q_i) &=& \sum_i D_{\text{KL}}(P_i \| Q_i), \\ \chi^2(\otimes_i P_i, \otimes_i Q_i) &=& \prod_i \left(1 + \chi^2( P_i, Q_i)\right) - 1. \end{array}$

Consequently, to apply Lemma 1 to proving lower bounds, we need upper bounds of the total variation distance in terms of Hellinger, KL or $\chi^2$. We have seen examples of such inequalities from Lecture 3, but have no clue whether the claimed inequalities are tight. To obtain the best inequalities between different $f$-divergences, we study the joint range of $f$-divergences.

For two convex functions $f,g$ on ${\mathbb R}_+$ with $f(1) = g(1)=0$, we call $(r,s)$ is in the joint range of $(D_f, D_g)$ if we can find two distributions $P,Q$ over some common probability space such that $D_f(P,Q)=r, D_g(P,Q)=s$. Hence, the joint range exactly characterizes the relationship between different $f$-divergences. The central result is as follows:

Theorem 4 Let $\mathcal{R}\subseteq {\mathbb R}^2$ be the joint range of $(D_f, D_g)$, and $\mathcal{R}_k\subseteq {\mathbb R}^2$ be the joint range of $(D_f, D_g)$ when both $P$ and $Q$ are discrete distributions supported on at most $k$ elements. Then

$\mathcal{R} = \text{\rm conv}(\mathcal{R}_2) = \mathcal{R}_4,$

where $\text{\rm conv}$ denotes the convex hull.

Theorem 4 provides a generic way to prove inequalities between different $f$-divergences. Specifically, it shows that to obtain the joint range, it suffices to choose $P = (p, 1-p)$ and $Q = (q,1-q)$, compute the pair $(D_f(P,Q), D_g(P,Q))$, vary $(p,q)\in [0,1]^2$ and then take the convex closure. The proof of Theorem 4 requires the following theorems in the convex analysis.

Theorem 5 (Choquet–Bishop–de Leeuw) Let $C$ be a metrizable convex compact subset of a locally convex topological vector space $V$. For any point $x_0$ in $C$, there exists a probability measure $\mu$ supported on extremal points of $C$ such that $x_0 = \int x\mu(dx)$

Theorem 6 (Fenchel–Eggleston–Carathéodory) Let $S\subseteq {\mathbb R}^d$ be connected and $C=\text{\rm conv}(S)$. For any $x\in C$, there exists $x_1,\cdots,x_d\in S$ such that $x\in \text{\rm conv}(x_1,\cdots,x_d)$

We are ready to prove the statement of Theorem 4.

Proof: For simplicity we always assume that $P\ll Q$, then $D_f(P,Q) = \mathop{\mathbb E}_Q f(dP/dQ)$. Viewing the likelihood ratio $dP/dQ$ as a random variable $X$, we conclude that

$\mathcal{R} = \{(\mathop{\mathbb E}[f(X)], \mathop{\mathbb E}[g(X)]): X\ge 0, \mathop{\mathbb E}[X] = 1 \}.$

Similarly, $\mathcal{R}_k$ has the same representation with the additional constraint that $X$ can take at most $k$ values. Let $C$ be the set of all probability measures which are supported on $[0,\infty)$ and have unit mean. By linearity of the expectation and the convexity of $C$ we conclude that $\mathcal{R}$ is convex. Moreover, since all extremal points of $C$ are supported on two elements, by Choquet–Bishop–de Leeuw theorem (note that any topological vector space is always locally convex under the weak-$\star$ topology, where the unit sphere is compact and metrizable) we conclude that the distribution of $X$ can be written as a mixture of distributions supported on only two points. Then by the linearity of expectation again, we conclude that $\mathcal{R} = \text{conv}(\mathcal{R}_2)$

For the second identity $\text{conv}(\mathcal{R}_2)= \mathcal{R}_4$, first note that $\mathcal{R}_2$ is a connected set for it is a continuous image of the divergence pair on $[0,1]^2$. Hence, by Fenchel–Eggleston–Carathéodory theorem, any point in $\text{conv}(\mathcal{R}_2)$ can be written as a convex combination of two points in $\mathcal{R}_2$, which gives rise to a random variable $X$ supported on at most $4$ elements. $\Box$

Based on Theorem 4, one may verify the following inequalities:

• TV vs. Hellinger: $\frac{H^2(P,Q)}{2} \le \|P-Q\|_{\text{TV}} \le \sqrt{H^2(P,Q)\left(1-\frac{H^2(P,Q)}{4}\right)}.$
Both inequalities are tight.
• TV vs. KL: Pinsker’s inequality gives $\|P-Q\|_{\text{TV}} \le \sqrt{D_{\text{KL}}(P\|Q)/2}.$
This bound is vacuous when $D_{\text{KL}}(P\|Q)\ge 2$, where better bounds are available: $\|P-Q\|_{\text{TV}} \le \sqrt{1-\exp(-D_{\text{KL}}(P\|Q))} \le 1 - \frac{1}{2}\exp(-D_{\text{KL}}(P\|Q)).$
None of these bounds are tight.
• Hellinger vs KL: $D_{\text{KL}}(P\|Q) \ge 2\log\frac{2}{2 - H^2(P,Q)}.$
This bound is tight.
• KL vs. $\chi^2$: $D_{\text{KL}}(P\|Q) \le \log(1+\chi^2(P,Q)).$
This bound is tight.

The above inequalities will be used quite often whenever we switch between convenient divergences.

2. Le Cam’s Two-point Method

One popular way to prove a non-asymptotic lower bound of statistical estimation is via hypothesis testing. Let $(P_\theta)_{\theta\in\Theta}$ be the statistical model of interest, and the loss function is $L(\theta,\hat{\theta}): \Theta \times \Theta\rightarrow {\mathbb R}_+$ (at the moment we assume that the target is to estimate $\theta$). The idea is to use the following weaker lower bound

$\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}) \ge \sup_{\theta_1,\theta_2\in\Theta} \inf_{\hat{\theta}} \sup_{\theta\in \{\theta_1,\theta_2\} } \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}).$

Suppose that we can find two possible candidates of $\theta$, say, $\theta_1$ and $\theta_2$, such that the following two conditions hold:

1. Large separation: the total loss $L(\theta_1, \hat{\theta}) + L(\theta_2, \hat{\theta})$ is always large for any $\hat{\theta}$
2. Indistinguishability: based on the observation, the observer cannot reliably distinguish between $P_{\theta_1}$ and $P_{\theta_2}$

then clearly the observer needs to pay a risk proportional to $\inf_{\hat{\theta}} (L(\theta_1, \hat{\theta}) + L(\theta_2, \hat{\theta}))$. The next theorem makes this intuition precise.

Theorem 7 (Two-point Method) Let $L: \Theta \times \mathcal{A} \rightarrow {\mathbb R}_+$ be any loss function, and $\theta_1,\theta_2\in \Theta$ satisfy that

$L(\theta_1, a) + L(\theta_2, a) \ge \Delta, \qquad \forall a\in \mathcal{A},$

then

$\inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta})\ge \frac{\Delta}{2}\cdot \left(1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}}\right).$

Proof: The desired result follows from the following chain of inequalities:

$\begin{array}{rcl} \inf_{\hat{\theta}} \sup_{\theta\in\Theta} \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}) &\ge & \inf_{\hat{\theta}} \sup_{\theta\in \{\theta_1,\theta_2\} } \mathop{\mathbb E}_\theta L(\theta,\hat{\theta}) \\ &\ge & \inf_{\hat{\theta}} \frac{\mathop{\mathbb E}_{\theta_1} L(\theta_1,\hat{\theta}) + \mathop{\mathbb E}_{\theta_2} L(\theta_2,\hat{\theta}) }{2} \\ &=&\frac{1}{2} \inf_{\hat{\theta}} \int L(\theta_1, \hat{\theta}(x))P_{\theta_1}(dx) + \int L(\theta_2, \hat{\theta}(x))P_{\theta_2}(dx) \\ &\ge & \frac{1}{2} \int \left[\inf_{a} ( L(\theta_1,a) + L(\theta_2,a) ) \right]\min\{P_{\theta_1}(dx), P_{\theta_2}(dx) \} \\ &\ge& \frac{\Delta}{2}\cdot \left(1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}}\right). \end{array}$$\Box$

Note that in Theorem 7, the parameter $\Delta$ characterizes the separation between the two points, and $1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}}$ is the minimum error probability of testing between $\theta = \theta_1$ and $\theta=\theta_2$ (cf. Lemma 1). Typically these two quantities depend negatively on each other: when one becomes larger, the other gets smaller. A rule-of-thumb is to find $(\theta_1, \theta_2)$ such that $1 - \|P_{\theta_1} - P_{\theta_2} \|_{\text{\rm TV}}$ is a constant, say $0.1$, and then search for the largest possible separation $\Delta$

Although the proof of Theorem 7 did not explicitly apply testing arguments, the test can be seen as follows. Let the test be $\Psi(X) = 1$ if $L(\theta_1, \hat{\theta}) < L(\theta_2, \hat{\theta})$, and $\Psi(X)=2$ otherwise, then the separability condition ensures that $\theta = \theta_\Psi$ for $\theta\in \{\theta_1,\theta_2\}$ as long as $L(\theta,\hat{\theta}) < \Delta/2$. Hence,

$L(\theta, \hat{\theta} ) \ge \frac{\Delta}{2}\cdot 1(\theta\neq \theta_\Psi).$

Replacing $L(\theta,\hat{\theta})$ with the above lower bound, and repeating the analysis in the proof of Theorem 7, a direct application of Lemma 1 gives the same result (with $\Delta/2$ replaced by $\Delta/4$).

The two-point idea is simple but appears everywhere in various lower bounds, for it establishes the indistinguishability between two different tasks due to the uncertainties. Also, despite the conceptual simplicity, searching over all possible pairs of points is typically computationally expensive, and it requires effort to understand the problem structure and figure out the “correct” two points. In later lectures, this idea will be generalized in the following directions:

• Multiple points instead of two-point;
• One or both of the two points may be a mixture of distributions, i.e., the binary test has point-mixture or mixture-mixture structures;
• The chioce of the two points may depend on the estimator $\hat{\theta}$

3. Examples of Two-point Methods

The remainder of this lecture is devoted to the applications of the two-point method in various problems.

3.1. Example I: Parameter Estimation

We begin with the simplest example of the Gaussian location model $P_\theta = \mathcal{N}(\theta,\sigma^2)$ for $\theta\in{\mathbb R}$, and our target is to estimate $\theta$. Since a natural guess is that local perturbation on $\theta$ makes the model indistinguishable, we choose $\theta_1 = 0$ and $\theta_2 = \delta$ in the two-point method. Let $R^\star$ be the minimax risk under the squared error loss, Theorem 7 gives

$R^\star \ge \frac{\delta^2}{2}\left(1 - \Phi\left( \frac{|\delta|}{2\sigma}\right)\right),$

where $\Phi(\cdot)$ is the CDF of $\mathcal{N}(0,1)$. Maximizing the above lower bound over $\delta\in {\mathbb R}$ gives $R^\star\ge 0.332\sigma^2$. Compared with $R^\star = \sigma^2$ derived in Lecture 4, the two-point bound loses a constant multiplicative factor.

Next we consider the bounded mean scenario where $|\theta|\le B$ for a known parameter $B$ in advance. Since the above $\delta$ is chosen to be $\Theta(\sigma)$ in the end, this becomes infeasible if the upper bound $B$ is much smaller than $\sigma$. Consequently, we can choose $\theta_1 = 0$ and $\theta_2 = \sigma \wedge B$, which gives $R^\star = \Omega(\sigma^2\wedge B^2)$. This lower bound is also tight, for one of the estimators $\hat{\theta} = 0$ or $\hat{\theta}=X$ attains it.

Consider another location model where one observes $X_1,\cdots,X_n\sim \mathsf{Unif}(\theta,\theta+1)$ and aims to estimate $\theta\in {\mathbb R}$. Again, choosing $\theta_1 = 0$ and $\theta_2 = \delta$, the minimax risk $R_n^\star$ is lower bounded by

$R_n^\star \ge \frac{\delta^2}{2} \left(1 - \|P_0^{\otimes n} - P_\delta^{\otimes n} \|_{\text{\rm TV}} \right).$

By the triangle inequality,

$\|P_0^{\otimes n} - P_\delta^{\otimes n} \|_{\text{\rm TV}} \le n\|P_0 - P_\delta\|_{\text{\rm TV}} = n\delta.$

Hence, choosing $\delta = 1/(2n)$ gives $R_n^\star \ge 1/(16n^2)$. The rate $\Theta(n^{-2})$ is also tight for the minimax risk, for the estimator $\hat{\theta} = \min_i X_i$ attains the rate.

Remark 3 The uniform location model is not QMD (see Lecture 4), thus it makes sense to have the correct minimax rate different from $\Theta(n^{-1})$. Moreover, the inequality $\|P_0^{\otimes n} - P_\delta^{\otimes n} \|_{\text{\rm TV}} \le n\|P_0 - P_\delta\|_{\text{\rm TV}}$ may not be tight for other models.

3.2. Example II: Functional Estimation

Recall the discrete entropy estimation problem, where we observe $n$ samples $X_1,\cdots,X_n$ from a discrete distribution $P=(p_1,\cdots,p_k)$, and the target is to estimate the Shannon entropy $H(P) = \sum_{i=1}^k -p_i\log p_i$. In Lecture 4 we proved an asymptotic minimax lower bound $(\log k)^2/n$ for the mean squared error, and here we show that the same bound holds non-asymptotically (even when $k$ grows with $n$).

The idea of using two-point method here is that when we perturb each entry in the uniform distribution by some small $\varepsilon/k$, the value of the functional $H(P)$ changes by an additive factor of $\Theta(\varepsilon\log k)$, and the KL divergence between these two cases is $\Theta(n\varepsilon^2)$. Hence, choosing $\varepsilon \asymp n^{-1/2}$ will complete the proof. However, some cautions should be taken when choosing the specific points:

• The two points cannot be $P_1 = (\frac{1}{k},\cdots,\frac{1}{k})$, and $P_2 = (\frac{1+\varepsilon}{k}, \frac{1-\varepsilon}{k},\cdots, \frac{1+\varepsilon}{k}, \frac{1-\varepsilon}{k})$. In this way, we have $| H(P_1) - H(P_2) | = \Theta(\varepsilon^2)$
• The two points cannot be $P_1 = (\frac{1}{k},\cdots,\frac{1}{k})$, and $P_2 = (\frac{1-\varepsilon}{k},\cdots,\frac{1-\varepsilon}{k},\frac{1+(k-1)\varepsilon}{k})$. In this way, we have $|H(P_1) - H(P_2)|=\Theta(\varepsilon\log k)$ if $\varepsilon = \Omega(k^{-1})$ and $|H(P_1) - H(P_2)|=\Theta(\varepsilon^2)$ otherwise. Hence, this choice fails for some parameters $(n,k)$

The final choice of the two-point is as follows:

$\begin{array}{rcl} P_1 &=& \left(\frac{1}{2(k-1)}, \cdots, \frac{1}{2(k-1)}, \frac{1}{2} \right), \\ P_2 &=& \left(\frac{1-\varepsilon}{2(k-1)},\cdots, \frac{1-\varepsilon}{2(k-1)}, \frac{1 + \varepsilon}{2} \right). \end{array}$

Simple algebra gives $|H(P_1) - H(P_2)| = \Theta(\varepsilon\log k)$, and

$D_{\text{KL}}(P_1^{\otimes n} \| P_2^{\otimes n}) = nD_{\text{KL}}(P_1 \| P_2) = \frac{n}{2}\log \frac{1}{1-\varepsilon^2} = \Theta(n\varepsilon^2).$

Now choosing $\varepsilon \asymp n^{-1/2}$ completes the proof of $R_{n,k}^\star = \Theta((\log k)^2/n)$. This example shows that the choice of the two points may sometimes be a little bit delicate.

3.3. Example III: Non-local Choice of Two Points

We return to the entropy estimation problem, but this time we choose different points to arrive at an entirely different lower bound. Note that all above choices of the two points are based on local perturbations, and the resulting bounds are essentially for the variance. This example will show that a more sophisticated construction can also take care of the bias part.

For observations $X^n\sim P^{\otimes n}$, let $T=T(X^n)$ be the sorted histogram of $X^n$, i.e., $T$ is the sorted version of $(h_1,\cdots,h_k)$ with $h_j = \sum_{i=1}^n 1(X_i = j)$. The next lemma shows that there exists a minimax estimator which depends only on $T$

Lemma 8 For any estimator $\hat{H}(X^n)$ of the entropy $H(P)$, there exists an estimator $\hat{H}^\star(T)$ such that

$\sup_{P\in \mathcal{M}_k} \mathop{\mathbb E}_P(\hat{H}^\star(T) - H(P))^2 \le \sup_{P\in \mathcal{M}_k} \mathop{\mathbb E}_P(\hat{H}(X^n) - H(P))^2.$

Proof: The proof relies on the symmetry of $H(P)$ in permutations of $P$. Let $\sigma\in \mathcal{S}_k$ be any permutation on $[k]$, clearly $H(P) = H(P\circ \sigma)$, and the distribution of $T$ under $P$ is the same as that under $P\circ \sigma$. Hence, let $\sigma(X^n) = (\sigma(X_1), \cdots, \sigma(X_n))$, the estimator

$\hat{H}^\star = \frac{1}{k!}\sum_{\sigma\in\mathcal{S}_k} \hat{H}(\sigma(X^n))$

depends only on $T$, and the desired inequality follows from Jensen’s inequality. $\Box$

Lemma 8 shows that the model can be reduced to $P^{\otimes n}\circ T^{-1}$, which by data processing inequality results in a smaller total variation distance between two points. Now the new construction of the two points is motivated by the following lemma.

Lemma 9 Let $P=(p_1,\cdots,p_k)$ and $Q=(q_1,\cdots,q_k)$ be two probability vectors, with $\max_{i\in [k]} \max\{p_i, q_i \}\le 1/(40n)$. Then

$\|P^{\otimes n}\circ T^{-1} - Q^{\otimes n}\circ T^{-1} \|_{\text{\rm TV}} \le \frac{1}{2} + 5\sum_{m=1}^\infty n^m\left|\sum_{i=1}^k (p_i^m - q_i^m) \right|.$

The proof of Lemma 9 is referred to the bibliographic notes. Lemma 9 shows that if the lower moments of $P$ and $Q$ are equal, the total variation distance between their tensor powers will be small. This fact motivates us to consider distributions $P$ and $Q$ which are $k/d$ repetitions of the vectors

$\frac{1}{k}(x_1,\cdots,x_d) \quad \text{ and }\quad \frac{1}{k}(y_1,\cdots,y_d),$

respectively, where $d>0$ is some fixed integer, and

1. $P$ and $Q$ are valid distributions: $\sum_{i=1}^d x_i = \sum_{i=1}^d y_i = d$
2. $P$ and $Q$ have matched moments: $\sum_{i=1}^d x_i^m = \sum_{i=1}^d y_i^m, m=0,1,\cdots,d-1$
3. $P$ and $Q$ have different entropies: $\sum_{i=1}^d x_i\log x_i\neq \sum_{i=1}^d y_i\log y_i$

The existence of such vectors $x,y$ is ensured by the following lemma.

Lemma 10 For any positive integer $d$, there exist vectors $x$ and $y$ satisfying the above properties.

Proof: Fix any $d$-dimensional vector $x$ with distinct positive entries. Consider

$p_\Delta(x) = (x-x_1)(x-x_2)\cdots(x-x_d) - \Delta,$

which for $\Delta>0$ small enough clearly has $d$ positive zeros. Let $y_1,\cdots,y_d$ be the zeros of $p_\Delta(x)$, then comparing the coefficients of $x,x^2,\cdots,x^{d-1}$ on both sides of the polynomial and invoking Newton’s identity, vectors $x$ and $y$ have matched moments up to $d-1$. The first property is ensured via simple normalizations, and it suffices to choose a suitable $\Delta$ to ensure the third property. $\Box$

Using the above two points, we have $|H(P) - H(Q)| = \Theta(1)$ and $\sum_{i=1}^k p_i^m = \sum_{i=1}^k q_i^m$ for any $m=1,\cdots,d-1$. Moreover, for $m\ge d$, we have

$\left|\sum_{i=1}^k (p_i^m - q_i^m) \right| = O\left( k^{1-m} \right).$

Hence, by Lemma 9,

$\|P^{\otimes n}\circ T^{-1} - Q^{\otimes n}\circ T^{-1} \|_{\text{\rm TV}} \le \frac{1}{2} + O\left( \sum_{m=d}^\infty n^m k^{1-m} \right) = \frac{1}{2} + O\left(\frac{k(n/k)^d}{1-(n/k)} \right),$

which is strictly smaller than one if $n = O(k^{1-1/d} )$. Note that the choice of the integer $d$ is arbitrary, we conclude that the minimax risk $R_{n,k}^\star$ is at least a positive constant whenever $n=O(k^{1-\varepsilon})$ for any $\varepsilon>0$. This lower bound says nothing about $R_{n,k}^\star$ when $n$ gets larger, but compared with the previous bound $R_{n,k}^\star = \Theta((\log k)^2/n)$, the new bound shows that $n$ actually needs to be much larger than $(\log k)^2$ for the consistent estimation.

In later lectures, it will be shown that the minimax risk of this problem is

$R_{n,k}^\star = \Theta\left(\left(\frac{k}{n\log n}\right)^2 + \frac{(\log k)^2}{n} \right).$

Hence, the current bound is slightly weaker than the first term, and the old bound is exactly the second term.

3.4. Example IV: Robust Statistics

In this section we look at the Gaussian mean estimation again, but now with outliers. Specifically, we consider the following Huber’s $\varepsilon$-contamination model:

$P_{(\theta,Q)} = (1-\varepsilon)\mathcal{N}(\theta, I_p) + \varepsilon Q,$

where $Q$ denotes the outlier distribution and may be arbitrary. The robust estimation problem is that, given $n$ i.i.d. observations from $P_\theta$, propose a robust estimator which minimizes the worst-case risk of estimating $\theta$, where the worst case is taken over both $\theta\in{\mathbb R}^p$ and the outlier distribution $Q$. Under the squared $L_2$ loss, let $R_{n,\varepsilon}^\star$ be the minimax risk with $n$ samples and outlier probability $\varepsilon$

The central claim of this section is that

$R_{n,\varepsilon}^\star = \Omega\left(\frac{p}{n} + \varepsilon^2 \right).$

This lower bound is tight for the Tukey median and some other estimators; we refer to the bibliographic notes. Since the $\Omega(p/n)$ lower bound even holds when $\varepsilon = 0$, we may restrict our attention to showing that $R_{n,\varepsilon}^\star = \Omega(\varepsilon^2)$

To apply the two-point method, we need to choose both $(\theta_1,\theta_2)$ and $(Q_1,Q_2)$, where the latter is more important in this robust setting. Since $Q_1$ and $Q_2$ can be arbitrary probability distributions, we actually have a large degree of freedom and can actually achieve perfect ambiguity here. This phenomenon is summarized in the following lemma.

Lemma 11 Let $P_1, P_2$ be any probability distributions with $\| P_1 - P_2 \|_{\text{\rm TV}}\le \varepsilon/(1-\varepsilon)$. Then there exist probability distributions $Q_1, Q_2$ such that

$(1-\varepsilon)P_1 +\varepsilon Q_1 = (1-\varepsilon)P_2 + \varepsilon Q_2.$

Proof: Consider the signed measure $Q=\frac{1-\varepsilon}{\varepsilon}(P_2 - P_1)$. Clearly $Q(\Omega)=0$, and by assumption $|Q|(\Omega) \le 2$, where $|Q|$ denotes the total variation measure of $Q$. Hence, by Jordan decomposition of signed measures, we have $Q = Q_1 - Q_2$, where measures $Q_1, Q_2$ are mutually singular and $Q_1(\Omega) = Q_2(\Omega)\le 1$. Finally, adding any common measure $\mu$ with $\mu(\Omega)=1-Q_1(\Omega)$ to both $Q_1, Q_2$ completes the proof. $\Box$

By Lemma 11, it suffices to find $(\theta_1, \theta_2)$ such that

$\| \mathcal{N}(\theta_1, I_p) - \mathcal{N}(\theta_2, I_p) \|_{\text{\rm TV}} = 2\Phi\left(\frac{\|\theta_1 - \theta_2\|_2}{2} \right) - 1 \le \frac{\varepsilon}{1-\varepsilon},$

and the incurred loss is $\Omega(\|\theta_1 - \theta_2\|_2^2)$. Hence, the lower bound $\Omega(\varepsilon^2)$ is completed by choosing $\|\theta_1 - \theta_2\|_2 \asymp \varepsilon$

3.5. Example V: Two-armed Bandits

In this section we study the simplest online learning problem, namely, the two-armed bandit. At each time $t\in [T]$, the learner can pull one of the two arms, and the reward of pulling arm $i=1,2$ follows a standard normal distribution $r_{t,i}\sim \mathcal{N}(\mu_i,1)$, where the mean parameters $\mu_1, \mu_2$ are unknown to the learner. A policy $\pi = (\pi_t)_{t\in [T]}$ is a sequence where $\pi_t\in \{1,2\}$ denotes the arm to pull at time $t$, where $\pi_t$ can depend on all past policies $(\pi_\tau)_{\tau\in [t-1]}$ and past rewards $(r_{\tau,\pi_\tau})_{\tau\in [t-1]}$. The regret of a policy $\pi$ is defined as the difference of the cumulative reward of the best arm and that of the policy, i.e.,

$R_T(\pi) := T\max\{\mu_1,\mu_2\} - \sum_{t=1}^T \mu_{\pi_t}.$

The target of the online learning is to devise a policy $\pi$ with a small expected regret, where the expectation is taken over the randomness of the rewards (and therefore the random realizations of the policy). A fundamental tradeoff in online learning is the exploration-exploitation tradeoff, where the learner needs to pull the suboptimal arm sufficiently many times to identify the best arm with better precision, and also needs to exploit the current knowledge to focus on the arm which seems to be the best.

In this section we aim to prove minimax lower bounds of the regrets in the two-armed bandits. First consider the case where $\mu_1, \mu_2$ can take any values in $[0,1]$. The two-point method suggests that we may find two worlds such that

1. there is no fixed policy whose regrets are small in both worlds;
2. these two worlds are hard to distinguish.

To do so, we may take $(\mu_1,\mu_2) = (\Delta,0)$ in the first world, and $(\mu_1,\mu_2) = (0,\Delta)$ in the second world, where $\Delta>0$ is some parameter to be chosen. Note that in both worlds, pulling a wrong arm incurs an instanteous regret of $\Delta$. Let $P_1^t$ and $P_2^t$ be the distribution of the rewards up to time $t$ in respective worlds, then the minimax regret is lower bounded by

$R_T^\star \ge \Delta \sum_{t=1}^T \frac{P_1^{t-1}(\pi_t = 2) + P_2^{t-1}(\pi_t = 1) }{2}.$

Using Lemma 1 and

$\|P_1^t - P_2^t\|_{\text{\rm TV}}^2 \le \frac{1}{2}D_{\text{KL}}(P_1^t\|P_2^t) = \frac{t\Delta^2}{4} \le \frac{T\Delta^2}{4},$

we conclude that $R_T^\star = \Omega(\sqrt{T})$ by choosing $\Delta \asymp T^{-1/2}$

We also consider the case where there is a known suboptimality gap $\Delta$ between arms, i.e., $|\mu_1 - \mu_2| \ge \Delta>0$. Using the above arguments and the inequality

$\|P-Q\|_{\text{TV}} \le 1 - \frac{1}{2}\exp(-D_{\text{KL}}(P\|Q) ),$

we will arrive at

$R_T^\star = \Omega\left(\sum_{t=1}^T \Delta\exp(-t\Delta^2) \right) = \Omega\left(\frac{1-\exp(-T\Delta^2)}{\Delta}\right),$

which converges to a constant $1/\Delta$ as $T\rightarrow\infty$. However, this lower bound is not tight – we will show that $R_T^\star \sim (\log T)/\Delta$ as $T$ gets large. To show this stronger lower bound, we need a more delicate construction of the two points.

Set $(\mu_1, \mu_2) = (\Delta,0)$ in world I, and $(\mu_1,\mu_2) = (\Delta,2\Delta)$ in world II. Let $P_1^t, P_2^t$ be defined as before, then

$D_{\text{KL}}(P_1^t \| P_2^t) \le 2\Delta^2\cdot \mathop{\mathbb E}_1[T_2],$

where $T_2$ is the total number of pulls of arm $2$, and $\mathop{\mathbb E}_1$ denotes the expectation under world I. Hence, applying the previous arguments, we have

$R_T^\star = \Omega(T\Delta\cdot \exp(-2\Delta^2 \mathop{\mathbb E}_1[T_2])).$

Clearly, this is a better bound for small $\mathop{\mathbb E}_1[T_2]$, while it is less useful when $\mathop{\mathbb E}_1[T_2]$ is large. However, a large $\mathop{\mathbb E}_1[T_2]$ implies that the learner pulls the suboptimal arm (i.e., arm 2) too many times in world I, and in mathematical words we have

$R_T^\star = \Omega(\Delta\cdot \mathop{\mathbb E}_1[T_2]).$

Note that these two bounds exactly characterize the tradeoff between exploitation and exploration. Combining these bounds, we have

$R_T^\star = \Omega(\Delta\cdot \mathop{\mathbb E}_1[T_2] + T\Delta\cdot \exp(-2\Delta^2 \mathop{\mathbb E}_1[T_2]) ) = \Omega\left(\frac{\max\{1, \log(T\Delta^2) \} }{\Delta} \right),$

where the second inequality follows from the minimization of the inner sum with respect to $\mathop{\mathbb E}_1[T_2]\in [0,T]$

In summary, we have two different lower bounds: when there is no assumption on the optimality gap, then $R_T^\star = \Omega(\sqrt{T})$; when the optimality gap is at least $\Delta>0$, then $R_T^\star = \Omega(\max\{1, \log(T\Delta^2) \} / \Delta )$. Both bounds are tight and there exist policies attaining these bounds; we refer to bibliographic notes for details.

3.6. Example VI: Multi-armed Bandits

Interestingly, the two-point method can also give the correct minimax lower bound of the regret for general multi-armed bandits (in later lectures, we will provide a proof using multiple points). However, the following application of the two-point method differs from all previous ones. In previous examples, we fix two points and prove that all estimators (or policies) cannot achieve small risk (or regret) under both points. In this example, the construction of the two points depends on the policy.

We first define the multi-armed bandit problem. In fact, the only difference from the two-armed case is that now there are $K$ arms in total, and the $i$-th arm has reward distribution $\mathcal{N}(\mu_i,1)$. The regret of a given policy $\pi$ is defined as

$R_T(\pi) = T\max_{i\in [K]} \mu_i - \sum_{t=1}^T \mu_{\pi_t}.$

To apply the two-point method, let the first point be $(\mu_1,\cdots,\mu_K)=(\Delta,0,\cdots,0)$. Clearly, the first arm is the optimal arm, and pulling any wrong arm incurs an instanteous regret $\Delta$. For any given policy $\pi$, the choice of the second point will depend on the policy: for $i\neq 1$, let $\mathop{\mathbb E}[T_i]$ be the expected number of pulls of arm $i$ for policy $\pi$ under the first point. Since $\sum_{i=2}^K \mathop{\mathbb E}[T_i]\le T$, there must exist some $i_0\neq 1$ such that $\mathop{\mathbb E}[T_{i_0}]\le T/(K-1)$. In other words, under the first point, the policy $\pi$ does not pull arm $i_0$ much in expectation. Note that $i_0$ depends on the policy $\pi$

Now the second point is as follows: $(\mu_1,\cdots,\mu_K) = (\Delta,0,\cdots,0,2\Delta,0,\cdots,0)$, where the only difference from the first point is that $\mu_{i_0}=2\Delta$. We claim that the two points are hard to distinguish: let $P_1, P_2$ be the distribution of all observed rewards under these two points, respectively, then

$D_{\text{KL}}(P_1 \| P_2) = 2\Delta^2\cdot \mathop{\mathbb E}[T_{i_0}] \le \frac{2\Delta^2 T}{K-1}.$

Hence, by choosing $\Delta \asymp \sqrt{K/T}$ we have $D_{\text{KL}}(P_1 \| P_2) = O(1)$, and consequently the worst-case regret for the policy $\pi$ is at least $\Omega(T\Delta) = \Omega(\sqrt{KT})$. By the arbitrariness of $\pi$, this is a minimax lower bound. This bound is tight for multi-armed bandits.

The idea of constructing points based on given approaches is widely used in problems where several rounds of adaptivity are available. One future lecture will be devoted exclusively to this topic.

4. Bibliographic Notes

The lower bound technique based on hypothesis testing was pioneered by Ibragimov and Khas’minskii (1977) (see also Le Cam (1973) for the binary case). The general $f$-divergence was proposed by Csiszár (1967), and the joint range of $f$-divergences (Theorem 4) was established in Harremoës and Vajda (2011). For the results in convex analysis, the Choquet–Bishop–de Leeuw theorem can be found in Phelps (2001), and the Fenchel–Eggleston–Carathéodory theorem can be found in Eggleston (1958). The inequalities between $f$-divergences and the two-point method can be mostly found in the excellent textbook Tsabykov (2009).

For the examples, the first lower bound of entropy estimation is taken from Wu and Yang (2016). For the second lower bound, Lemma 9 was proved in Valiant (2008) under the Poissonized model, and Lemma 10 is taken from Archaya et al. (2016). For robust statistics, Huber’s $\varepsilon$-contamination model dates back to Huber (1964), and the general lower bound technique in Lemma 11 is taken from Chen, Gao and Ren (2016). For the robust estimators achieving the lower bound, see Chen, Gao and Ren (2018) and Ilias et al. (2018).

The lower bounds of bandit problems are taken from Voger (1960) and Lai and Robbins (1985). We also refer to Bubeck, Perchet and Rigollet (2013) and Lattimore and Szepesvári (2018) for non-asymptotic bounds. For more on bandit problems such as the upper bounds, we refer to the book Bubeck and Cesa-Bianchi (2012).

1. Il’dar Abdullovich Ibragimov and Rafail Zalmanovich Khas’minskii. On the estimation of an infinite-dimensional parameter in Gaussian white noise. Doklady Akademii Nauk. Vol. 236. No. 5. Russian Academy of Sciences, 1977.
2. Lucien Le Cam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics 1.1 (1973): 38-53.
3. Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica 2 (1967): 229-318.
4. Peter Harremoës and Igor Vajda. On pairs of $f$-divergences and their joint range. IEEE Transactions on Information Theory 57.6 (2011): 3230-3235.
5. Robert R. Phelps. Lectures on Choquet’s theorem. Springer Science & Business Media, 2001.
6. H. G. Eggleston. Convexity. Cambridge University Press, 1958.
7. Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
8. Yihong Wu and Pengkun Yang, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702-3720.
9. Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing 40.6 (2011): 1927-1968.
10. Jayadev Acharya et al. Estimating Rényi entropy of discrete distributions. IEEE Transactions on Information Theory 63.1 (2016): 38-56.
11. Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics (1964): 73-101.
12. Mengjie Chen, Chao Gao, and Zhao Ren. A general decision theory for Huber’s $\varepsilon$-contamination model. Electronic Journal of Statistics 10.2 (2016): 3752-3774.
13. Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under Huberâ€™s contamination model. The Annals of Statistics 46.5 (2018): 1932-1960.
14. Ilias Diakonikolas et al. Robustly learning a gaussian: Getting optimal error, efficiently. Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2018.
15. Walter Vogel. A sequential design for the two armed bandit. The Annals of Mathematical Statistics 31.2 (1960): 430-443.
16. Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6.1 (1985): 4-22.
17. Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. Conference on Learning Theory, 2013.
18. Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint (2018).
19. Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5.1 (2012): 1-122.

# Lecture 4: Local Asymptotic Normality and Asymptotic Theorems

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

In this lecture, we establish the asymptotic lower bounds for general statistical decision problems. Specifically, we show that models satisfying the benign local asymptotic normality (LAN) condition are asymptotically equivalent to a Gaussian location model, under which the Hájek–Le Cam local asymptotic minimax (LAM) lower bound holds. We also apply this theorem to both parametric and nonparametric problems.

1. History of Asymptotic Statistics

To begin with, we first recall the notions of the score function and Fisher information, which can be found in most textbooks.

Definition 1 (Fisher Information) A family of distributions $(P_\theta)_{\theta\in\Theta\subseteq {\mathbb R}^d}$ on $\mathcal{X}$ is quadratic mean differentiable (QMD) at $\theta\in {\mathbb R}^d$ if there exists a score function $\dot{\ell}_\theta: \mathcal{X} \rightarrow {\mathbb R}^d$ such that

$\int\left(\sqrt{dP_{\theta+h}} - \sqrt{dP_\theta} - \frac{1}{2}h^\top \dot{\ell}_\theta \sqrt{dP_\theta} \right)^2 = o(\|h\|^2).$

In this case, the matrix $I(\theta) := \mathop{\mathbb E}_{P_\theta}[\dot{\ell}_\theta\dot{\ell}_\theta^\top ]$ exists and is called the Fisher information at $\theta$

R. A. Fisher popularized the above Fisher information and the usage of the maximum likelihood estimator (MLE) starting from 1920s. He believes that, the Fisher information of a statistical model characterizes the fundamental limits of estimating $\theta$ based on $n$ i.i.d. observations from $P_\theta$, and MLE asymptotically attains this limit. More specifically, he makes the following conjectures:

1. For any asymptotically normal estimators $\hat{\theta}_n$ such that $\sqrt{n}(\hat{\theta}_n - \theta) \overset{d}{\rightarrow} \mathcal{N}(0, \Sigma_\theta)$ for any $\theta\in \Theta\subseteq {\mathbb R}^d$, there must be $\Sigma_\theta \succeq I(\theta)^{-1}, \qquad \forall \theta\in \Theta. \ \ \ \ \ (1)$
2. The MLE satisfies that $\Sigma_\theta = I(\theta)^{-1}$ for any $\theta\in\Theta$

Although the second conjecture is easier to establish assuming certain regularity conditions, the first conjecture, which seems to be correct by the well-known Cramér-Rao bound, actually caused some trouble when people tried to prove it. The following example shows that (1) may be quite problematic.

Example 1 Here is a counterexample to (1) proposed by Hodges in 1951. Let $\Theta = {\mathbb R}$, and consider a Gaussian location model where $X_1,\cdots,X_n$ are i.i.d. distributed as $\mathcal{N}(\theta,1)$. A natural estimator of $\theta$ is the empirical mean $\bar{X}_n = \sum_{i=1}^n X_i/n$, and the Fisher information is $I(\theta)\equiv 1$. The Hodges’ estimator is constructed as follows:

$\hat{\theta}_n = \bar{X}\cdot 1(|\bar{X}| \ge n^{-1/4}). \ \ \ \ \ (2)$

It is easy to show that $\sqrt{n}(\hat{\theta}_n - \theta) \overset{d}{\rightarrow} \mathcal{N}(0, \Sigma_\theta)$ for any $\theta\in {\mathbb R}$, with $\Sigma_\theta = 1$ for non-zero $\theta$ and $\Sigma_\theta = 0$ for $\theta=0$. Consequently, (1) does not hold for the Hodges’ estimator. The same result holds if the threshold $n^{-1/4}$ in (2) is changed by any sequence $a_n$ with $a_n = o(1)$ and $a_n = \omega(n^{-1/2})$

Hodges’ example suggests that (1) should at least be weakened in appropriate ways. Observing the structure of Hodges’ estimator (2) carefully, there can be three possible attempts:

1. The estimator $\hat{\theta}_n$ is superefficient (i.e., where (1) fails) only at one point $\theta=0$. It may be expected that the set of $\theta$ such that (1) fails is quite small.
2. Although the Hodges’ estimator satisfies the asymptotic normal condition, i.e., $\sqrt{n}(\hat{\theta}_n - \theta)$ weakly converges to a normal distribution under $P_\theta^{\otimes n}$, for any non-zero perturbation $h\in {\mathbb R}$, the sequence $\sqrt{n}(\hat{\theta}_n - \theta - h/\sqrt{n})$ does not weakly converge to the same distribution under $P_{\theta+h/\sqrt{n}}^{\otimes n}$. Hence, we may expect that (1) actually holds for more regular estimator sequences.
3. Let $R(\theta)$ be the risk function of the estimator $\hat{\theta}_n$ under the absolute value loss. It can be computed that $R(0) = o(1/\sqrt{n})$, while $R(\theta) = \Omega(|\theta|)$ for $|\theta| = O(n^{-1/4})$. In other words, the worst-case risk over an interval of size $n^{-1/2}$ around $\theta=0$ is still of the order $\Omega(n^{-1/2})$, which is considerably larger than the single point $\theta=0$. Therefore, it may make sense to consider the local minimax risk.

It turns out that all these attempts can be successful, and the following theorem summarizes the key results of asymptotic statistics developed by J. Hájek and L. Le Cam in 1970s.

Theorem 2 (Asymptotic Theorems) Let $(P_\theta)_{\theta\in\Theta\subseteq {\mathbb R}^d}$ be a QMD statistical model which admits a non-singular Fisher information $I(\theta_0)$ at $\theta_0$. Let $\psi(\theta)$ be differentiable at $\theta=\theta_0$, and $T_n$ be an estimator sequence of $\psi(\theta)$ in the model $(P_\theta^{\otimes n})$.

1. (Almost everywhere convolution theorem) If $\sqrt{n}(T_n - \psi(\theta))$ converges in distribution to some probability measure $L_\theta$ for every $\theta$, and $I(\theta)$ is non-singular for every $\theta$, then there exists some probability measure $M$ such that $L_{\theta} = \mathcal{N}(0, \dot{\psi}_{\theta} I(\theta)^{-1}\dot{\psi}_{\theta}^\top) * M$
for Lebesgue almost every $\theta$, where $*$ denotes the convolution.

2. (Convolution theorem) If $\sqrt{n}(T_n - \psi(\theta))$ converges in distribution to some probability measure $L_\theta$ for $\theta = \theta_0$, and $T_n$ is regular in the sense that $\sqrt{n}(T_n - \psi(\theta+h/\sqrt{n}))$ weakly converges to the same limit under $P_{\theta+h/\sqrt{n}}^{\otimes n}$ for every $h\in {\mathbb R}^d$, then there exists some probability measure $M$ such that $L_{\theta_0} = \mathcal{N}(0, \dot{\psi}_{\theta_0} I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top) * M.$

3. (Local asymptotic minimax theorem) Let $\ell$ be a bowl-shaped loss function, i.e., $\ell$ is non-negative, symmetric and quasi-convex. In mathematical words, $\ell(x) = \ell(-x)\ge 0$ and the sublevel sets $\{x: \ell(x)\le t \}$ are convex for all $t\in {\mathbb R}$. Then $\lim_{c\rightarrow\infty} \liminf_{n\rightarrow\infty} \sup_{\|h\|\le c } \mathop{\mathbb E}_{\theta_0+\frac{h}{\sqrt{n}}} \ell\left(\sqrt{n}\left(T_n - \psi\left(\theta + \frac{h}{\sqrt{n}}\right) \right) \right) \ge \mathop{\mathbb E} \ell(Z)$
with $Z\sim \mathcal{N}(0, \dot{\psi}_{\theta_0} I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top)$

We will be primarily interested in the local asymptotic minimax (LAM) theorem, for it directly gives general lower bounds for statistical estimation. This theorem will be proved in the next two sections using asymptotic equivalence between models, and some applications will be given in the subsequent section.

2. Gaussian Location Model

In this section we study the possibly simplest statistical model, i.e., the Gaussian location model, and will show in the next section that all regular models will converge to it asymptotically. In the Gaussian location model, we have $\theta\in {\mathbb R}^d$ and observes $X\sim \mathcal{N}(\theta,\Sigma)$ with a known non-singular covaraince $\Sigma$. Consider a bowl-shaped loss function $\ell$ (defined in Theorem 2), a natural estimator of $\theta$ is $\hat{\theta}=X$, whose worst-case risk is $\mathop{\mathbb E} \ell(Z)$ with $Z\sim \mathcal{N}(0,\Sigma)$. The main theorem in this section is that the natural estimator $\hat{\theta}=X$ is minimax.

Theorem 3 For any bowl-shaped loss $\ell$, we have

$\inf_{\hat{\theta}} \sup_{\theta\in {\mathbb R}^d} \mathop{\mathbb E}_\theta \ell( \hat{\theta} - \theta ) = \mathop{\mathbb E} \ell(Z)$

for $Z\sim \mathcal{N}(0,\Sigma)$

The proof of Theorem 3 relies on the following important lemma for Gaussian random variables.

Lemma 4 (Anderson’s Lemma) Let $Z\sim \mathcal{N}(0,\Sigma)$ and $\ell$ be bowl-shaped. Then

$\min_{x\in{\mathbb R}^d} \mathop{\mathbb E} \ell(Z+x) = \mathop{\mathbb E} \ell(Z).$

Proof: For $t\ge 0$, let $K_t = \{z: \ell(z)\le t \} \subseteq {\mathbb R}^d$. Since $\ell$ is bowl-shaped, the set $K_t$ is convex. Moreover, since

$\mathop{\mathbb E} \ell(Z+x) = \int_0^\infty (1 - \mathop{\mathbb P}(Z\in K_t - x) )dt,$

it suffices to show that $\mathop{\mathbb P}(Z\in K_t - x )\le \mathop{\mathbb P}(Z\in K_t)$ for any $x\in {\mathbb R}^d$. We shall need the following inequality in convex geometry.

Theorem 5 (Prépoka-Leindler Inequality, or Functional Brunn-Minkowski Inequality) Let $\lambda\in (0,1)$ and $f,g,h$ be non-negative real-valued measurable functions on ${\mathbb R}^d$, with

$h((1-\lambda)x + \lambda y) \ge f(x)^{1-\lambda} g(y)^\lambda, \qquad \forall x,y\in {\mathbb R}^d.$

Then

$\int_{{\mathbb R}^d} h(x)dx \ge \left( \int_{{\mathbb R}^d} f(x)dx\right)^{1-\lambda} \left( \int_{{\mathbb R}^d} g(x)dx\right)^\lambda.$

Let $\phi$ be the density function of $Z$, which is log-concave. Consider functions

$f(z) = \phi(z)\cdot 1_{K_t+x} (z), \quad g(z) = \phi(z)\cdot 1_{K_t-x}(z), \quad h(z) = \phi(z)\cdot 1_{K_t}(z)$

and $\lambda=1/2$, the log-concavity of $\phi$ and $(K_t-x)/2 + (K_t+x)/2 = K_t/2 + K_t/2 = K_t$ by convexity of $K_t$ ensures the condition of Theorem 5. Hence, Theorem 5 gives

$\mathop{\mathbb P}(Z\in K_t) \ge \sqrt{\mathop{\mathbb P}(Z\in K_t+x) \cdot \mathop{\mathbb P}(Z\in K_t-x) }.$

Finally, by symmetry of $Z$ and $K_t$, we have $\mathop{\mathbb P}(Z\in K_t+x) = \mathop{\mathbb P}(Z\in K_t-x)$, which completes the proof. $\Box$

Now we are ready to prove Theorem 3.

Proof: Consider a Gaussian prior $\pi=\mathcal{N}(0,\Sigma_0)$ on $\theta$. By algebra, the posterior distribution of $\theta$ given $X=x$ is Gaussian distributed as $\mathcal{N}( (\Sigma_0^{-1}+\Sigma^{-1})^{-1}\Sigma^{-1}x, (\Sigma_0^{-1}+\Sigma^{-1})^{-1} )$. By Proposition 3 in Lecture 3 and the above Anderson’s lemma, the Bayes estimator is then $\hat{\theta} = (\Sigma_0^{-1}+\Sigma^{-1})^{-1}\Sigma^{-1}X$. Since the minimax risk is lower bounded by any Bayes risk (as the maximum is no less than the average), we have

$\inf_{\hat{\theta}} \sup_{\theta\in {\mathbb R}^d} \mathop{\mathbb E}_\theta \ell( \hat{\theta} - \theta ) \ge \int \ell d\mathcal{N}(0, (\Sigma_0^{-1}+\Sigma^{-1})^{-1} ).$

Since this inequality holds for any $\Sigma_0$, choosing $\Sigma_0 = \lambda I$ with $\lambda\rightarrow\infty$ completes the proof of Theorem 3. $\Box$

3. Local Asymptotic Minimax Theorem

In this section, we show that regular statistical models converge to a Gaussian location model asymptotically. To prove so, we shall need verifiable criterions to establish the convergence of Le Cam’s distance, as well as the specific regularity conditions.

3.1. Likelihood Ratio Criteria for Asymptotic Equivalence

In Lecture 3 we introduced the notion of Le Cam’s model distance, and showed that it can be upper bounded via the randomization criterion. However, designing a suitable transition kernel between models is too ad-hoc and sometimes challenging, and it will be helpful if simple criteria suffice.

The main result of this subsection is the following likelihood ratio criteria:

Theorem 6 Let $\mathcal{M}_n = (\mathcal{X}_n, \mathcal{F}_n, \{P_{n,0},\cdots, P_{n,m}\})$ and $\mathcal{M} = (\mathcal{X}, \mathcal{F}, \{P_0, \cdots, P_m \})$ be finite statistical models. Further assume that $\mathcal{M}$ is homogeneous in the sense that any pair in $\{P_0,\cdots,P_m\}$ is mutually absolutely continuous. Define

$L_{n,i}(x_n) = \frac{dP_{n,i}}{dP_{n,0}}(x_n), \qquad i\in [m]$

as the likelihood ratios, and similarly for $L_i(x)$. Then $\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{M})=0$ if the distribution of $(L_{n,1}(X_n),\cdots, L_{n,m}(X_n))$ under $X_n\sim P_{n,0}$ weakly converges to that of $(L_1(X),\cdots,L_m(X))$ under $X\sim P_0$

In other words, Theorem 6 states that a sufficient condition for asymptotic equivalence of models is the weak convergence of likelihood ratios. Although we shall not use that, this is also a necessary condition. The finiteness assumption is mainly for technical purposes, and the general case requires proper limiting arguments.

To prove Theorem 6, we need the following notion of standard models

Definition 7 (Standard Model) Let $\mathcal{S}_{m+1} = \{(t_0,\cdots,t_m)\in {\mathbb R}_+^{m+1}: \sum_{i=0}^m t_i=m+1 \}$, and $\sigma(\mathcal{S}_{m+1})$ be its Borel $\sigma$-algebra. A standard distribution $\mu$ on $(\mathcal{S}_{m+1},\sigma(\mathcal{S}_{m+1}))$ is a probability measure such that $\mathop{\mathbb E}_{\mu}[t_i]= 1$ for any $i=0,1,\cdots,m$. The model

$\mathcal{N} = (\mathcal{S}_{m+1},\sigma(\mathcal{S}_{m+1}), \{Q_0,\cdots, Q_m \}), \qquad \text{with}\quad dQ_i = t_id\mu,$

is called the standard model of $\mu$

The following lemma shows that any finite statistical model can be transformed into an equivalent standard form.

Lemma 8 Let $\mathcal{M} = (\mathcal{X}, \mathcal{F}, \{P_0, \cdots, P_m \})$ be a finite model, and $\mathcal{N}$ be a standard model with standard distribution $\mu$ being the distribution of $t:=(\frac{dP_0}{d\overline{P}},\cdots,\frac{dP_m}{d\overline{P}})\in \mathcal{S}_{m+1}$ under mean measure $\overline{P}=\sum_{i=0}^m P_i/(m+1)$. Then $\Delta(\mathcal{M},\mathcal{N})=0$

Proof: Since $\mathop{\mathbb E}_{\mu}[t_i] = \mathop{\mathbb E}_{\overline{P}}[dP_i/d\overline{P}]=1$, the measure $\mu$ is a standard distribution. Moreover, let $Q_i$ be the distribution of $t$ under $P_i$, then

$\mathop{\mathbb E}_{Q_i}[h(t)] = \mathop{\mathbb E}_{\overline{P}}\left[h(t)\frac{dP_i}{d\overline{P}} \right] = \mathop{\mathbb E}_{\mu}\left[h(t) t_i \right]$

for any measurable function $h$, which gives $dQ_i = t_id\mu$, agreeing with the standard model. Finally, since $dP_i(x) = t_i d\mu(x)$, by the factorization criterion (e.g., Theorem 7 in Lecture 3) we conclude that the statistic $t$ is sufficient, and therefore $\Delta(\mathcal{M},\mathcal{N})=0$. $\Box$

Lemma 8 helps to convert the sample space of all finite models to the simplex $\mathcal{S}_{m+1}$, and comparison between models is reduced to the comparison between their standard distributions. Consequently, we have the following quantitative bound on the model distance between finite models.

Lemma 9 Let $\mathcal{M} = (\mathcal{X},\mathcal{F},\{P_0,\cdots,P_m \})$ and $\mathcal{N} = (\mathcal{Y}, \mathcal{G}, \{Q_0,\cdots,Q_m\})$ be two finite models with standard distributions $\mu_1, \mu_2$ respectively. Then

$\Delta(\mathcal{M},\mathcal{N}) \le (m+1)\cdot \|\mu_1 - \mu_2\|_{\text{\rm D}} := (m+1)\cdot \sup_{\phi} |\mathop{\mathbb E}_{\mu_1} \phi - \mathop{\mathbb E}_{\mu_2} \phi|,$

where $\|\mu_1 - \mu_2\|_{\text{\rm D}}$ denotes the Dudley’s metric between probability measures $\mu_1,\mu_2$, and the supremum is taken over all measurable functions $\phi: \mathcal{S}_{m+1}\rightarrow {\mathbb R}$ with $\|\phi\|_\infty\le 1$ and $|\phi(t_1) - \phi(t_2)| \le \|t_1 - t_2\|_\infty$ for any $t_1,t_2\in \mathcal{S}_{m+1}$

Remark 1 Recall that Dudley’s metric metrizes the weak convergence of probability measures on a metric space with its Borel $\sigma$-algebra. The fact that it is smaller than the total variation distance will be crucial to establish Theorem 6.

Proof: Similar to the proof of the randomization criterion (Theorem 5 in Lecture 3), the following upper bound on the model distance holds:

$\Delta(\mathcal{M},\mathcal{N}) \le \sup_{L,\pi} | R_{L,\pi}^\star(\mathcal{M}) - R_{L,\pi}^\star(\mathcal{N}) |,$

where $R_{L,\pi}^\star(\mathcal{M})$ denotes the Bayes risk of model $\mathcal{M}$ under loss $L$ and prior $\pi$, and the loss $L$ is non-negative and upper bounded by one in the supremum. By algebra, the Bayes risk admits the following simple form under the standard model with standard distribution $\mu$

$R_{L,\pi}^\star(\mu) = \int_{\mathcal{S}_{m+1}} \left(\inf_{c(t)\in \mathcal{C}_{L,\pi}(t)} c(t)^\top t \right) \mu(dt),$

where the set $\mathcal{C}_{L,\pi}(t)$ is defined as

$\mathcal{C}_{L,\pi}(t) = \left\{\left(\pi_0\int L(0,a)\delta(t,da), \cdots, \pi_m\int L(m,a)\delta(t,da) \right): \text{stochastic kernel }\delta: \mathcal{S}_{m+1}\rightarrow \mathcal{A} \right\}.$

Since the diameter of $\mathcal{C}_{L,\pi}(t)$ in $\|\cdot\|_1$ is one, we conclude that $t\mapsto \inf_{c(t)\in \mathcal{C}_{L,\pi}(t)} c(t)^\top t$ is upper bounded by $m+1$ and $1$-Lipschitz under $\|\cdot\|_\infty$. The rest of the proof follows from the definition of Dudley’s metric. $\Box$

Finally we are ready to present the proof of Theorem 6. Note that there is a bijective map between $(L_1,\cdots,L_m)$ and $(\frac{dP_0}{d\overline{P}},\cdots,\frac{dP_m}{d\overline{P}})$, which is continuous under the model $\mathcal{M}$ due to the homogeneity assumption. Then by continuous mapping theorem (see remark below), the weak convergence of likelihood ratios implies the weak convergence of their standard distributions. Since Dudley’s metric metrizes the weak convergence of probability measures, the result of Lemma 9 completes the proof.

Remark 2 The continuous mapping theorem for weak convergence states that, if Borel-measurable random variables $X_n$ converges weakly to $X$ on a metric space, and $f$ is a function continuous on a set $C$ such that $\mathop{\mathbb P}(X\in C)=1$, then $f(X_n)$ also converges weakly to $f(X)$. Note that the function $f$ is only required to be continuous on the support of the limiting random variable $X$

3.2. Locally Asymptotically Normal (LAN) Models

Motivated by Theorem 6, in order to prove that certain models asymptotically become normal, we may show that the likelihood functions weakly converge to those in the normal model. Note that for the Gaussian location model $\mathcal{M}=({\mathbb R}^d, \mathcal{B}({\mathbb R}^d), (\mathcal{N}(\theta,\Sigma))_{\theta\in{\mathbb R}^d})$, the log-likelihood ratio is given by

$\log \frac{dP_{\theta+h}}{dP_\theta}(x) = h^\top \Sigma^{-1}(x-\theta) - \frac{1}{2}h^\top \Sigma^{-1}h, \ \ \ \ \ (3)$

where $\Sigma^{-1}(x-\theta)\sim \mathcal{N}(0,\Sigma^{-1})$ for $x\sim P_\theta$. The equation (3) motivates the following definition of local asymptotic normal (LAN) models, in which the likelihood function looks like (3).

Definition 10 (Local Asymptotic Normality) A sequence of models $\mathcal{M}_n = (\mathcal{X}_n, \mathcal{F}_n, (P_{n,h})_{h\in \Theta_n})$ with $\Theta_n \uparrow {\mathbb R}^d$ is called locally asymptotically normal (LAN) with central sequence $Z_n$ and Fisher information matrix $I_0$ if

$\log\frac{dP_{n,h}}{dP_{n,0}} = h^\top Z_n - \frac{1}{2}h^\top I_0h + r_n(h), \ \ \ \ \ (4)$

with $Z_n \overset{d}{\rightarrow} \mathcal{N}(0,I_0)$ under $P_{n,0}$, and $r_n(h)$ converges to zero in probability under $P_{n,0}$ for any fixed $h\in {\mathbb R}^d$

Based on the form of the likelihood ratio in (4), the following theorem is then immediate.

Theorem 11 If a sequence of models $\mathcal{M}_n$ satisfies the LAN condition with Fisher information matrix $I_0$, then $\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{M}) = 0$ for $\mathcal{M}=({\mathbb R}^d, \mathcal{B}({\mathbb R}^d), (\mathcal{N}(\theta,I_0^{-1}))_{\theta\in{\mathbb R}^d})$

Proof: Note that for any finite sub-model, Slutsky’s theorem applied to (4) gives the desired convergence in distribution, and clearly the Gaussian location model is homogeneous. Now applying Theorem 6 gives the desired convergence. We leave the discussion of the general case in the bibliographic notes. $\Box$

Now the only remaining task is to check the likelihood ratios for some common models and show that the LAN condition is satisfied. For example, for QMD models $(P_\theta)_{\theta\in\Theta}$, we have

$\log \frac{dP_{\theta+h/\sqrt{n}}^{\otimes n}}{dP_\theta^{\otimes n}}(X) = h^\top \left( \frac{1}{\sqrt{n}}\dot{\ell}_\theta(X_i) \right) + \frac{1}{2} h^\top \left(\frac{1}{n}\sum_{i=1}^n \ddot{\ell}_\theta(X_i) \right)h + o_{P_\theta^{\otimes n}}(1),$

where intuitively by CLT and LLN will arrive at the desired form (4). The next proposition makes this intuition precise.

Proposition 12 Let $(P_\theta)_{\theta\in\Theta}$ be QMD in an open set $\Theta\in{\mathbb R}^d$ with Fisher information matrix $I(\theta)$. Then the sequence of model $\mathcal{M}_n = (\mathcal{X}^n, \mathcal{F}^{\otimes n}, (P_{\theta+h/\sqrt{n}}^{\otimes n} )_{h\in \Theta_n} )$, with $\Theta_n = \{h\in {\mathbb R}^d: \theta+ h/\sqrt{n}\in \Theta\}\uparrow {\mathbb R}^d$ satisfies the LAN condition with Fisher information $I(\theta)$

Proof: Write $P_n = P_{\theta+h/\sqrt{n}}, P = P_\theta$ and $g = h^\top \dot{\ell}_\theta$. Then by Taylor expansion,

$\begin{array}{rcl} \log\frac{dP_n^{\otimes n}}{dP^{\otimes n}}(X) & = & \sum_{i=1}^n \log \frac{dP_n}{dP}(X_i) \\ &=& 2\sum_{i=1}^n \log\left(1+\frac{1}{2}W_{ni}\right) \\ &=& \sum_{i=1}^n W_{ni} - \frac{1}{4}\sum_{i=1}^n W_{ni}^2 + \sum_{i=1}^n W_{ni}^2 r(W_{ni}), \end{array}$

where $W_{ni} := 2(\sqrt{(dP_n/dP)(X_i)} - 1)$, and $r(W_{ni}) = o_{W_{ni}}(1)$. By the QMD condition, we have

$\text{Var}\left(\sum_{i=1}^n W_{ni} - \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i) \right) = n\text{Var}\left(W_{n1} - \frac{g(X_1)}{\sqrt{n}}\right) = o(1).$

Moreover, $\mathop{\mathbb E}_P g(X_i) = 0$ by the property of the score function, and

$\mathop{\mathbb E}\left[\sum_{i=1}^n W_{ni}\right] = -n\int (\sqrt{dP_n} - \sqrt{dP})^2 \rightarrow - \frac{1}{4}\mathop{\mathbb E}_P[g(X)^2].$

Consequently, we conclude that

$\sum_{i=1}^n W_{ni} = \frac{1}{\sqrt{n}}\sum_{i=1}^n g(X_i) -\frac{1}{4}\mathop{\mathbb E}_P[g(X)^2] + o_P(1).$

For the second term, the QMD condition gives $nW_{ni}^2 = g(X_i)^2 + o_P(1)$, and therefore LLN gives $\sum_{i=1}^n W_{ni}^2 = \mathop{\mathbb E}_P[g(X)^2] + o_P(1)$. For the last term, the Markov’s inequality gives $n\mathop{\mathbb P}(|W_{ni}|\ge \varepsilon)\rightarrow 0$, and therefore $\max_{i\in [n]} | r(W_{ni}) | = o_P(1)$, as desired. $\Box$

In other words, Proposition 12 implies that all regular statistical models locally look like normal, where the local radius is $\Theta(n^{-1/2})$

3.3. Proof of LAM Theorem

Now we are ready to glue all necessary ingredients together. First, for product QMD statistical models, Proposition 12 implies that LAN condition is satisfied for the local model around any chosen parameter $\theta$. Second, by Theorem 11, these local models will converge to a Gaussian location model with covariance $I(\theta)^{-1}$. By definition of the model distance, the minimax risk of these local models will be asymptotically no smaller than that of the limiting Gaussian location model, which by Theorem 3 is $\mathop{\mathbb E} \ell(Z)$ with $Z\sim \mathcal{N}(0,I(\theta)^{-1})$ for bowl-shaped loss $\ell$. Consequently, we have the following local asymptotic minimax theorem.

Theorem 13 (LAM, restated) Let $(P_\theta)_{\theta\in\Theta\subseteq {\mathbb R}^d}$ be a QMD statistical model which admits a non-singular Fisher information $I(\theta_0)$ at $\theta_0$. Let $\psi(\theta)$ be differentiable at $\theta=\theta_0$, and $T_n$ be an estimator sequence of $\psi(\theta)$ in the model $(P_\theta^{\otimes n})$. Consider any compact action space $\mathcal{A}\subset {\mathbb R}^d$, and any bowl-shaped loss function $\ell: {\mathbb R}^d\rightarrow {\mathbb R}_+$. Then

$\lim_{c\rightarrow\infty} \liminf_{n\rightarrow\infty} \sup_{\|h\|\le c } \mathop{\mathbb E}_{\theta_0+\frac{h}{\sqrt{n}}} \ell\left(\sqrt{n}\left(T_n - \psi\left(\theta + \frac{h}{\sqrt{n}}\right) \right) \right) \ge \mathop{\mathbb E} \ell(Z)$

with $Z\sim \mathcal{N}(0, \dot{\psi}_{\theta_0} I(\theta_0)^{-1}\dot{\psi}_{\theta_0}^\top)$

Note that here the compactness of the action space is required for the limiting arguments, while all our previous analysis consider finite models. Our arguments via model distance are also different from those used by H\'{a}jek and Le Cam, where they introduced the notion of contiguity to arrive at the same result with weaker conditions. Further details of these alternative approaches are referred to the bibliographic notes.

4. Applications and Limitations

In this section, we will apply the LAM theorem to prove asymptotic lower bounds for both parametric and nonparametric problems. We will also discuss the limitations of LAM to motivate the necessity of future lectures.

4.1. Parametric Entropy Estimation

Consider the discrete i.i.d. sampling model $X_1,\cdots,X_n\sim P=(p_1,\cdots,p_k)\in\mathcal{M}_k$, where $\mathcal{M}_k$ denotes the probability simplex on $k$ elements. The target is to estimate the Shannon entropy

$H(P) := \sum_{i=1}^k -p_i\log p_i$

under the mean squared loss. We can apply LAM to prove a local minimax lower bound for this problem.

First we compute the Fisher information of the multinomial model $X\sim P$, where we set $\theta=(p_1,\cdots,p_{k-1})$ to be the free parameters. It is easy to show that

$I(\theta) = \text{diag}(p_1^{-1},\cdots,p_{k-1}^{-1}) + p_k^{-1}{\bf 1}{\bf 1}^\top .$

By the matrix inversion formula $(A + UCV )^{-1} = A^{-1} - A^{-1}U(C^{-1} + V A^{-1}U)^{-1}V A^{-1}$, we have

$I(\theta)^{-1} = \text{diag}(\theta) - \theta\theta^\top.$

Now choosing $\psi(\theta) = (\sum_{i=1}^{k-1} -\theta_i\log \theta_i) - (1-\sum_{i=1}^{k-1}\theta_i)\log(1-\sum_{i=1}^{k-1}\theta_i)$ and $\ell(t) = t^2$ in LAM, after some algebra we arrive at

$\inf_{\hat{H}} \sup_{P\in \mathcal{M}_k} \mathop{\mathbb E}_P (\hat{H} - H(P))^2 \ge \frac{1+o_n(1)}{n} \sup_{P\in \mathcal{M}_k} \text{Var}\left[\log\frac{1}{P(X)}\right].$

Note that $\sup_{P\in \mathcal{M}_k}$ on the RHS is due to our arbitrary choice of the centering $\theta$

4.2. Nonparametric Entropy Estimation

Consider a continuous i.i.d. sampling model from some density, i.e., $X_1,\cdots,X_n\sim f$. Assume that the density $f$ is supported on ${\mathbb R}^d$, and the target is to estimate the differential entropy

$h(f) := \int_{{\mathbb R}^d} -f(x)\log f(x)dx.$

As before, we would like to prove a local minimax lower bound for the mean squared error around any target density $f_0$. However, since the model is nonparametric and the parameter $f$ has an infinite dimension, there is no Fisher information matrix for this model. To overcome this difficulty, we may consider a one-dimensional parametric submodel instead.

Let $g: {\mathbb R}^d\rightarrow {\mathbb R}$ be any measurable function with $\text{supp}(g)\subseteq \text{supp}(f_0)$ and $\int_{{\mathbb R}^d} g(x)dx=0$, then $f_0+tg$ is still a valid density on ${\mathbb R}^d$ for small $|t|$. Consequently, keeping $t$ small, the i.i.d. sampling model $X_1,\cdots,X_n\sim f_0+tg$ becomes a submodel parametrized only by $t$. For this 1-D parametric submodel, the Fisher information at $t=0$ can be computed as

$I_0 = \int_{{\mathbb R}^d} \frac{g^2(x)}{f_0(x)} dx.$

Setting $\psi(t) = h(f_0+tg)$ in LAM, we have

$\psi'(0) = \int_{{\mathbb R}^d} g(x)(1-\log f_0(x))dx,$

and consequently

$\begin{array}{rcl} \inf_{\hat{h}} \sup_{f} \mathop{\mathbb E}_f(\hat{h} - h(f))^2 &\ge& \inf_{\hat{h}} \sup_{t} \mathop{\mathbb E}_{f_0+tg}(\hat{h} - \psi(t))^2 \\ &\ge& \frac{1+o_n(1)}{n}\left(\int_{{\mathbb R}^d} \frac{g^2(x)}{f_0(x)} dx\right)^{-1} \left(\int_{{\mathbb R}^d} g(x)(1-\log f_0(x))dx\right)^2. \end{array}$

Since our choice of the test function $g$ is arbitrary, we may actually choose the worst-case $g$ such that the above lower bound is maximized. We claim that the maximum value is

$V(f_0):=\int_{{\mathbb R}^d} f_0(x)\log^2 f_0(x)dx - h(f_0)^2.$

Clearly, this value is attained for the test function $g (x)= f_0(x)(-\log f_0(x) - h(f_0)).$ For the maximality, the Cauchy–Schwartz inequality and the assumption $\int_{{\mathbb R}^d} g(x)dx=0$ gives

$\begin{array}{rcl} \left( \int_{{\mathbb R}^d} g(x)(1-\log f_0(x))dx \right)^2 &=& \left( \int_{{\mathbb R}^d} g(x)(-\log f_0(x) - h(f_0))dx \right)^2\\ &\le & \int_{{\mathbb R}^d} \frac{g(x)^2}{f_0(x)}dx \cdot \int_{{\mathbb R}^d} f_0(x)(-\log f_0(x) - h(f_0))^2dx \\ &=& V(f_0)\cdot \int_{{\mathbb R}^d} \frac{g(x)^2}{f_0(x)}dx. \end{array}$

Therefore, the parametric lower bound for nonparametric entropy estimation is

$\inf_{\hat{h}} \sup_{f} \mathop{\mathbb E}_f(\hat{h} - h(f))^2 \ge \frac{1+o_n(1)}{n}\cdot V(f_0).$

4.3. Limitations of Classical Asymptotics

The theorems from classical asymptotics can typically help to prove an error bound $\Theta(n^{-1/2})$ with an explicit constant, and it is also known that these bounds are optimal and achieved by MLE. However, there are still some problems in these approaches:

1. Non-asymptotic vs. asymptotic: Asymptotic bounds are useful only in scenarios where the problem size remains fixed and the sample size grows to infinity, and there is no general guarantee of when we have entered the asymptotic regime (it may even require that $n\gg e^d$). In practice, essentially all recent problems are high-dimensional ones where the number of parameters is comparable to or even larger than the sample size (e.g., an over-parametrized neural network), and some key properties of the problem may be entirely obscured in the asymptotic regime.
2. Parametric vs. nonparametric: The results in classical asymptotics may not be helpful for a large quantity of nonparametric problems, where the main problem is the infinite-dimensional nature of nonparametric problems. Although sometimes the parametric reduction is helpful (e.g., the entropy example), the parametric rate $\Theta(n^{-1/2})$ is in general not attainable in nonparametric problems and some other tools are necessary. For example, if we would like to estimate the density $f$ at some point $x_0$, the worst-case test function will actually give a vacuous lower bound (which is infinity).
3. Global vs. local: As the name of LAM suggests, the minimax lower bound here holds locally. However, the global structure of some problems may also be important, and the global minimax lower bound may be much larger than the supremum of local bounds over all possible points. For example, in Shannon entropy estimation, the bias is actually dominating the problem and cannot be reflected in local methods.

To overcome these difficulties, we need to develop tools to establish non-asymptotic results for possibly high-dimensional or nonparametric problems, which is the focus of the rest of the lecture series.

5. Bibliographic Notes

The asymptotic theorems in Theorem 2 are first presented in Hájek (1970) and Hájek (1972), and we refer to Le Cam (1986), Le Cam and Yang (1990) and van der Vaart (2000) as excellent textbooks. Here the approach of using model distance to establish LAM is taken from Section 4, 6 of Liese and Miescke (2007); also see Le Cam (1972).

There is another line of approach to establish the asymptotic theorems. A key concept is the contiguity proposed by Le Cam (1960), which enables an asymptotic change of measure. Based on contiguity and LAN condition, the distribution of any (regular) estimator under the local alternative can be evaluated. Then the convolution theorem can be shown, which helps to establish LAM; details can be found in van der Vaart (2000). LAM theorem can also be established directly by computing the asymptotic Bayes risk under proper priors; see Section 6 of Le Cam and Yang (1990).

For parametric or nonparametric entropy estimation, we refer to recent papers (Jiao et al. (2015) and Wu and Yang (2016) for the discrete case, Berrett, Samworth and Yuan (2019) and Han et al. (2017) for the continuous case) and the references therein.

1. Jaroslav Hájek, A characterization of limiting distributions of regular estimates. Zeitschrift fÃ¼r Wahrscheinlichkeitstheorie und verwandte Gebiete 14.4 (1970): 323-330.
2. Jaroslav Hájek, Local asymptotic minimax and admissibility in estimation. Proceedings of the sixth Berkeley symposium on mathematical statistics and probability. Vol. 1. 1972.
3. Lucien M. Le Cam, Asymptotic methods in statistical theory. Springer, New York, 1986.
4. Lucien M. Le Cam and Grace Yang, Asymptotics in statistics. Springer, New York, 1990.
5. Aad W. Van der Vaart, Asymptotic statistics. Vol. 3. Cambridge university press, 2000.
6. Friedrich Liese and Klaus-J. Miescke. Statistical decision theory. Springer, New York, NY, 2007.
7. Lucien M. Le Cam, Limits of experiments. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California, 1972.
8. Lucien M. Le Cam, Locally asymptotically normal families of distributions. University of California Publications in Statistics 3, 37-98 (1960).
9. Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman, Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61.5 (2015): 2835-2885.
10. Yihong Wu and Pengkun Yang, Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62.6 (2016): 3702-3720.
11. Thomas B. Berrett, Richard J. Samworth, and Ming Yuan, Efficient multivariate entropy estimation via $k$-nearest neighbour distances. The Annals of Statistics 47.1 (2019): 288-318.
12. Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu, Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint arXiv:1711.02141 (2017).

# Lecture 3: Statistical Decision Theory: Model Distance and Equivalence

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

This lecture starts to talk about specific tools and ideas to prove information-theoretic lower bounds. We will temporarily restrict ourselves to statistical inference problems (which most lower bounds apply to), where the presence of randomness is a key feature in these problems. Since the observer cannot control the realizations of randomness, the information contained in the observations, albeit not necessarily in a discrete structure (e.g., those in Lecture 2), can still be limited. In this lecture and subsequent ones, we will introduce the reduction and hypothesis testingideas to prove lower bounds of statistical inference, and these ideas will also be applied to other problems.

1. Basics of Statistical Decision Theory

To introduce statistical inference problems, we first review some basics of statistical decision theory. Mathematically, let $(\mathcal{X}, \mathcal{F}, (P_\theta)_{\theta\in\Theta})$ be a collection of probability triplets indexed by $\theta$, and the observation $X$ is a random variable following distribution $P_{\theta}$ for an unknown parameter $\theta$ in the known parameter set $\Theta$. The specific structure of $(P_\theta)_{\theta\in\Theta}$ is typically called models or experiments, for the parameter $\theta$ can represent different model parameters or theories to explain the observation $X$. A decision rule, or a (randomized) estimator is a transition kernel $\delta(x,da)$ from $\mathcal{X}$ to some action space $\mathcal{A}$. Let $L: \Theta\times \mathcal{A}\rightarrow {\mathbb R}_+$ be a loss function, where $L(\theta,a)$ represents the loss of using action $a$ when the true parameter is $\theta$. A central quantity to measure the quality of a decision rule is the risk in the following definition.

Definition 1 (Risk) Under the above notations, the risk of the decision rule $\delta$ under loss function $L$ and the true parameter $\theta$ is defined as

$R_\theta(\delta) = \iint L(\theta,a)P_\theta(dx)\delta(x,da). \ \ \ \ \ (1)$

Intuitively, the risk characterizes the expected loss (over the randomness in both the observation and the decision rule) of the decision rule when the true parameter is $\theta$. When $\delta(x,da)$ is a point mass $\delta(a-T(x))$ for some deterministic function $T:\mathcal{X}\rightarrow \mathcal{A}$, we will also call $T(X)$ as an estimator and the risk in (1) becomes

$R_\delta(T) = \int L(\theta,T(x)) P_\theta(dx) = \mathop{\mathbb E}_{\theta} L(\theta, T(X)). \ \ \ \ \ (2)$

Example 1 In linear regression model with random design, let the observations $(x_1,y_1), \cdots, (x_n,y_n)\in {\mathbb R}^p\times {\mathbb R}$ be i.i.d. with $x_i \sim P_X$ and $y_i|x_i\sim \mathcal{N}(x_i^\top \theta, \sigma^2)$. Here the parameter set $\Theta={\mathbb R}^p$ is a finite-dimensional Euclidean space, and therefore we call this model parametric. The quantity of interest is $\theta$, and the loss function may be chosen to be the prediction error $L(\theta,\hat{\theta}) = \mathop{\mathbb E}_{\theta} (y-x^\top \hat{\theta})^2$ of the linear estimator $f(x) = x^\top \hat{\theta}$

Example 2 In density estimation model, let $X_1, \cdots, X_n$ be i.i.d. drawn from some $1$-Lipschitz density $f$ supported on $[0,1]$. Here the parameter set $\Theta$ of the unknown $f$ is the infinite-dimensional space of all possible $1$-Lipschitz functions on $[0,1]$, and we call this model non-parametric. The target may be to estimate the density $f$ at a point, the entire density, or some functional of the density. In respective settings, the loss functions can be

$L_1(f,T) = |T - f(0)|, \quad L_2(f,T) = \int_0^1 (T(x) - f(x))^2dx, \quad L_3(f,T) = |T - \|f\|_1|.$

Example 3 By allowing general action spaces and loss functions, the decision-theoretic framework can also incorporate some non-statistical examples. For instance, in stochastic optimization $\theta\in\Theta$ may parameterize a class of convex Lipschitz functions $f_\theta: [-1,1]^d\rightarrow {\mathbb R}$, and $X$ denotes the noisy observations of the gradients at the queried points. Then the action space $\mathcal{A}$ may just be the entire domain $[-1,1]^d$, and the loss function $L$ is the optimality gap defined as

$L(\theta, a) = f_\theta(a) - \min_{a^\star \in [0,1]^d} f_\theta(a^\star).$

In practice, one would like to find optimal decision rules for a given task. However, the risk is a function of $\theta$ and it is hard to compare two risk functions directly. Hence, people typically map the risk functions into scalars and arrive at the following minimax and Bayesian paradigm.

Definition 2 (Minimax and Bayes Decision Rule) The minimax decision rule is the decision rule $\delta$ which minimizes the quantity $\max_{\theta\in\Theta} R_\theta(\delta)$. The Bayes decision rule under distribution $\pi(d\theta)$ (called the prior distribution) is the decision rule $\delta$ which minimizes the quantity $\int R_\theta(\delta)\pi(d\theta)$

It is typically hard to find the minimax decision rule in practice, while the Bayes decision rule admits a closed-form expression (although hard to compute in general).

Proposition 3 The Bayes decision rule under prior $\pi$ is given by the estimator

$T(x) \in \arg\min_{a\in\mathcal{A}} \int L(\theta,a)\pi(d\theta|x), \ \ \ \ \ (3)$

where $\pi(d\theta|x)$ denotes the posterior distribution of $\theta$ under $\pi$ (assuming the existence of regular posterior).

Proof: Left as an exercise for the reader. $\Box$

2. Deficiency and Model Distance

The central target of statistical inference is to propose some decision rule for a given statistical model with small risks. Similarly, the counterpart in the lower bound is to prove that certain risks are unavoidable for any decision rules. Here to compare risks, we may either compare the entire risk function, or its minimax or Bayes version. In this lecture we will focus on the risk function, and many later lectures will be devoted to appropriate minimax risks.

We start with the task of comparing two statistical models with the same parameter set $\Theta$. Intuitively, one may think that the model with a smaller noise level would be better than the other, e.g., the model $\mathcal{M}_1 = \{\mathcal{N}(\theta,1): \theta\in {\mathbb R} \}$ should be better than $\mathcal{M}_2 = \{\mathcal{N}(\theta,2): \theta\in {\mathbb R} \}$. However, this criterion is bad due to two reasons:

1. There is no proper notion of noise for general (especially non-additive) statistical models;
2. Even if a natural notion of noise exists for certain models, it is not necessarily true that the model with smaller noise is always better. For example, let $\mathcal{M}_1 = \{\text{Unif}\{\theta-1,\theta+1 \}: |\theta|\le 1\}, \quad \mathcal{M}_2 = \{\text{Unif}\{\theta-3,\theta+3 \}: |\theta|\le 1\},$
the model $\mathcal{M}_1$ has a smaller magnitude of the additive noise. However, despite with a larger magnitude of the noise, the model $\mathcal{M}_2$ is actually noise-free since one can decode the perfect knowledge of $\theta$ from the observation.

To overcome the above difficulties, we introduce the idea of model reduction. Specifically, we aim to establish the fact that for any decision rule in one model, there is another decision rule in the other model which is uniformly no worse than the former rule. The idea of reduction appears in many fields, e.g., in P/NP theory it is sufficient to work out one NP-complete instance (e.g., circuit satisfiability) from scratch and establish all others by polynomial reduction. This reduction idea is made precise via the following definition.

Definition 4 (Model Deficiency) For two statistical models $\mathcal{M} = (\mathcal{X}, \mathcal{F}, (P_{\theta})_{\theta\in \Theta})$ and $\mathcal{N} = (\mathcal{Y}, \mathcal{G}, (Q_{\theta})_{\theta\in \Theta})$, we call $\mathcal{M}$ is $\varepsilon$-deficient relative to $\mathcal{N}$ if for any finite subset $\Theta_0\subseteq \Theta$, any finite action space $\mathcal{A}$, any loss function $L: \Theta_0\times \mathcal{A}\rightarrow [0,1]$, and any decision rule $\delta_{\mathcal{N}}$ based on model $\mathcal{N}$, there exists some decision rule $\delta_{\mathcal{M}}$ based on model $\mathcal{M}$ such that

$R_\theta(\delta_{\mathcal{M}}) \le R_\theta(\delta_{\mathcal{N}}) + \varepsilon, \qquad \forall \theta\in \Theta_0. \ \ \ \ \ (4)$

Note that the definition of model deficiency does not involve the specific choice of the action space and loss function, and the finiteness of $\Theta_0$ and $\mathcal{A}$ in the definition is mainly for technical purposes. Intuitively speaking, $\mathcal{M}$ is $\varepsilon$-deficient relative to $\mathcal{N}$ if the entire risk function of some decision rule in $\mathcal{M}$ is no worse than that of any given decision rule in $\mathcal{N}$, within an additive gap $\varepsilon$

Although Definition 4 gives strong performance guarantees for the $\varepsilon$-deficient model $\mathcal{M}$, it is difficult to verify the condition (4), i.e., to transform an arbitrary decision rule $\delta_\mathcal{N}$ to some proper $\delta_\mathcal{M}$. A crucial observation here is that it may be easier to transform between models $\mathcal{M}$ and $\mathcal{N}$, and in particular, when $\mathcal{N}$ is a randomization of $\mathcal{M}$. For example, if there exists some stochastic kernel $\mathsf{K}: \mathcal{X} \rightarrow \mathcal{Y}$ such that $Q_\theta = \mathsf{K}P_\theta$ for all $\theta\in\Theta$ (where $\mathsf{K}P_\theta$ denotes the marginal distribution of the output of kernel $\mathsf{K}$ with input distributed as $P_\theta$), then we may simply set $\delta_\mathcal{M} = \delta_\mathcal{N} \circ \mathsf{K}$ to arrive at (4) with $\varepsilon=0$. The following theorem shows that model deficiency is in fact equivalent to approximate randomization.

Theorem 5 Model $\mathcal{M}$ is $\varepsilon$-deficient with respect to $\mathcal{N}$ if and only if there exists some stochastic kernel $\mathsf{K}: \mathcal{X} \rightarrow \mathcal{Y}$ such that

$\sup_{\theta\in\Theta} \|Q_\theta - \mathsf{K}P_\theta \|_{\text{\rm TV}} \le \varepsilon,$

where $\|P-Q\|_{\text{\rm TV}} := \frac{1}{2}\int |dP-dQ|$ is the total variation distance between probability measures $P$ and $Q$

Proof: The sufficiency part is easy. Given such a kernel $\mathsf{K}$ and a decision rule $\delta_\mathcal{N}$ based on model $\mathcal{N}$, we simply set $\delta_\mathcal{M} = \delta_\mathcal{N} \circ \mathsf{K}$, i.e., transmit the output through kernel $\mathsf{K}$ and apply $\delta_\mathcal{N}$. Then for any $\theta\in\Theta$

$\begin{array}{rcl} R_\theta(\delta_{\mathcal{M}}) - R_\theta(\delta_{\mathcal{N}}) &=& \iint L(\theta,a)\delta_\mathcal{N}(y,da) \left[\int P_\theta(dx)\mathsf{K}(dy|x)- Q_\theta(dy) \right] \\ &\le & \|Q_\theta - \mathsf{K}P_\theta \|_{\text{TV}} \le \varepsilon, \end{array}$

for the loss function is non-negative and upper bounded by one.

The necessity part is slightly more complicated, and for simplicity we assume that all $\Theta, \mathcal{X}, \mathcal{Y}$ are finite (the general case requires proper limiting arguments). In this case, any decision rules $\delta_\mathcal{M}$ or $\delta_\mathcal{N}$, loss functions $L$ and priors $\pi(d\theta)$ can be represented by a finite-dimensional vector. Given $\mathcal{A}$ and $\delta_{\mathcal{N}}$, the condition (4) ensures that

$\sup_{L(\theta,a),\pi(d\theta)} \inf_{\delta_{\mathcal{M}}}\iint L(\theta,a)\pi(d\theta)\left[\int \delta_\mathcal{M}(x,da)P_\theta(dx) - \int \delta_\mathcal{N}(y,da)Q_\theta(dy)\right] \le \varepsilon. \ \ \ \ \ (5)$

Note that the LHS of (5) is bilinear in $L(\theta,a)\pi(d\theta)$ and $\delta_\mathcal{M}(x,da)$, both of which range over some convex sets (e.g., the domain for $M(\theta,a) := L(\theta,a)\pi(d\theta)$ is exactly $\{M\in [0,1]^{\Theta\times \mathcal{A}}: \sum_\theta \|M(\theta, \cdot)\|_\infty \le 1 \}$), the minimax theorem allows to swap $\sup$ and $\inf$ of (5) to obtain that

$\inf_{\delta_{\mathcal{M}}}\sup_{L(\theta,a),\pi(d\theta)}\iint L(\theta,a)\pi(d\theta)\left[\int \delta_\mathcal{M}(x,da)P_\theta(dx) - \int \delta_\mathcal{N}(y,da)Q_\theta(dy)\right] \le \varepsilon. \ \ \ \ \ (6)$

By evaluating the inner supremum, (6) implies the existence of some $\delta_\mathcal{M}^\star$ such that

$\sup_{\theta\in\Theta} \frac{1}{2}\int_{\mathcal{A}} \left| \int_{\mathcal{X}} \delta_\mathcal{M}^\star(x,da)P_\theta(dx) - \int_{\mathcal{Y}} \delta_\mathcal{N}(y,da)Q_\theta(dy)\right| \le \varepsilon. \ \ \ \ \ (7)$

Finally, choosing $\mathcal{A}=\mathcal{Y}$ and $\delta_\mathcal{N}(y,da) = 1(y=a)$ in (7), the corresponding $\delta_\mathcal{M}^\star$ is the desired kernel $\mathsf{K}$. $\Box$

Based on the notion of deficiency, we are ready to define the distance between statistical models, also known as the Le Cam’s distance

Definition 6 (Le Cam’s Distance) For two statistical models $\mathcal{M}$ and $\mathcal{N}$ with the same parameter set $\Theta$, Le Cam’s distance $\Delta(\mathcal{M},\mathcal{N})$ is defined as the infimum of $\varepsilon\ge 0$ such that $\mathcal{M}$ is $\varepsilon$-deficient relative to $\mathcal{N}$, and $\mathcal{N}$ is $\varepsilon$-deficient relative to $\mathcal{M}$

It is a simple exercise to show that Le Cam’s distance is a pesudo-metric in the sense that it is symmetric and satisfies the triangle inequality. The main importance of Le Cam’s distance is that it helps to establish equivalence between some statistical models, and people are typically interested in the case where $\Delta(\mathcal{M},\mathcal{N})=0$ or $\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0$. The main idea is to use randomization (i.e., Theorem 5) to obtain an upper bound on Le Cam’s distance, and then apply Definition 4 to deduce useful results (e.g., to carry over an asymptotically optimal procedure in one model to other models).

In the remainder of this lecture, I will give some examples of models whose distance is zero or asymptotically zero. In the next lecture it will be shown that regular models will always be close to some Gaussian location model asymptotically, and thereby the classical asymptotic theory of statistics can be established.

3. Equivalence between Models

3.1. Sufficiency

We first examine the case where $\Delta(\mathcal{M},\mathcal{N})=0$. By Theorem 5, models $\mathcal{M}$ and $\mathcal{N}$ are mutual randomizations. In the special case where $Y=T(X)\sim Q_\theta$ is a deterministic function of $X\sim P_\theta$ (thus $Q_\theta=P_\theta\circ T^{-1}$ is the push-forward measure of $P_\theta$ through $T$), we have the following result.

Theorem 7 Under the above setting, $d(\mathcal{M},\mathcal{N})=0$ if and only if $\theta-Y-X$ forms a Markov chain.

Note that the Markov condition $\theta-Y-X$ is the usual definition of sufficient statistics, and also gives the well-known Rao–Blackwell factorization criterion for sufficiency. Hence, sufficiency is in fact a special case of model equivalence, and deficiency can be thought of as approximate sufficiency.

3.2. Equivalence between Multinomial and Poissonized Models

Consider a discrete probability vector $P=(p_1,\cdots,p_k)$ with $p_i\ge 0, \sum_{i=1}^k p_i=1$. A widely-used model in practice is the multinomial model $\mathcal{M}_n$, which models the i.i.d. sampling process and draws i.i.d. observations $X_1,\cdots, X_n\sim P$. However, a potential difficulty in handling multinomial models is that the empirical frequencies $\hat{p}_1, \cdots, \hat{p}_k$ of symbols are dependent, which makes the analysis annoying. To overcome this difficulty, a common procedure is to consider a Poissonized model $\mathcal{N}_n$, where we draw a Poisson random variable $N\sim \text{Poisson}(n)$ first and observes i.i.d. $X_1,\cdots,X_N\sim P$. Due to the nice properties of Poisson random variables, the empirical frequencies now follow independent scaled Poisson distribution.

The next theorem shows that the multinomial and Poissonized models are asymptotically equivalent, which means that it actually does no harm to consider the more convenient Poissonized model for analysis, at least asymptotically. In later lectures I will also show a non-asymptotic result between these two models.

Theorem 8 For fixed $k$, $\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0$

Proof: We only show that $\mathcal{M}_n$ is $\varepsilon_n$-deficient relative to $\mathcal{N}_n$, with $\lim_{n\rightarrow\infty} \varepsilon_n=0$, where the other direction is analogous. By Theorem 5, it suffices to show that $\mathcal{N}_n$ is an approximate randomization of $\mathcal{M}_n$. The randomization procedure is as follows: based on the observations $X_1,\cdots,X_n$ under the multinomial model, let $P_n=(\hat{p}_1,\cdots,\hat{p}_k)$ be the vector of empirical frequencies. Next draw an independent random variable $N\sim \text{Poisson}(n)$. If $N\le n$, let $(X_1,\cdots,X_N)$ be the output of the kernel. Otherwise, we generate i.i.d. samples $X_{n+1}', \cdots, X_N'\sim P_n$, and let $(X_1,\cdots,X_n,X_{n+1}',\cdots,X_N')$ be the output. We remark that it is important that the above randomization procedure does not depend on the unknown $P$

Let $\mathcal{N}_P, \mathcal{N}_P'$ be the distribution of the Poissonized and randomized model under true parameter $P$, respectively. Now it is easily shown that

$\|\mathcal{N}_P- \mathcal{N}_P' \|_{\text{TV}} = \mathop{\mathbb E}_m \mathop{\mathbb E}_{X^n} \|P_n^{\otimes m} - P^{\otimes m} \|_{\text{TV}}, \ \ \ \ \ (8)$

where $m:=(N-n)_+$, $P^{\otimes m}$ denotes the $m$-fold produce of $P$, $\mathop{\mathbb E}_m$ takes the expectation w.r.t. $m$, and $\mathop{\mathbb E}_{X^n}$ takes the expectation w.r.t. random samples $X^n\sim P$. To upper bound the total variation distance in (8), we shall need the following lemma.

Lemma 9 Let $D_{\text{\rm KL}}(P\|Q) = \int dP\log \frac{dP}{dQ}$ and $\chi^2(P,Q) = \int \frac{(dP-dQ)^2}{dQ}$ be the KL divergence and $\chi^2$-divergence, respectively. Then

$2\|P-Q \|_{\text{\rm TV}}^2 \le D_{\text{\rm KL}}(P\|Q) \le \chi^2(P,Q).$

The proof of Lemma 9 will be given in later lectures when we talk about joint ranges of divergences. Then by Lemma 9 and Jensen’s inequality,

$\begin{array}{rcl} \mathop{\mathbb E}_{X^n} \|P_n^{\otimes m} - P^{\otimes m} \|_{\text{TV}} & \le & \mathop{\mathbb E}_{X^n}\sqrt{\frac{1}{2} D_{\text{KL}}(P_n^{\otimes m},P^{\otimes m} ) }\\ &=& \mathop{\mathbb E}_{X^n}\sqrt{\frac{m}{2} D_{\text{KL}}(P_n,P ) } \\ &\le& \mathop{\mathbb E}_{X^n}\sqrt{\frac{m}{2} \chi^2(P_n,P ) }\\ &\le& \sqrt{\frac{m}{2} \mathop{\mathbb E}_{X^n}\chi^2(P_n,P ) }. \end{array}$

Since by simple algebra,

$\mathop{\mathbb E}_{X^n}\chi^2(P_n,P ) = \sum_{i=1}^k \frac{\mathop{\mathbb E}_{X^n} (\hat{p}_i-p_i)^2 }{p_i} = \sum_{i=1}^k \frac{p_i(1-p_i)}{np_i} = \frac{k-1}{n},$

then by (8) we obtain that

$\|\mathcal{N}_P- \mathcal{N}_P' \|_{\text{TV}} \le \mathop{\mathbb E}_m \sqrt{\frac{m(k-1)}{2n}} \le \sqrt{\frac{k-1}{2n}}\cdot (\mathop{\mathbb E} m^2)^{\frac{1}{4}} \le \sqrt{\frac{k-1}{2\sqrt{n}}},$

which goes to zero uniformly in $P$ as $n\rightarrow\infty$, as desired. $\Box$

Remark 1 In fact, the following non-asymptotic characterization

$\Delta(\mathcal{M}_{n,k}, \mathcal{N}_{n,k}) \asymp \min\{1, \sqrt{k/n} \}$

could be established. This result is contained in an upcoming paper.

3.3. Equivalence between Nonparametric Regression and Gaussian White Noise Models

A well-known problem in nonparametric statistics is the nonparametric regression:

$y_i = f\left(\frac{i}{n}\right) + \sigma\xi_i, \qquad i=1,\cdots,n, \quad \xi_i\overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1), \ \ \ \ \ (9)$

where the underlying regression function $f$ is unknown, and $\sigma>0$ is some noise level. Typically, the statistical goal is to recover the function $f$ at some point or globally, and some smoothness conditions are necessary to perform this task. A typical assumption is that $f\in \mathcal{H}^s(L)$ belongs to some H\”{o}lder ball, where

$\mathcal{H}^s(L) := \left\{f\in C[0,1]: \sup_{x\neq y}\frac{|f^{(m)}(x) - f^{(m)}(y)| }{|x-y|^\alpha} \le L\right\},$

with $s=m+\alpha, m\in {\mathbb N}, \alpha\in (0,1]$ denotes the smoothness parameter.

There is also a continuous version of (9) called the Gaussian white noise model, where a process $(Y_t)_{t\in [0,1]}$ satisfying the following stochastic differential equation is observed:

$dY_t = f(t)dt + \frac{\sigma}{\sqrt{n}}dB_t, \qquad t\in [0,1], \ \ \ \ \ (10)$

where $(B_t)_{t\in [0,1]}$ is the standard Brownion motion. Compared with the regression model in (9), the white noise model in (10) gets rid of the quantization issue of $[0,1]$ and is therefore easier to analyze. Note that in both models $n$ is effectively the sample size.

Let $\mathcal{M}_n, \mathcal{N}_n$ be the regression and white noise models with known parameters $(\sigma,L)$ and the paramter set $f\in \mathcal{H}^s(L)$, respectively. The main result in this section is that, when $s>1/2$, these models are asymptotically equivalent.

Theorem 10 If $s>1/2$, we have $\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0$

Proof: Consider another Gaussian white noise model $\mathcal{N}_n^\star$ where the only difference is to replace $f$ in (10) by $f^\star$ defined as

$f^\star(t) = \sum_{i=1}^n f\left(\frac{i}{n}\right) 1\left(\frac{i-1}{n}\le t<\frac{i}{n}\right), \qquad t\in [0,1].$

Note that under the same parameter $f$, we have

$\begin{array}{rcl} D_{\text{KL}}(P_{Y_{[0,1]}^\star} \| P_{Y_{[0,1]}}) &=& \frac{n}{2\sigma^2}\int_0^1 (f(t) - f^\star(t))^2dt\\ & =& \frac{n}{2\sigma^2}\sum_{i=1}^n \int_{(i-1)/n}^{i/n} (f(t) - f(i/n))^2dt \\ & \le & \frac{L^2}{2\sigma^2}\cdot n^{1-2(s\wedge 1)}, \end{array}$

which goes to zero uniformly in $f$ as $n\rightarrow\infty$. Therefore, by Theorem 5 and Lemma 9, we have $\Delta(\mathcal{N}_n, \mathcal{N}_n^\star)\rightarrow 0$. On the other hand, in the model $\mathcal{N}_n^\star$ the likelihood ratio between the signal distribution $P_{Y^\star}$ and the pure noise distribution $P_{Z^\star}$ is

$\begin{array}{rcl} \frac{dP_Y}{dP_Z}((Y_t^\star)_{t\in [0,1]}) &=& \exp\left(\frac{n}{2\sigma^2}\left(\int_0^1 2f^\star(t)dY_t^\star-\int_0^1 f^\star(t)^2 dt \right)\right) \\ &=& \exp\left(\frac{n}{2\sigma^2}\left(\sum_{i=1}^n 2f(i/n)(Y_{i/n}^\star - Y_{(i-1)/n}^\star) -\int_0^1 f^\star(t)^2 dt \right)\right). \end{array}$

As a result, under model $\mathcal{N}_n^\star$, there is a Markov chain $f \rightarrow (n(Y_{i/n}^\star - Y_{(i-1)/n}^\star))_{i\in [n]}\rightarrow (Y_t^\star)_{t\in [0,1]}$. Since under the same parameter $f$, $(n(Y_{i/n}^\star - Y_{(i-1)/n}^\star))_{i\in [n]}$ under $\mathcal{N}_n^\star$ is identically distributed as $(y_i)_{i\in [n]}$ under $\mathcal{M}_n$, by Theorem 7 we have exact sufficiency and conclude that $\Delta(\mathcal{M}_n, \mathcal{N}_n^\star)=0$. Then the rest follows from the triangle inequality. $\Box$

3.4. Equivalence between Density Estimation and Gaussian White Noise Models

Another widely-used model in nonparametric statistics is the density estimation model, where samples $X_1,\cdots,X_n$ are i.i.d. drawn from some unknown density $f$. Typically some smoothness condition is also necessary for the density, and we assume that $f\in \mathcal{H}^s(L)$ again belongs to the H\”{o}lder ball.

Compared with the previous results, a slightly more involved result is that the density estimation model, albeit with a seemingly different form, is also asymptotically equivalent to a proper Gaussian white noise model. However, here the Gaussian white noise model should take the following different form:

$dY_t = \sqrt{f(t)}dt + \frac{1}{2\sqrt{n}}dB_t, \qquad t\in [0,1]. \ \ \ \ \ (11)$

In other words, in nonparametric statistics the problems of density estimation, regression and estimation in Gaussian white noise are all asymptotically equivalent, under certtain smoothness conditions.

Let $\mathcal{M}_n, \mathcal{N}_n$ be the density estimation model and the Gaussian white noise model in (11), respectively. The main result is summarized in the following theorem.

Theorem 11 If $s>1/2$ and the density $f$ is bounded below from zero everywhere, then $\lim_{n\rightarrow\infty} \Delta(\mathcal{M}_n, \mathcal{N}_n)=0$

Proof: Instead of the original density estimation model, we actually consider a Poissonized sampling model $\mathcal{M}_{n,P}$ instead, where the observation under $\mathcal{M}_{n,P}$ is a Poisson process $(Z_t)_{t\in [0,1]}$ on $[0,1]$ with intensity $nf(t)$. Similar to the proof of Theorem 8, we have $\Delta(\mathcal{M}_n, \mathcal{M}_{n,P})\rightarrow 0$ and it remains to show that $\Delta(\mathcal{N}_n, \mathcal{M}_{n,P})\rightarrow 0$

Fix an equal-spaced grid $t_0=0,t_1, t_2, \cdots, t_m=1$ in $[0,1]$ with $m=n^{1-\varepsilon}$, where $\varepsilon>0$ is a small constant depending only on $s$. Next we come up with two new models $\mathcal{M}_{n,P}^\star$ and $\mathcal{N}_n^\star$, where the only difference is that the parameter $f$ is replaced by $f^\star$ defined as

$f^\star(t) = \sum_{i=1}^M f(t_i) 1\left(t_{i-1}\le t

As long as $2s(1-\varepsilon)>1$, the same arguments in the proof of Theorem 10 can be applied to arrive at $\Delta(\mathcal{M}_{n,P}^\star, \mathcal{M}_{n,P}), \Delta(\mathcal{N}_{n}^\star, \mathcal{N}_{n})\rightarrow 0$ (for the white noise model, the assumption that $f$ is bounded away from zero ensures the smoothness of $\sqrt{f}$). Hence, it further suffices to focus on the new models and show that $\Delta(\mathcal{M}_{n,P}^\star, \mathcal{N}_n^\star)\rightarrow 0$. An interesting observation is that under the model $\mathcal{M}_{n,P}^\star$, the vector $\mathbf{Z}=(Z_1,\cdots,Z_m)$ with

$Z_i = \sum_{j=1}^N 1(t_{i-1}\le X_j

is sufficient. Moreover, under the model $\mathcal{N}_n^\star$, the vector $\mathbf{Y}=(Y_1,\cdots,Y_m)$ with

$Y_i = \sqrt{nm}(Y_{t_i} - Y_{t_{i-1}}) \sim \mathcal{N}\left(\sqrt{n^\varepsilon f(t_i)}, \frac{1}{4}\right), \quad i\in [m]$

is also sufficient. Further, all entries of $\mathbf{Y}$ and $\mathbf{Z}$ are mutually independent. Hence, the ultimate goal is to find mutual randomizations between $\mathbf{Y}$ and $\mathbf{Z}$ for $f\in \mathcal{H}^s(L)$

To do so, a first attempt would be to find a bijective mapping $Y_i \leftrightarrow Z_i$ independently for each $i$. However, this approach would lose useful information from the neighbors as we know that $f(t_i)\approx f(t_{i+1})$ thanks to the smoothness of $f$. For example, we have $Y_1|Y_1+Y_2 \sim \text{Binomial}(Y_1+Y_2, p)$ with $p = \frac{f(t_1)}{f(t_1) + f(t_2)}\approx \frac{1}{2}$, and $Z_1 - Z_2\sim \mathcal{N}(\mu, \frac{1}{2})$ with $\mu = n^{\varepsilon/2}(\sqrt{f(t_1)} - \sqrt{f(t_2)})\approx 0$. Motivated by this fact, we represent $\mathbf{Y}$ and $\mathbf{Z}$ in the following bijective way (assume that $m$ is even):

$\begin{array}{rcl} \mathbf{Y}' &=& (Y_1, Y_1 + Y_2, Y_3, Y_3 + Y_4, \cdots, Y_{m-1}, Y_{m-1} + Y_m), \\ \mathbf{Z}' &=& (Z_1-Z_2, Z_1+Z_2, \cdots, Z_m-Z_{m-1}, Z_m+Z_{m-1}). \end{array}$

Note that $(Y_1+Y_2,Y_3+Y_4,\cdots,Y_{m-1}+Y_m)$ is again an independent Poisson vector, we may repeat the above transformation for this new vector. Similar things also hold for $\mathbf{Z}'$. Hence, at each iteration we may leave half of the components unchanged, and apply the above transformations to the other half. We repeat the iteration for $\log_2 \sqrt{n}$ times (assuming $\sqrt{n}$ is a power of $2$), so that finally we arrive at a vector of length $m/\sqrt{n} = n^{1/2-\varepsilon}$ consisting of sums. Let $\mathbf{Y}^{(1)}$ (resp. $\mathbf{Z}^{(1)}$) be the final vector of sums, and $\mathbf{Y}^{(2)}$ (resp. $\mathbf{Z}^{(2)}$) be the vector of remaining entries which are left unchanged at some iteration.

Remark 2 Experienced readers may have noticed that these are the wavelet coefficients under the Haar wavelet basis, where superscripts $1$ and $2$ stand for father and mother wavelets, respectively.

Next we are ready to describe the randomization procedure. For notational simplicity we will write $Y_1+Y_2$ as a representative example of an entry in $\mathbf{Y}^{(1)}$, and write $Y_1$ as a representative example of an entry in $\mathbf{Y}^{(2)}$

For entries in $\mathbf{Y}^{(1)}$, note that by the delta method, for $Y\sim \text{Poisson}(\lambda)$, the random variable $\sqrt{Y}$ is approximately distributed as $\mathcal{N}(\sqrt{\lambda},1/4)$ (in fact, the squared root is the variance-stabilizing transformation for Poisson random variables). The exact transformation is then given by

$Y_1 + Y_2 \mapsto \text{sign}(Y_1 + Y_2 +U)\cdot \sqrt{|Y_1 + Y_2 + U|}, \ \ \ \ \ (12)$

where $U\sim \text{Uniform}([-1/2,1/2])$ is an independent auxiliary variable. The mapping (12) is one-to-one and can thus be inverted as well.

For entries in $\mathbf{Y}^{(2)}$, we aim to use quantile transformations to convert $\text{Binomial}(Y_1+Y_2, 1/2)$ to $\mathcal{N}(0,1/2)$. For $k\ge 0$, let $F_k$ be the CDF of $\text{Binomial}(k, 1/2)$, and $\Phi$ be the CDF of $\mathcal{N}(0,1)$. Then the one-to-one quantile transformation is given by

$Y_1 \mapsto \frac{1}{\sqrt{2}}\Phi^{-1}(F_{Y_1+Y_2}(Y_1+U)), \ \ \ \ \ (13)$

where again $U\sim \text{Uniform}([-1/2,1/2])$ is an independent auxiliary variable. The output given by (13) will be expected to be close in distribution to $Z_1-Z_2$, and the overall transformation is also invertible.

The approximation properties of these transformations are summarized in the following theorem.

Theorem 12 Sticking to the specific examples of $Y_1$ and $Y_1 + Y_2$, let $P_1, P_2$ be the respective distributions of the RHS in (12) and (13), and $Q_1, Q_2$ be the respective distributions of $Z_1 + Z_2$ and $Z_1 - Z_2$, we have

$\begin{array}{rcl} H^2(P_1, Q_1) & \le & \frac{C}{n^\varepsilon (f(t_1) + f(t_2))}, \\ H^2(P_2, Q_2) & \le & C\left(\frac{f(t_1)-f(t_2)}{f(t_1)+f(t_2)} \right)^2 + Cn^\varepsilon \left(\frac{f(t_1)-f(t_2)}{f(t_1)+f(t_2)} \right)^4. \end{array}$

where $C>0$ is some universal constant, and $H^2(P,Q) := \int (\sqrt{dP}-\sqrt{dQ})^2$ denotes the Hellinger distance.

The proof of Theorem 12 is purely probabilitistic and involved, and is omitted here. Applying Theorem 12 to the vector $\mathbf{Y}^{(1)}$ of length $m/\sqrt{n}$, each component is the sum of $\sqrt{n}$ elements bounded away from zero. Consequently, let $\mathsf{K}$ be the overall transition kernel of the randomization, the inequality $H^2(\otimes_i P_i, \otimes_i Q_i)\le \sum_i H^2(P_i,Q_i)$ gives

$H^2(\mathsf{K}P_{\mathbf{Y}^{(1)}}, P_{\mathbf{Z}^{(1)}}) = O\left( \frac{m}{\sqrt{n}}\cdot \frac{1}{n^\varepsilon \cdot n^{1/2}} \right) = O(n^{-2\varepsilon}) \rightarrow 0.$

As for the vector $\mathbf{Y}^{(2)}$, the components lie in $\ell_{\max} := \log_2 \sqrt{n}$ possible different levels. At level $\ell\in [\ell_{\max}]$, the spacing of the grid becomes $n^{-1+\varepsilon}\cdot 2^{\ell}$, and there are $m\cdot 2^{-\ell}$ elements. Also, we have $(f(t_1)-f(t_2))/(f(t_1)+f(t_2))=O(n^{(\varepsilon-1)s'}\cdot 2^{\ell s'})$ at $\ell$-th level, with $s':= s\wedge 1$. Consequently,

$\begin{array}{rcl} H^2(\mathsf{K}P_{\mathbf{Y}^{(2)}}, P_{\mathbf{Z}^{(2)}}) &\le & \sum_{\ell=1}^{\ell_{\max}} m2^{-\ell}\cdot O\left((n^{(\varepsilon-1)s'}\cdot 2^{\ell s'})^2 + n^\varepsilon 2^\ell\cdot (n^{(\varepsilon-1)s'}\cdot 2^{\ell s'})^4 \right) \\ &= & O\left(n^{-(1-\varepsilon)(2s'-1)}\cdot 2^{(2s'-1)\ell_{\max}} + n\cdot (n^{-1+\varepsilon}2^{\ell_{\max}})^{4s'} \right) \\ &=& O\left(n^{-(1/2-\varepsilon)(2s'-1)} + n^{1-2s'(1-2\varepsilon)} \right). \end{array}$

Since $s'>1/2$, we may choose $\varepsilon$ to be sufficiently small (i.e., $2s'(1-2\varepsilon)>1$) to make $H^2(\mathsf{K}P_{\mathbf{Y}^{(2)}}, P_{\mathbf{Z}^{(2)}}) = o(1)$. The proof is completed. $\Box$

4. Bibliographic Notes

The statistical decision theory framework dates back to Wald (1950), and is currently the elementary course for graduate students in statistics. There are many excellent textbooks on this topic, e.g., Lehmann and Casella (2006) and Lehmann and Romano (2006). The concept of model deficiency is due to Le Cam (1964), where the randomization criterion (Theorem 5) was proved. The present form is taken from Torgersen (1991). We also refer to the excellent monographs by Le Cam (1986) and Le Cam and Yang (1990).

The asymptotic equivalence between nonparametric models has been studied by a series of papers since 1990s. The equivalence between nonparametric regression and Gaussian white noise models (Theorem 10) was established in Brown and Low (1996), where both the fixed and random designs were studied. It was also shown in a follow-up work (Brown and Zhang 1998) that these models are non-equivalent if $s\le 1/2$. The equivalence of the density estimation model and others (Theorem 11) was established in Brown et al. (2004), and we also refer to a recent work (Ray and Schmidt-Hieber 2016) which relaxed the crucial assumption that the density is bounded below from zero. Poisson approximation or Poissonization is a well-known technique widely used in probability theory, statistics and theoretical computer science, and the current treatment is essentially taken from Brown et al. (2004).

1. Abraham Wald, Statistical decision functions. Wiley, 1950.
2. Erich L. Lehmann and George Casella, Theory of point estimation. Springer Science & Business Media, 2006.
3. Erich L. Lehmann and Joseph P. Romano, Testing statistical hypotheses. Springer Science & Business Media, 2006.
4. Lucien M. Le Cam, Sufficiency and approximate sufficiency. The Annals of Mathematical Statistics (1964): 1419-1455.
5. Erik Torgersen, Comparison of statistical experiments. Cambridge University Press, 1991.
6. Lucien M. Le Cam, Asymptotic methods in statistical theory. Springer, New York, 1986.
7. Lucien M. Le Cam and Grace Yang, Asymptotics in statistics. Springer, New York, 1990.
8. Lawrence D. Brown and Mark G. Low, Asymptotic equivalence of nonparametric regression and white noise. The Annals of Statistics 24.6 (1996): 2384-2398.
9. Lawrence D. Brown and Cun-Hui Zhang, Asymptotic nonequivalence of nonparametric experiments when the smoothness index is $1/2$. The Annals of Statistics 26.1 (1998): 279-287.
10. Lawrence D. Brown, Andrew V. Carter, Mark G. Low, and Cun-Hui Zhang, Equivalence theory for density estimation, Poisson processes and Gaussian white noise with drift. The Annals of Statistics 32.5 (2004): 2074-2097.
11. Kolyan Ray, and Johannes Schmidt-Hieber, The Le Cam distance between density estimation, Poisson processes and Gaussian white noise. arXiv preprint arXiv:1608.01824 (2016).

# Lecture 2: Importance of Problem Structures

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

This lecture continues the introduction of information-theoretic lower bounds without formally developing tools. The emphasis of this lecture is that, the dependence of the fundamental limits on the problem structure may be very subtle, where the results may change significantly even with tiny changes in the problem setting. Hence, one should be very careful when applying some known lower bounds to similar problems, and it is vital to fully understand the specific problem structure. We will use a specific example of communication complexity in the theoretical computer science literatue to illustrate this phenomenon.

1. Equality Evaluation: Deterministic Communication Complexity

Alice and Bob hold vectors $x,y \in \{0,1\}^n$ respectively, and they would like to check whether $x=y$ or not. However, they live very far from each other, and some limited resources of communication are available. What is the minimum number of communication bits to perform this task reliably, in the sense that they always find the correct answer for any pairs $(x,y)$? In other words, what is the deterministic communication complexity of the distributed computing problem for the equality function $f(x,y)=1(x=y)$

To answer this question, we need to formally define all possible communication protocols. Here we define a general class of blackboard communication protocols.

Definition 1 (Blackboard Communication Protocol) A blackboard communication protocol over domain $X\times Y$ with range $Z$ is a binary tree where each internal node $v$ is labeled by a function $a_v: X\rightarrow \{0,1\}$ or by a function $b_v: Y\rightarrow \{0,1\}$, and each leaf is labeled by an element $z\in Z$.

The value of the protocol on input $(x,y)$ is the label of the leaf reached by starting from the root and walking on the tree using the following rule: at each internal node labeled by $a_v$, walk left if $a_v(x)=0$ and right if $a_v(x)=1$; at each internal node labeled by $b_v$, walk left if $b_v(y)=0$ and right if $b_v(y)=1$. The cost of the protocol is the height of the tree.

Intuitively speaking, there is a blackboard in front of both Alice and Bob in a blackboard communication protocol on which they can write binary bits, and this process continues sequentially where later bits can depend on the entire history. The formal definition above simply uses a binary tree to keep track of the history, and uses functions $a_v, b_v$ to illustrate who will be writing on the blackboard at stage $v$. The communication cost of a given protocol is simply the maximum length of the message written on the blackboard, where the maximum is taken over all possible input pairs $(x,y)\in X\times Y$

Now the question we are asking is: among all blackboard communication protocols which always give the correct evaluation of the equality function, what is the minimum communication cost? This is also known as the deterministic communication complexity, where “deterministic” here means zero-error.

To prove a lower bound on the deterministic complexity, we need to make the following key observation:

Lemma 2 (Copy-paste Property) If a blackboard communication protocol arrives at the same leaf node $v$ on both inputs $(x_1,y_1)$ and $(x_2,y_2)$, then on input $(x_1,y_2)$ the protocol also arrives at $v$

Proof: For each internal nodes $v'$ labeled by $a_{v}(x)$ on the path from the root to $v$, the condition ensures that $a_{v'}(x_1) = a_{v'}(x_2)$. Similarly, $b_{v'}(y_1)=b_{v'}(y_2)$ if the internal node is labeled by $b_{v'}(y)$. Then the result is immediate. $\Box$

The above copy-paste property shows the following limitation of any blackboard communication protocols: the set of all input pairs $(x,y)$ arriving at any leaf node forms a rectangle, i.e., takes the form of $X_v\times Y_v$. Hence, if the value of the function to be computed is not a constant on large rectangles, each leaf node will need to take care of small rectangles and more leaves are required. This is the key insight of the following theorem.

Theorem 3 (Log-rank Inequality) Let $M\in \{0,1\}^{X\times Y}$ be a binary matrix defined as $M(x,y)=f(x,y)$ for all $x\in X, y\in Y$. Then the deterministic communication complexity of computing $f$ is at least $\log_2 \text{\rm rank}(M)$, where the linear rank is understood over ${\mathbb R}$

Proof: For any leaf node $v$ which outputs $1$, define $M_v(x,y) = 1(v\text{ is reached on input }(x,y))$. Then clearly $M= \sum_v M_v$. By the sub-additivity of rank, we have

$\text{rank}(M) \le \sum_v \text{rank}(M_v). \ \ \ \ \ (1)$

However, by Lemma 2, we have $\text{rank}(M_v)=1$ for any leaf, so inequality (1) implies that the number of leaf nodes is at least $\text{rank}(M)$, and we are done. $\Box$

Applying Theorem 3 to the equality function, the matrix $M$ is a $2^n\times 2^n$ identity matrix, and therefore the deterministic communication complexity is at least $n$. This is essentially tight: Alice can communicate the entire vector $x$ to Bob using $n$ bits and Bob then sends the 1-bit evaluation of the equality function back to Alice. Hence, the deterministic communication complexity of equality evaluation is $\Theta(n)$

2. Equality Evaluation: Randomized Private-coin Communication Complexity

Next we consider a slight variant of the previous problem: what is the communication complexity if Alice and Bob can tolerate some maximum probability of error $\varepsilon>0$ on all inputs? This quantity is called the randomized communication complexity. Here to introduce randomness, we assume that both Alice and Bob have access to some (private, not shared) random number generators. We also assume that $\varepsilon$ is a small constant (say, $\varepsilon=0.01$) and suppress the dependence of the communication complexity on $\varepsilon$

A striking result is that, if we allow some small probability of error (even of the order $1/\text{poly}(n)$), the randomized communication complexity drops from $\Theta(n)$ to $O(\log n)$. Here is a communication protocol: by Bertrand’s postulate there exists some prime number $p\in [n^2,2n^2]$. Alice chooses such a $p$ and $z\sim \text{Unif}([p])$, evaluates the polynomial

$q(x,z) = x_0 + x_1z + \cdots + x_{n-1}z^{n-1} \pmod p, \ \ \ \ \ (2)$

and sends $(p,z,q(x,z))$ to Bob using $O(\log n)$ bits, where $x=(x_0,x_1,\cdots,x_{n-1})$. Bob then evaluates $q(y,z)$ in (2) and outputs $1(q(x,z)=q(y,z))$. Clearly, if $x=y$ then Bob always outputs $1$ correctly. If $x\neq y$, the map $z\mapsto q(x,z) - q(y,z)$ is a non-zero polynomial with degree at most $n$, so it has at most $n$ zeros in $\mathbb{F}_p$. Consequently, Bob makes mistakes with probability at most $n/p\le 1/n$ for any $x\neq y$.

The previous upper bound shows that Theorem 3 no longer holds for the randomized communication complexity. In fact, a careful inspection of the proof shows that the zero-error property is crucial for Theorem 3 to hold. However, we may still perform a deterministic simulation of any randomized protocol and reduce to the previous case.

Theorem 4 Let $D(f)$ and $R(f)$ be the deterministic and randomized private-coin communication complexity of computing $f$, respectively. Then

$D(f) \le 2^{R(f)}\left(-\log_2 \left(\frac{1}{2}-\varepsilon\right) + R(f) \right). \ \ \ \ \ (3)$

Proof: Given a randomized protocol, Alice can transmit the probabilities of arriving at all $2^{R(f)}$ leaf nodes based on her input $x$. Given these probabilities, Bob can privately evaluate the probabilities of arriving at each leaf based on the inputs $(x,y)$ (which is made possible by the product law of independent events thanks to the private-coin assumption). Then Bob sums up probabilities for leaf nodes which output $0$ and $1$, respectively, and takes the majority vote. Since the error probability of the randomized protocol is at most $\varepsilon<1/2$, the final output given by the majority vote is always correct. The proof of (3) is completed by noting that these probabilities sent by Alice can be made within precision $(\frac{1}{2}-\varepsilon)2^{-R(f)}$ without hurting the majority vote. $\Box$

Applying Theorem 4 to the equality function $f$, we immediately obtain that $R(f) = \Omega(\log n)$. Hence, the randomized private-coin communication complexity of equality evaluation is $\Theta(\log n)$

3. Equality Evaluation: Randomized Public-coin Communication Complexity

We ask again the same question in the previous section, but now Alice and Bob have shared randomness. Careful readers may have noticed a subtle point used in the proof of Theorem 4, i.e., the identity $\mathop{\mathbb P}(A\cap B)=\mathop{\mathbb P}(A)\mathop{\mathbb P}(B)$ for events $A,B$ given by independent private randomness. Hence, the $\Omega(\log n)$ lower bound does not directly apply to the public-coin scenario, and the public-coin protocols may perform much better. This is indeed the case – the randomized public-coin communication complexity is actually $\Theta(1)$

There is nothing to prove for the lower bound. For the upper bound, here is a simple procotol: both Alice and Bob draw (the same collection of) independent vectors $z_1, \cdots, z_m\sim \text{Unif}(\mathbb{F}_2^n)$. Then Alice sends an $m$-bit vector $(v_1^\top x, v_2^\top x, \cdots, v_m^\top x)$ to Bob, and Bob claims that $x=y$ if and only if $v_i^\top x = v_i^\top y$ for all $i\in [m]$. Clearly, this protocol errs with probability at most $2^{-m}$ on all inputs. Hence, the randomized public-coin communication complexity of equality evaluation is $\Theta(1)$

4. Equality Evalution: Randomized One-round Private-coin Communication Complexity

Now we consider a final variant of the distributed equality evaluation problem, where we assume a randomized private-coin protocol but restrict the protocol to be only one-round. Recall that the previous blackboard communication protocol allows for multiple rounds of interaction, i.e., Alice may provide feedbacks to the messages transmitted by Bob, and vice versa. In the one-round scenario, Alice and Bob can only send $k$-bit messages once to a central referee, and the target is to minimize the communication cost $k$ while controlling the maximum error probability below $\varepsilon$. Then what is the communication complexity in this scenario?

Surprisingly, the communication complexity is $\Theta(\sqrt{n})$, which is different from all previous ones. We first prove this lower bound. Notice that the one-round protocol has a more concise representation than a binary tree, i.e., any randomized strategy of Alice or Bob may be written as stochastic mappings from $[2^n]$ to $[2^k]$. In particular, the copy-paste property in Lemma 2 becomes insufficient for our purposes. Let $A,B\in [0,1]^{2^n\times 2^k}$ be the transition matrices of the stochastic mappings of Alice and Bob, respectively, and $R\in [0,1]^{2^k\times 2^k}$ be the matrix with $R(m_A,m_B)$ being the probability that the referee outputs $1$ given messages $(m_A, m_B)$. Now $(ARB^\top)_{x,y}$ denotes the probability of outputting $1$ on inputs $(x,y)$, the correctness property ensures that

$\|\text{vec}(ARB^\top - I_{2^n}) \|_\infty \le \varepsilon, \ \ \ \ \ (4)$

where $I_m$ denotes the $m\times m$ identity matrix, and $\|\cdot \|_\infty$ denotes the $L_\infty$ norm of vectors. Hence, (4) implies that $ARB^\top$ is an approximate non-negative matrix factorization of the identity matrix, and $2^k$ is the approximate non-negative rank of the identity matrix. The following result in linear algebra provides a lower bound on $k$

Theorem 5 Let $0<\varepsilon\le 0.1$. If there exist stochastic matrices $A,B\in [0,1]^{N\times m}$ and a non-negative matrix $R\in [0,1]^{m\times m}$ such that $\|\text{\rm vec}(ARB^\top - I_N)\|_\infty\le \varepsilon$, then

$m \ge 2^{\Omega(\sqrt{\log N})}.$

The proof of Theorem 5 is quite involved, and we refer interested readers to the reference cited in bibliographic notes. Now $k=\Omega(\sqrt{n})$ is an immediate consequence of (4) and Theorem 5.

To show the upper bound $k=O(\sqrt{n})$, the idea is to simulate a randomized protocol by deterministic ones. Recall that in the previous section, there is a randomized public-coin protocol which requires $O(1)$ bits of communication. Let $R_1, \cdots, R_m$ be $m$ independent deterministic realizations of the randomized protocol with maximum error probability $\varepsilon$, then by Hoeffding’s inequality, for any fixed input $(x,y)$ we have

$\mathop{\mathbb P}\left(\frac{1}{m}\sum_{i=1}^m 1(R_i(x,y) \neq 1(x=y)) \ge 2\varepsilon \right) \le 2\exp\left(-2m\varepsilon^2\right). \ \ \ \ \ (5)$

Hence, by (5) and the union bound over $2^{2n}$ possible inputs, for fixed $\varepsilon>0$ we may simply choose $m=O(n)$ such that the average probability of error of $m$ deterministic protocols on each inputs is at most $2\varepsilon$

Now the private-coin protocol is as follows: put the above $m$ deterministic protocols in a $\sqrt{m}\times\sqrt{m}$ matrix. Alice draws $i\sim \mathsf{Unif}([\sqrt{m}])$, runs all protocols in $i$-th row and sends all outputs to the referee (together with $i$). Similarly, Bob draws $j\sim \mathsf{Unif}([\sqrt{m}])$ and runs all protocols in $j$-th column. The central referee simply looks at the protocol at the $(i,j)$-th entry, whose probability of error is exactly the average probability of error of these $m$ deterministic protocols, which is at most $2\varepsilon$. Meanwhile, the communication complexity of this protocol is $O(\sqrt{m}) = O(\sqrt{n})$, as desired.

5. Bibliographic Notes

Communication complexity is a central topic in theoretical computer science. The blackboard communication protocol is proposed by Yao in 1979, where some central tools (including Theorems 3 and 4) are available in the excellent book by Kushilevitz and Nisan (1996). The proof of Theorem 5 is given by Newman and Szegedy (1996), and the simulation idea is due to Newman (1991).

1. Andrew Chi-Chih Yao, Some complexity questions related to distributive computing (preliminary report). In Proceedings of the eleventh annual ACM symposium on Theory of computing, pages 209–213. ACM, 1979.
2. Eyal Kushilevitz and Noam Nisan, Communication complexity. Cambridge University Press. 1996.
3. Ilan Newman and Mario Szegedy, Public vs private coin flips in one round communication games, In Proceedings of the 28th annual ACM symposium on Theory of computing, pages 561–570. ACM, 1996.
4. Ilan Newman, Private vs. common random bits in communication complexity. Information processing letters 39.2 (1991): 67–71.

# Lecture 1: Introduction to Information-theoretic Lower Bounds via Puzzles

(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at yjhan@stanford.edu.)

1. Overview

Most scientific research are devoted to developing new tools or findings for challenging problems arising in practice, which help to provide explicit solutions to real problems. These works are typically called constructive, and characterize tasks which are possible to accomplish. However, there is also another line of work which focuses on the other side, i.e., provides fundamental limits on certain tasks. In other words, these works are mostly non-constructive and illustrate targets which are impossible to achieve. Such results are also useful for a number of reasons; we may certify the fundemental limits and optimality of some procedures so that it becomes unnecessary to look for better ones, we may understand the limitations of the current framework, and we may compute the amount of resources necessary to complete certain tasks.

Among different types of fundamental limits (or converse in information theory, lower bounds broadly used in a number of fields), throughout these lectures we will be interested in so-called information-theoretic ones. When it comes to the information-theoretic lower bounds, typically one makes observations (either passively or actively) subject to some prescribed rules and argues that certain tasks cannot be accomplished using any tool solely based on these observations. To show these impossibility results, one argues through the line that the amount of information contained in these observations is insufficient to perform these tasks. Therefore, the limited-information structure is of utmost importance in the previous arguments, and the structure occurs in a number of fields including statistics, machine learning, optimization, reinforcement learning, to name a few; e.g., a given size of samples to perform inference, a given amount of training data to learn a statistical model, a given number of gradient evaluations to optimize a convex function, a given round of adaptivity to search for a good policy, etc.

These lectures are devoted to providing an overview of (classical and modern) tools to establish information-theoretic lower bounds. We will neither be restricted to certain problems (e.g., statistical inference, optimization, bandit problems), nor restrict ourselves to tools in information theory. Instead, we will try to present an extensive set of tools/ideas which are suitable for different problem structures, followed by numerous interdisplinary examples. We will see that certain tools/ideas can be applied to many seemingly unrelated problems.

2. Usefulness of Lower Bounds

We ask the following question: why do we care about lower bounds? We remark that the usefulness of lower bounds is not restricted to providing fundamental limits and telling people which are impossible. In fact, the power of understanding lower bounds lies more on the upper bound in the sense that it helps us to understand the problem structure better, including figuring out the most difficult part of the problem and the most essential part of information which should be made full use of. In other words, the lower bounds are interwined with the upper bounds, and should be in no means treated as an independent component. We elaborate on this point via the following examples from puzzle games.

2.1. Example I: Card Guessing Game

There is a magic show as follows: there is a $52$-card deck thoroughly shuffled by the audience. Alice draws $5$ cards from the top of the deck and reveals $4$ of them to Bob one after one, and then Bob can always correctly guess the remaining card in Alice’s hand. How can Alice and Bob achieve that?

Instead of proposing an explicit strategy of Alice and Bob, let us look at the information-theoretic limit of this game first. Suppose the deck consists of $n$ cards, what are the possible values of $n$ such that such a strategy still exists? To prove such a bound, we need to understand what are the possible strategies. From Alice’s side, her strategy is simply a mapping $f$ from unordered $5$-tuples to an ordered $4$-tuple, with an additional restriction that $f(A)\subseteq A$ for any unordered $5$-tuples $A\subseteq [n]$. From Bob’s side, his strategy is another mapping $g$ from ordered $4$-tuples to a specific card in $[n]$. Finally, the correcntess of Bob’s guessing corresponds to

$\{ g(f(A))\} \cup f(A) = A, \qquad \forall A\subseteq [n], |A|=5. \ \ \ \ \ (1)$

An equivalent way to state (1) is that, Bob can recover all $5$ cards after observing the first ordered 4 cards; write $h$ for this strategy. Now we come to our information-theoretic observation: do ordered $4$-tuples contain enough information to recover any unordered $5$-tuples? Here we will quantify the information as cardinality: mathematically speaking, the mapping $h$ must be surjective, so $|\text{dom}(h)|\ge |\text{range}(h)|$. In other words,

$4!\cdot \binom{n}{4} \ge \binom{n}{5} \Longrightarrow n \le 124. \ \ \ \ \ (2)$

Hence, this magic show will fail for any deck with more than $124$ cards. The next question is that, is $n=124$ achievable? Now we are seeking an upper bound, but the previous lower bound still helps – the equality in (2) holds implies that $h$ must be a bijection! Keeping the bijective nature of $h$ in mind, it then becomes not too hard to propose the following strategy which works. Label all cards by $1,2,\cdots,124$, and suppose the chosen cards be $c_1. Alice computes the sum $s=\sum_{i=1}^5 c_i \pmod 5$, keeps $c_s$ and reveals all others $d_1 to Bob. Let $t=\sum_{i=1}^4 d_i \pmod 5$, then for Bob to decode $c_s$, he only needs to solve the following equation:

$c = - t + \min\{i\in [4]: c

where we define $\min\emptyset = 5$. It is easy to show that (3) always admits $24$ solutions in $[124]$, and any solution can be encoded using $4!=24$ different permutations of the revealed cards $(d_1,\cdots,d_4)$

2.2. Example II: Coin Flipping Game

Now Alice and Bob play another game cooperatively: consider $n$ coins on the table each of which may be head or tail, and the initial state is unknown to both Alice and Bob. The audience tells a number in $[n]$ to Alice, and she comes to the table, looks at all coins and flips one of them. Bob then comes to the table and tells the number told by the audience correctly. The question is: for which values of $n$ can they come up with such a strategy?

To answer this question, again we need to understand the structure of all strategies. Let $s\in \{0,1\}^n$ be the initial state of the coins, which may be arbitrary. We may also identify $s$ as vertices of an $n$-dimensional hypercube. Let $k\in [n]$ be the number given by the audience. Alice’s strategy is then a mapping $f: \{0,1\}^n\times [n]\rightarrow \{e_1,\cdots,e_n\}$, choosing a coin to flip (where $e_i$ denotes the $i$-th canonical vector) based on her knowledge. Then Bob’s strategy is a mapping $g: \{0,1\}^n\rightarrow [n]$, and the correctness condition implies that

$g(s \oplus f(s,k)) = k, \qquad \forall s\in\{0,1\}^n, k\in[n]. \ \ \ \ \ (4)$

It is clear from (4) that the map $k\mapsto f(s,k)$ for any $s$ must be injective. By cardinality arguments, this map is further bijective. Now the problem structure becomes more transparent: for each $k\in [n]$, let $g^{-1}(k)\subseteq \{0,1\}^n$ be the states where Bob will claim $k$. We call these vertices in the hypercube have $k$-th color. Then (4) states that, for each vertex $s\in \{0,1\}^n$, its $n$ different neighbors $\{s\oplus e_i \}_{i\in [n]}$ must have $n$ different colors. The converse is also true: if there exists such a coloring scheme, then we may find $f,g$ such that (4) holds.

Now when does such a coloring scheme exist? A simple idea is that, the number of vertices with any color should be the same by double counting arguments. Hence, we must have $n \mid 2^n$, which implies that $n=2^m$ must be a power of $2$. Based on the coloring intuition given by the lower bound, the strategy also becomes simple: consider any finite Abelian group $G=\{g_1,\cdots,g_n\}$ with $n$ elements, we identify colors as elements of $G$ and let $s\in \{0,1\}^n$ have color

$G(s) = s_1g_1 + s_2g_2 + \cdots + s_ng_n \in G. \ \ \ \ \ (5)$

It is straightforward to verify that (4) holds if and only if $G$ has characteristic $2$, i.e., $2g=0$ for any $g\in G$. By algebra, such an Abelian group exists if and only if $n$ is a power of $2$, e.g., $G=\mathbb{F}_2^m$ when $n=2^m$ (necessity can be shown via taking quotients $G/\{0,g\}$ repeatedly). Hence, such a strategy exists if and only if $n=2^m$ is a power of $2$

2.3. Example III: Hat Guessing Game

Now we look at a hat-guessing game with a more complicated information structure. There are $15$ people who are sitting together, each of whom wears a hat of color red or blue, independently with probability $\frac{1}{2}$. They cannot talk to each other, and they can see the color of all hats except for their own. Now each of them simultaneously chooses to guess the color of his own hat, or chooses to pass. They win if at least one person guesses the color, and all guesses are correct. What is their optimal winning probability of this game?

The answer to this question is $\frac{15}{16}$, which is shocking at the first appearance because it greatly outperforms $\frac{1}{2}$ achieved by the naive guessing scheme where only the first person guesses. To understand this improvement, we first think about the fundamental limit of this problem. Let $s\in \{0,1\}^{15}$ be the sequence of hat colors, and $s$ is called success if they win under state $s$, and failure if fail. Clearly, at each success state $s$, there must be some person $i\in [15]$ who makes the correct guess. However, since he cannot see $s_i$, he will make the same guess even if $s_i$ is flipped, and this guess becomes wrong in the new state. This argument seems to suggest that the $\frac{1}{2}$ winning probability is not improvable, for each success state corresponds to a failure state and therefore at most consistutes half of the states. However, this intuition is wrong since multiple success states may correspond to the same failure state.

Mathematically speaking, let $S$ be the success states and $F$ be the failure states. By the previous argument, there exists a map $f: S\rightarrow F$ which only flips one coordinate. Since there are at most $15$ coordinates which can be flipped, we have $|f^{-1}(s)|\le 15$ for each $s\in F$. Consequently, $|S|\le 15|F|$, and the winning probability is

$\frac{|S|}{|S|+|F|} \le \frac{15}{16}. \ \ \ \ \ (6)$

The above lower bound argument shows that, in order to achieve the optimal winning probability, we must have $|f^{-1}(s)|=15$ for any $s\in F$. Consequently, in any failure state, it must be the case where everyone makes wrong guesses. This crucial observation motivates the following strategy: identify people as elements of $\mathbb{F}_2^4 -\{0\}$. Each person $v$ computes the sums $s_1, s_2$ of the people indices with red and blue hats, respectively. If $v=s_1$, he claims blue; if $v=s_2$, he claims red; otherwise he passes. Then it becomes obvious that a failure occurs if and only if the indices of people with red hats sum into $0\in \mathbb{F}_2^4$, whose probability is $(\frac{1}{2})^4 = \frac{1}{16}$

3. Bibliographic Notes

There are tons of information-theoretic analysis used to solve puzzles. We recommend two books if the readers would like to learn more about funny puzzles:

1. Peter Winkler, Mathematical puzzles: a connoisseur’s collection. AK Peters/CRC Press, 2003.
2. Jiri Matousek, Thirty-three miniatures: Mathematical and Algorithmic applications of Linear Algebra. Providence, RI: American Mathematical Society, 2010.