By David Lin (linkewei :at: stanford.edu)
In this article, we explore some applications of information-theoretic methods to various combinatorial problems. The various approaches are best thought of as an extension of the probabilistic method, initiated by Paul Erdős in the 1940s, where a probability distribution imposed on combinatorial objects (e.g. a randomly selected element of a set) can give rise to deterministic conclusions that are otherwise difficult to obtain.
The probabilistic method usually depends on computing the expected value (where we obtain a “typical” value), or showing that the probability of an event is positive (i.e. the probability of constructing a desired object). Information theory forms a natural extension to this method by offering additional quantities to consider, since entropy and mutual information possess intuitive significance once a random distribution is imposed on the objects.
## Information Theory Preliminaries

When we construct a (discrete) random variable $X$ taking values in a set $\mathcal{X}$, the entropy is defined as
$$H(X) = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)},$$
where all logs are base 2 and by convention $0 \log \frac{1}{0} = 0$. Following the analogy, if we have a second random variable $Y$ taking values in $\mathcal{Y}$, we have the concepts of joint entropy $H(X, Y)$ and conditional entropy $H(Y \mid X)$:
$$H(X, Y) = \sum_{x, y} p(x, y) \log \frac{1}{p(x, y)}, \qquad H(Y \mid X) = \sum_{x, y} p(x, y) \log \frac{1}{p(y \mid x)}.$$
There is a natural information-based interpretation for the above. $H(X, Y)$ is a measure of the combined information of $X$ and $Y$ (or equivalently, the measure of the information of $(X, Y)$ as a joint variable), while $H(Y \mid X)$ is the measure of the information expected to remain in $Y$ once $X$ is revealed.
These two quantities are also related by the chain rule:
Fact. (Chain Rule) $H(X, Y) = H(X) + H(Y \mid X)$.
(Interpretation: the information from revealing $X$ and $Y$ together is equivalent to the information from revealing $X$, then revealing $Y$ given $X$.)
A key inequality is as follows:
Fact. (Data Processing Inequality) $H(Y \mid X, Z) \le H(Y \mid Z)$. (Interpretation: revealing additional information cannot increase the remaining entropy.)
Remark. As a quick corollary, we have the much simpler $H(Y \mid X) \le H(Y)$, which when combined with the chain rule gives us the subadditivity of entropy: $H(X, Y) \le H(X) + H(Y)$.
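These identities are easy to check numerically. Below is a small sanity check of the chain rule, the conditioning bound, and subadditivity on an arbitrary joint distribution; the code and helper names (`H`, `marginal`, `H_cond`) are my own illustration, not part of the article.

```python
from collections import Counter
from math import log2

# A joint distribution p(x, y), chosen arbitrarily (sums to 1).
joint = {(0, 0): 0.2, (0, 1): 0.1, (0, 2): 0.1,
         (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.3}

def H(dist):
    """Entropy (base 2) of a distribution given as {outcome: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    """Marginal distribution along one coordinate of a joint distribution."""
    m = Counter()
    for outcome, p in joint.items():
        m[outcome[axis]] += p
    return m

def H_cond(joint):
    """Conditional entropy H(Y | X) = sum_{x,y} p(x,y) log2(p(x) / p(x,y))."""
    px = marginal(joint, 0)
    return sum(p * log2(px[x] / p) for (x, y), p in joint.items() if p > 0)

H_XY, H_X, H_Y = H(joint), H(marginal(joint, 0)), H(marginal(joint, 1))

assert abs(H_XY - (H_X + H_cond(joint))) < 1e-12  # chain rule
assert H_cond(joint) <= H_Y + 1e-12               # conditioning reduces entropy
assert H_XY <= H_X + H_Y + 1e-12                  # subadditivity
```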
## Entropic Lower Bounds

A basic property is that the entropy gives a lower bound on (the logarithm of) the number of distinct values $X$ can take:
$$H(X) \le \log |\mathcal{X}|, \qquad \text{equivalently} \qquad |\mathcal{X}| \ge 2^{H(X)}.$$
This immediately gives us a way to connect entropy and counting: we can bound any count we want by setting up some distribution over the desired object, and then estimating the entropy.
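As a tiny illustration of the counting bound (my own code; the example distribution is arbitrary):

```python
# Any distribution over a set has entropy at most log2 of the set's size,
# so the support size is at least 2^H, with equality for the uniform case.
from math import log2

def H(dist):
    """Entropy (base 2) of a distribution given as {outcome: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

dist = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}  # H = 1.75 bits
assert 2 ** H(dist) <= len(dist) + 1e-9               # |supp| >= 2^H

uniform = {x: 1 / len(dist) for x in dist}
assert abs(H(uniform) - log2(len(dist))) < 1e-12      # equality when uniform
```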
Problem. Let $G$ be a bipartite graph with vertex partitions $A$ and $B$ of size $m$ and $n$ respectively. If $P_3$ is the set of length-3 paths $x_1 x_2 x_3 x_4$ with $x_1 \in A$ (edges may be repeated), then
$$|P_3| \ge \frac{|E|^3}{mn}.$$
Solution. We will select a random length-3 path as follows: the edge $(X_1, X_2)$ (with $X_1 \in A$, $X_2 \in B$) is uniformly selected among all edges, then $X_3$ is uniformly selected among all edges at $X_2$, and $X_4$ is uniformly selected among all edges at $X_3$, independently of $(X_1, X_2)$.
First we compute $H(X_3 \mid X_2)$: since $\Pr[X_2 = v] = d(v)/|E|$ and $X_3$ is uniform among the $d(v)$ neighbors of $v$, the pair $(X_2, X_3)$ is also uniform over the edges, so
$$H(X_3 \mid X_2) = H(X_2, X_3) - H(X_2) = \log |E| - H(X_2).$$
Similarly $H(X_4 \mid X_3) = \log |E| - H(X_3)$. Then:
$$\log |P_3| \ge H(X_1, X_2, X_3, X_4) = H(X_1, X_2) + H(X_3 \mid X_2) + H(X_4 \mid X_3) = 3 \log |E| - H(X_2) - H(X_3) \ge 3 \log |E| - \log n - \log m,$$
using $H(X_2) \le \log n$ and $H(X_3) \le \log m$; exponentiating gives $|P_3| \ge |E|^3/(mn)$.
- This is clearly tight for complete bipartite graphs.
- The way to invent this solution is firstly to notice the naive argument that gives the RHS: picking $e_1, e_2, e_3$ uniformly at random among $E$, $e_1$ and $e_2$ share the $B$-vertex with probability roughly $1/n$, while $e_2$ and $e_3$ share the $A$-vertex with probability roughly $1/m$. Hence the probability of forming a valid path is $\frac{1}{mn}$ if we make the naive assumption that the two events are independent, suggesting a count of about $|E|^3/(mn)$.
This leads us to the correct solution: when we impose such a distribution on the set of 3-paths, the entropic counting bound gives us the right answer.
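The bound can also be confirmed by brute force on small graphs. The sketch below (my own code; the helper `count_walks` is invented for the check) counts length-3 walks in random bipartite graphs and in $K_{2,2}$, where the bound is tight:

```python
# Empirical check of |P3| >= |E|^3 / (mn) for bipartite graphs with parts
# of size m and n; walks x1-x2-x3-x4 may repeat edges.
import random

def count_walks(edges, m, n):
    """Number of walks x1-x2-x3-x4 with x1, x3 in [m] and x2, x4 in [n]."""
    adj_a = {a: [b for (x, b) in edges if x == a] for a in range(m)}
    adj_b = {b: [a for (a, y) in edges if y == b] for b in range(n)}
    return sum(1 for (x1, x2) in edges
                 for x3 in adj_b[x2]
                 for x4 in adj_a[x3])

# Tight for complete bipartite graphs: here |E|^3/(mn) = 4^3/4 = 16.
k22 = [(a, b) for a in range(2) for b in range(2)]
assert count_walks(k22, 2, 2) == 16

random.seed(0)
for _ in range(50):
    m, n = random.randint(2, 5), random.randint(2, 5)
    edges = [(a, b) for a in range(m) for b in range(n)
             if random.random() < 0.6]
    assert count_walks(edges, m, n) >= len(edges) ** 3 / (m * n) - 1e-9
```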
Problem. Given finite sets $A_1, \dots, A_n$ and functions $f_1, \dots, f_{n-1}$, where $f_i$ assigns to each $a \in A_i$ a set $f_i(a) \subseteq A_{i+1}$ of at least $k_i \ge 1$ allowed successors, a vector $(a_1, \dots, a_n) \in A_1 \times \cdots \times A_n$ is called nice if $a_{i+1} \in f_i(a_i)$ for each $1 \le i \le n-1$. Show that the number of nice vectors is at least
$$|A_1| \cdot k_1 k_2 \cdots k_{n-1}.$$
Solution. Select $X_1 \in A_1$ uniformly at random, and select $X_2$ among the possible choices $f_1(X_1)$ (given $X_1$) uniformly at random, and so on. The probability of $X = (X_1, \dots, X_n)$ being any particular nice vector $x$ is fixed at
$$p(x) = \frac{1}{|A_1|} \prod_{i=1}^{n-1} \frac{1}{|f_i(x_i)|} \le \frac{1}{|A_1| \, k_1 \cdots k_{n-1}}.$$
So:
$$\#\{\text{nice vectors}\} = \mathbb{E}\left[\frac{1}{p(X)}\right] \ge 2^{\mathbb{E}[\log(1/p(X))]} = 2^{H(X)} \ge |A_1| \, k_1 \cdots k_{n-1},$$
which is what we wanted.
Remark. We made use of the QM-AM-GM-HM inequality for random variables: for a positive random variable $Z$,
$$\sqrt{\mathbb{E}[Z^2]} \ge \mathbb{E}[Z] \ge 2^{\mathbb{E}[\log Z]} \ge \frac{1}{\mathbb{E}[1/Z]},$$
here applied in its AM-GM form with $Z = 1/p(X)$.
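The chain of means can be verified numerically; the sketch below (my own illustrative code) compares the quadratic, arithmetic, geometric, and harmonic means of random positive variables:

```python
# Check sqrt(E[Z^2]) >= E[Z] >= 2^{E[log2 Z]} >= 1 / E[1/Z] on random
# finitely supported positive random variables Z.
import random
from math import log2, sqrt

random.seed(1)
for _ in range(100):
    values = [random.uniform(0.1, 10) for _ in range(5)]
    weights = [random.random() + 0.01 for _ in range(5)]
    total = sum(weights)
    probs = [w / total for w in weights]       # a random distribution
    qm = sqrt(sum(p * z * z for p, z in zip(probs, values)))
    am = sum(p * z for p, z in zip(probs, values))
    gm = 2 ** sum(p * log2(z) for p, z in zip(probs, values))
    hm = 1 / sum(p / z for p, z in zip(probs, values))
    assert qm >= am - 1e-9
    assert am >= gm - 1e-9
    assert gm >= hm - 1e-9
```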
## Entropic Upper Bounds

An alternative argument provides us a way to obtain an upper bound instead of a lower bound. We start with the following observation: if $X$ is the uniform distribution on $\mathcal{X}$, then the upper bound $H(X) \le \log |\mathcal{X}|$ is in fact attained. This suggests that for an upper bound, we should start with the uniform distribution and then bound the entropy from above by various methods.
A naive way to do so (especially when joint variables are present) is to simply make use of subadditivity. If we are more careful, we could also reveal each variable sequentially and track the incremental information revealed by each variable, expressed by the chain rule:
$$H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1}).$$
Problem. (Binomial tail bounding) Show the bound (for $k \le n/2$)
$$\sum_{i=0}^{k} \binom{n}{i} \le 2^{n h(k/n)},$$
where $h(x) = x \log \frac{1}{x} + (1-x) \log \frac{1}{1-x}$ is the binary entropy function.
Solution. Consider a random $n$-bit string $X = (X_1, \dots, X_n)$ uniformly selected from the set $S$ of strings with at most $k$ 1's. We have
$$\log \sum_{i=0}^{k} \binom{n}{i} = \log |S| = H(X) \le \sum_{i=1}^{n} H(X_i) = \sum_{i=1}^{n} h(\Pr[X_i = 1]) \le n \, h(k/n),$$
since by symmetry $\Pr[X_i = 1] = \mathbb{E}[\#\text{1's}]/n \le k/n \le 1/2$ and $h$ is monotonically increasing on $[0, 1/2]$.
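This bound is easy to test exhaustively for small $n$; a quick check of my own, with `h` mirroring the binary entropy function above:

```python
# Verify sum_{i<=k} C(n, i) <= 2^{n h(k/n)} for all k <= n/2 and small n.
from math import comb, log2

def h(p):
    """Binary entropy in bits, with h(0) = h(1) = 0 by convention."""
    if p in (0, 1):
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

for n in range(1, 40):
    for k in range(n // 2 + 1):
        tail = sum(comb(n, i) for i in range(k + 1))
        assert tail <= 2 ** (n * h(k / n)) + 1e-6
```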
Problem. (Discrete Loomis-Whitney) Let $A$ be a finite subset of $\mathbb{Z}^3$, and let $A_{xy}$ be the projection of $A$ onto the $xy$-plane (with $A_{yz}$ and $A_{xz}$ defined similarly). Then:
$$|A|^2 \le |A_{xy}| \, |A_{yz}| \, |A_{xz}|.$$
Solution. The following entropy estimate is useful (a special case of Han's inequality):
$$2 H(X, Y, Z) \le H(X, Y) + H(Y, Z) + H(X, Z).$$
This can be established by a short computation:
$$H(X, Y) = H(X) + H(Y \mid X),$$
$$H(Y, Z) = H(Y) + H(Z \mid Y) \ge H(Y \mid X) + H(Z \mid X, Y),$$
$$H(X, Z) = H(X) + H(Z \mid X) \ge H(X) + H(Z \mid X, Y);$$
summing the three lines, the right-hand side is twice $H(X) + H(Y \mid X) + H(Z \mid X, Y) = H(X, Y, Z)$. Now select $(X, Y, Z)$ uniformly at random from $A$, noting that $(X, Y)$ takes values in $A_{xy}$, etc. Then
$$2 \log |A| = 2 H(X, Y, Z) \le H(X, Y) + H(Y, Z) + H(X, Z) \le \log |A_{xy}| + \log |A_{yz}| + \log |A_{xz}|,$$
which is exactly the log of the inequality we needed to show.
- This was proposed at the International Mathematical Olympiad 1992, a contest for high school students. How might they have solved it? Here's a possible approach: Alternative Solution. For each $z$, let $A_z = \{(x, y) : (x, y, z) \in A\}$ be the slice of $A$ at height $z$, and let $B_z \subseteq A_{xz}$ and $C_z \subseteq A_{yz}$ be the corresponding slices of the projections. We note that there is an injection from $A_z$ to pairs with the same $z$-coordinate, $(x, y) \mapsto ((x, z), (y, z)) \in B_z \times C_z$, and also $A_z \subseteq A_{xy}$, so $|A_z| \le \sqrt{|A_{xy}|} \sqrt{|B_z| \, |C_z|}$. By the Cauchy-Schwarz inequality:
$$|A| = \sum_z |A_z| \le \sqrt{|A_{xy}|} \sum_z \sqrt{|B_z|} \sqrt{|C_z|} \le \sqrt{|A_{xy}|} \sqrt{\sum_z |B_z|} \sqrt{\sum_z |C_z|} = \sqrt{|A_{xy}| \, |A_{xz}| \, |A_{yz}|}.$$
- What are the equality cases? Looking at the computation above, we require at least:
$$H(Y) = H(Y \mid X), \qquad H(Z \mid X) = H(Z \mid Y) = H(Z \mid X, Y).$$
In particular, the second equation tells us that the values for $Z$ should have the same distribution conditioned on the full point as conditioned on the projection onto either coordinate alone. The first tells us that, combined with the uniformity of $(X, Y)$ on $A_{xy}$, the projection $A_{xy}$ forms a grid, so equality holds for "cuboidal sets" (if you look closely, the alternative solution suggests the same equality case).
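The inequality and its equality case can be checked by brute force; in the sketch below (my own code, with the helper `projections` invented for the check) we test random subsets of a small cube and one "cuboidal" box:

```python
# Check discrete Loomis-Whitney: |A|^2 <= |A_xy| * |A_yz| * |A_xz| for
# random finite A in Z^3, with equality for a box (cuboidal set).
import itertools
import random

def projections(A):
    """The three coordinate-plane projections of a set of integer triples."""
    Axy = {(x, y) for (x, y, z) in A}
    Ayz = {(y, z) for (x, y, z) in A}
    Axz = {(x, z) for (x, y, z) in A}
    return Axy, Ayz, Axz

random.seed(2)
pts = list(itertools.product(range(4), repeat=3))
for _ in range(50):
    A = set(random.sample(pts, random.randint(1, 30)))
    Axy, Ayz, Axz = projections(A)
    assert len(A) ** 2 <= len(Axy) * len(Ayz) * len(Axz)

box = set(itertools.product(range(2), range(3), range(4)))  # 24 points
Axy, Ayz, Axz = projections(box)
assert len(box) ** 2 == len(Axy) * len(Ayz) * len(Axz)      # equality: 576
```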
Problem. Given a graph $G$, call an ordered triple of vertices $(x, y, z)$:

- a triangle if $xy$, $yz$, $zx$ are edges
- a vee if $xy$, $yz$ are edges

Show that the number of triangles is at most the number of vees.
Solution. Select $(X, Y, Z)$ uniformly at random from the set $T$ of all triangles. Then:
$$\log |T| = H(X, Y, Z) = H(X) + H(Y \mid X) + H(Z \mid X, Y),$$
where the last line has an interpretation as the entropy decrements from revealing $X$, $Y$, $Z$ in that order. However, we can ignore the dependence of $Z$ on $X$ (i.e. $H(Z \mid X, Y) \le H(Z \mid Y) = H(Y \mid X)$ by cyclic symmetry) to get an upper bound:
$$\log |T| \le H(X) + 2 H(Y \mid X).$$
This suggests constructing a distribution on vees as follows: sample $(X', Y')$ with the distribution of $(X, Y)$ above, then sample $Z'$ from the conditional distribution of $Y$ given $X = Y'$, independently of $X'$. The result is that $H(X', Y', Z')$ is precisely $H(X) + 2 H(Y \mid X)$, but it must be upper bounded by the log of the number of vees.
Discussion. Ordinarily, when we relax $H(Z \mid X, Y)$ to $H(Z \mid Y)$, we would get a distribution on length-2 paths (like $x$-$y$-$z$), but we know that every triangle is a length-2 path (and so by following through, the conclusion would have been trivial). However, the magic happens when we make use of cyclic symmetry to "move" $H(Z \mid Y)$ to $H(Y \mid X)$.
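As the discussion notes, the inequality itself is combinatorially clear (every triangle is in particular a vee), so the following brute-force check of my own is only a sanity test of the counts, on $K_3$ and on random graphs:

```python
# Count ordered triples (x, y, z): vees have edges xy and yz; triangles
# additionally have the edge zx. Check #triangles <= #vees.
import itertools
import random

def count_triangles_and_vees(adj, n):
    """Return (#ordered triangles, #ordered vees) for an adjacency matrix."""
    triangles = vees = 0
    for x, y, z in itertools.product(range(n), repeat=3):
        if adj[x][y] and adj[y][z]:
            vees += 1
            if adj[z][x]:
                triangles += 1
    return triangles, vees

# On K_3 there are 6 ordered triangles and 12 ordered vees.
k3 = [[i != j for j in range(3)] for i in range(3)]
assert count_triangles_and_vees(k3, 3) == (6, 12)

random.seed(3)
for _ in range(20):
    n = random.randint(3, 7)
    adj = [[False] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        if random.random() < 0.5:
            adj[i][j] = adj[j][i] = True
    t, v = count_triangles_and_vees(adj, n)
    assert t <= v
```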
Aside from entropy, other information-theoretic concepts also have some relevance to combinatorial problems. For instance, Moser used a compression argument to prove the Lovász local lemma, which guarantees that for a collection of independent variables and a sequence of "local" (i.e. depending only on a few variables) and "rare" bad events, all bad events are avoided with some positive probability.
I would like to thank Yanjun for his useful suggestions and advice for this project, and Professor Weissman for teaching this course and introducing me to the wonderful world of information theory.