Basic Information Theory
Mort Yao
Basic information theory:
- Thomas M. Cover and Joy A. Thomas. Elements of Information Theory, 2nd edition.
Definition 1. (Information entropy; Shannon entropy) Let \(p(x)\) be the probability distribution of a discrete random variable \(X\) taking values from \(\mathcal{X}\). The entropy of random variable \(X\) is defined as \[\operatorname{H}(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)\] where \(0 \log 0\) is taken to be \(0\).
When the logarithms in the formula are taken to the base 2, the unit of information is called the shannon (or more commonly bit, symbolically \(\mathrm{Sh}\)). If a message is made of a sequence of bits, with all possible bit strings equally likely, the message’s information content expressed in shannons is equal to the length of the bit string.
When base-10 logarithms are used, the unit of information is called the hartley (or ban). When natural logarithms are used, the unit of information is called the nat.
\[1\,\mathrm{Sh} \approx 0.693\,\mathrm{nat} \approx 0.301\,\mathrm{Hart}\]
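As a concrete illustration, here is a minimal sketch (plain Python 3, standard library only; the function name `entropy` and the example distribution are chosen for illustration, not taken from the text) that computes the entropy of a small distribution in shannons, nats and hartleys and recovers the unit conversions above.

```python
import math

def entropy(p, base=2):
    """Entropy of a probability vector p; terms with p(x) = 0 contribute 0."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

p = [0.5, 0.25, 0.125, 0.125]          # an arbitrary example distribution
h_sh   = entropy(p, base=2)            # shannons (bits)
h_nat  = entropy(p, base=math.e)       # nats
h_hart = entropy(p, base=10)           # hartleys

print(h_sh, h_nat, h_hart)             # ~1.75, ~1.213, ~0.527
print(h_nat / h_sh, h_hart / h_sh)     # ~0.693 nat/Sh, ~0.301 Hart/Sh
```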
Definition 2. (Binary entropy function) Let \(p\) be the bias of a Bernoulli random variable \(X\). The entropy of the distribution of \(X\) is given by \[\operatorname{H}_b(p) = -p \log p - (1-p) \log (1-p)\] where \(0 \log 0\) is taken to be \(0\).
Notice that the binary entropy function \(\operatorname{H}_b(p)\) takes a real number \(p \in [0,1]\) as its argument, rather than a random variable \(X\) or a probability distribution \(p(x)\).
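A corresponding sketch of \(\operatorname{H}_b(p)\) (again plain Python; `binary_entropy` is an illustrative name):

```python
import math

def binary_entropy(p, base=2):
    """H_b(p), in shannons by default, with 0 log 0 taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p, base) - (1 - p) * math.log(1 - p, base)

print(binary_entropy(0.5))   # 1.0  -- a fair coin carries one shannon
print(binary_entropy(0.11))  # ~0.5 -- a biased coin carries less
print(binary_entropy(0.0))   # 0.0  -- a deterministic outcome carries none
```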
Lemma 3. \[\frac{1}{n+1} e^{n \operatorname{H}_b(\frac{k}{n})} \leq \binom{n}{k} \leq e^{n \operatorname{H}_b(\frac{k}{n})}\] where the binary entropy \(\operatorname{H}_b\) is measured in nats (natural logarithm), matching the base-\(e\) exponential.
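The bound can be sanity-checked numerically. The sketch below (standard-library Python 3.8+ for `math.comb`; a spot check, not a proof) evaluates both sides with \(\operatorname{H}_b\) in nats for a few small values of \(n\).

```python
import math

def binary_entropy_nats(p):
    """H_b(p) in nats, with 0 log 0 taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

for n in (10, 20, 50):
    for k in range(n + 1):
        bound = math.exp(n * binary_entropy_nats(k / n))
        assert bound / (n + 1) <= math.comb(n, k) <= bound
print("Lemma 3 bounds hold for all tested (n, k)")
```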
Definition 4. (Joint entropy) Let \(p(x,y)\) be the joint probability of \(X=x\) and \(Y=y\) occurring together. The joint entropy of random variables \(X\) and \(Y\) is defined as \[\operatorname{H}(X,Y) = -\sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)\] where \(0 \log 0\) is taken to be \(0\).
Lemma 5. \[\max[\operatorname{H}(X), \operatorname{H}(Y)] \leq \operatorname{H}(X,Y) \leq \operatorname{H}(X) + \operatorname{H}(Y)\]
If \(X\) and \(Y\) are independent, we have \(\operatorname{H}(X,Y) = \operatorname{H}(X) + \operatorname{H}(Y)\), since \(p(x,y) = p(x)p(y)\).
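The following sketch (standard-library Python; the joint distribution `p_xy` is an arbitrary example, not from the text) computes \(\operatorname{H}(X)\), \(\operatorname{H}(Y)\) and \(\operatorname{H}(X,Y)\) from a joint distribution given as a dict, and checks both Lemma 5 and additivity under independence.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A dependent pair of binary variables: X = Y with probability 3/4.
p_xy = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}

p_x, p_y = {}, {}                       # marginal distributions
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

h_x, h_y, h_xy = entropy(p_x.values()), entropy(p_y.values()), entropy(p_xy.values())
assert max(h_x, h_y) <= h_xy <= h_x + h_y          # Lemma 5

# For an independent pair with the same marginals, H(X, Y) = H(X) + H(Y).
p_indep = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
assert math.isclose(entropy(p_indep.values()), h_x + h_y)
```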
Definition 6. (Conditional entropy) Let \(p(x,y)\) be the joint probability of \(X=x\) and \(Y=y\) occurring together. The conditional entropy of random variable \(X\) conditioned on \(Y\) is defined as \[\operatorname{H}(X|Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(y)}{p(x,y)}\] where \(0 \log \frac{0}{0} = 0\) and \(0 \log \frac{q}{0} = 0\).
Lemma 7. (Chain rule for conditional entropy) \[\operatorname{H}(X|Y) = \operatorname{H}(X,Y) - \operatorname{H}(Y)\]
Theorem 8. (Bayes’ rule for conditional entropy) \[\operatorname{H}(X|Y) + \operatorname{H}(Y) = \operatorname{H}(Y|X) + \operatorname{H}(X)\]
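Continuing with the same example joint distribution, the sketch below computes \(\operatorname{H}(X|Y)\) directly from Definition 6 and checks the chain rule (Lemma 7) and Bayes' rule (Theorem 8).

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}
p_x = {0: 0.5, 1: 0.5}   # marginal of X
p_y = {0: 0.5, 1: 0.5}   # marginal of Y

# H(X|Y) = sum_{x,y} p(x,y) log( p(y) / p(x,y) ), and symmetrically for H(Y|X).
h_x_given_y = sum(p * math.log2(p_y[y] / p) for (x, y), p in p_xy.items() if p > 0)
h_y_given_x = sum(p * math.log2(p_x[x] / p) for (x, y), p in p_xy.items() if p > 0)

h_x, h_y, h_xy = entropy(p_x.values()), entropy(p_y.values()), entropy(p_xy.values())

assert math.isclose(h_x_given_y, h_xy - h_y)               # chain rule (Lemma 7)
assert math.isclose(h_x_given_y + h_y, h_y_given_x + h_x)  # Bayes' rule (Theorem 8)
print(h_x_given_y)                                         # ~0.811 Sh
```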
Definition 9. (Mutual information) Let \(p(x,y)\) be the joint probability of \(X=x\) and \(Y=y\) occurring together. The mutual information of random variables \(X\) and \(Y\) is defined as \[\operatorname{I}(X;Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\]
Mutual information is a measure of the inherent dependence expressed in the joint distribution of \(X\) and \(Y\) relative to the joint distribution of \(X\) and \(Y\) under the assumption of independence. Therefore, \(\operatorname{I}(X;Y) = 0\) if and only if \(X\) and \(Y\) are independent.
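A sketch of Definition 9 (standard-library Python; `mutual_information` and the two example distributions are chosen for illustration): the dependent joint distribution from the earlier examples yields a positive \(\operatorname{I}(X;Y)\), while a product distribution yields zero.

```python
import math

def mutual_information(p_xy):
    """I(X;Y) from a joint distribution given as a dict keyed by (x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

p_dep   = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}
p_indep = {(0, 0): 0.25,  (0, 1): 0.25,  (1, 0): 0.25,  (1, 1): 0.25}

print(mutual_information(p_dep))                   # ~0.189 Sh
print(abs(mutual_information(p_indep)) < 1e-12)    # True: independent => I = 0
```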
Lemma 10. \[\begin{align*} \operatorname{I}(X;Y) &= \operatorname{H}(X) + \operatorname{H}(Y) - \operatorname{H}(X,Y) \\ &= \operatorname{H}(X) - \operatorname{H}(X|Y) = \operatorname{H}(Y) - \operatorname{H}(Y|X) \\ &= \operatorname{H}(X,Y) - \operatorname{H}(X|Y) - \operatorname{H}(Y|X) \end{align*}\]
Definition 11. (Relative entropy; discrimination information; information gain; Kullback-Leibler divergence) Let \(p(x)\) and \(q(x)\) be two probability distributions of a random variable \(X\). The Kullback-Leibler divergence (or relative entropy) of \(p\) with respect to \(q\) is defined as \[\operatorname{D}_\mathrm{KL}(p\|q) = \operatorname{E}_p\left[ \log\frac{p(X)}{q(X)} \right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \] where \(0 \log \frac{0}{0} = 0\), \(0 \log \frac{0}{q} = 0\) and \(p \log \frac{p}{0} = \infty\).
In words, the Kullback-Leibler divergence is the expectation of the logarithmic difference between the probabilities \(p(x)\) and \(q(x)\), where the expectation is taken using the probabilities \(p(x)\). Although it is not symmetric and does not satisfy the triangle inequality, it is commonly used as a way of measuring the distance between two probability distributions.
Mutual information can also be expressed as a Kullback–Leibler divergence: \[\operatorname{I}(X;Y) = \operatorname{D}_\mathrm{KL}(p(x,y)\|p(x)p(y))\]
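The sketch below (standard-library Python; `kl_divergence` is an illustrative name) computes \(\operatorname{D}_\mathrm{KL}\) from Definition 11 and checks numerically that, applied to the example joint distribution and the product of its marginals, it reproduces \(\operatorname{I}(X;Y)\) as computed via Lemma 10.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for dicts over a common index set; 0 log(0/q) taken as 0."""
    return sum(p[i] * math.log2(p[i] / q[i]) for i in p if p[i] > 0)

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}
p_x, p_y = {0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}
product = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

# I(X;Y) via Lemma 10: H(X) + H(Y) - H(X, Y)
mi = entropy(p_x.values()) + entropy(p_y.values()) - entropy(p_xy.values())
print(math.isclose(kl_divergence(p_xy, product), mi))   # True
```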
Definition 12. (Binary kl-divergence) Let \(p\) and \(q\) be biases of two Bernoulli random variables. The binary kl-divergence is given by \[\operatorname{D}_\mathrm{kl}(p\|q) = \operatorname{D}_\mathrm{KL}([1-p,p]\|[1-q,q]) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}\]
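A sketch of the binary kl-divergence, checked against the general \(\operatorname{D}_\mathrm{KL}\) applied to the two-point distributions \([1-p, p]\) and \([1-q, q]\) (plain Python; the function names and example biases are illustrative).

```python
import math

def binary_kl(p, q):
    """D_kl(p || q) for biases p in [0, 1] and 0 < q < 1, in shannons."""
    terms = []
    if p > 0:
        terms.append(p * math.log2(p / q))
    if p < 1:
        terms.append((1 - p) * math.log2((1 - p) / (1 - q)))
    return sum(terms)

def kl_divergence(p, q):
    """General D_KL for probability vectors of equal length."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(binary_kl(0.5, 0.25))                                   # ~0.2075
print(math.isclose(binary_kl(0.5, 0.25),
                   kl_divergence([0.5, 0.5], [0.75, 0.25])))  # True
```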
Definition 13. (Cross entropy) Let \(p(x)\) and \(q(x)\) be two probability distributions of a random variable \(X\). The cross entropy of \(p\) and \(q\) is defined as \[\operatorname{H}(p,q) = \operatorname{E}_p[-\log q] = \operatorname{H}(p) + \operatorname{D}_\mathrm{KL}(p\|q)\] For discrete \(p\) and \(q\), \[\operatorname{H}(p,q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x)\]
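Finally, a sketch of the cross entropy for discrete distributions, with a numerical check of the identity \(\operatorname{H}(p,q) = \operatorname{H}(p) + \operatorname{D}_\mathrm{KL}(p\|q)\) (plain Python; the example distributions are chosen for illustration).

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log q(x)
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
print(cross_entropy(p, q))                              # 1.75 Sh
print(math.isclose(cross_entropy(p, q),
                   entropy(p) + kl_divergence(p, q)))   # True
```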