# Basic Information Theory

Mort Yao

Basic information theory:

- Thomas M. Cover and Joy A. Thomas. *Elements of Information Theory, 2nd edition*.

**Definition 1. (Information entropy; Shannon entropy)** Let \(p(x)\) be the probability distribution of a discrete random variable \(X\) taking values from \(\mathcal{X}\). The *entropy* of random variable \(X\) is defined as \[\operatorname{H}(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)\] where \(0 \log 0\) is taken to be \(0\).

When the logarithms in the formula are taken to the base 2, the unit of information is called the *shannon* (or more commonly *bit*, symbolically \(\mathrm{Sh}\)). If a message is made of a sequence of bits, with all possible bit strings equally likely, the message’s information content expressed in shannons is equal to the length of the bit string.

When base-10 logarithms are used, the unit of information is called the *hartley* (or *ban*). When natural logarithms are used, the unit of information is called the *nat*.

\[1\,\mathrm{Sh} \approx 0.693\,\mathrm{nat} \approx 0.301\,\mathrm{Hart}\]
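The three units differ only by the base of the logarithm, which a few lines of Python make concrete (a minimal sketch; the fair 8-sided die is just an illustrative choice):

```python
import math

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution given as a list of
    probabilities, with the convention 0 log 0 = 0."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

# A fair 8-sided die carries log2(8) = 3 shannons of information.
p = [1 / 8] * 8
h_sh = entropy(p, base=2)         # 3 shannons (bits)
h_nat = entropy(p, base=math.e)   # 3 ln 2  ≈ 2.079 nats
h_hart = entropy(p, base=10)      # 3 log10(2) ≈ 0.903 hartleys
```

The ratios `h_nat / h_sh` and `h_hart / h_sh` reproduce the conversion factors 0.693 and 0.301 above.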

**Definition 2. (Binary entropy function)** Let \(p\) be the bias of a Bernoulli random variable \(X\). The entropy of the distribution of \(X\) is given by \[\operatorname{H}_b(p) = -p \log p - (1-p) \log (1-p)\] where \(0 \log 0\) is taken to be \(0\).

Notice that the binary entropy function \(\operatorname{H}_b(p)\) takes a real number \(p \in [0,1]\) instead of a random variable \(X\) or probability distribution \(p(x)\) as the parameter.
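A direct translation into Python (a minimal sketch, measuring entropy in shannons):

```python
import math

def binary_entropy(p):
    """H_b(p) in shannons, with 0 log 0 = 0."""
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

\(\operatorname{H}_b\) is symmetric about \(p = \frac{1}{2}\), where it attains its maximum of 1 shannon (a fair coin), and vanishes at \(p = 0\) and \(p = 1\) (a deterministic outcome).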

**Lemma 3.** For \(0 \leq k \leq n\), \[\frac{1}{n+1} e^{n \operatorname{H}_b(\frac{k}{n})} \leq \binom{n}{k} \leq e^{n \operatorname{H}_b(\frac{k}{n})}\] where the binary entropy \(\operatorname{H}_b\) is measured in nats (natural logarithm).
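The bound can be checked exhaustively for a small \(n\) (a sketch; the entropy here must be taken in nats for the base-\(e\) form of the bound):

```python
import math

def hb_nats(p):
    """Binary entropy in nats, with 0 log 0 = 0."""
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# Verify (1/(n+1)) e^{n H_b(k/n)} <= C(n, k) <= e^{n H_b(k/n)} for all k.
n = 20
for k in range(n + 1):
    upper = math.exp(n * hb_nats(k / n))
    assert upper / (n + 1) <= math.comb(n, k) <= upper
```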

**Definition 4. (Joint entropy)** Let \(p(x,y)\) be the joint probability of \(X=x\) and \(Y=y\) occurring together. The *joint entropy* of random variables \(X\) and \(Y\) is defined as \[\operatorname{H}(X,Y) = -\sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)\] where \(0 \log 0\) is taken to be \(0\).

**Lemma 5.** \[\max[\operatorname{H}(X), \operatorname{H}(Y)] \leq \operatorname{H}(X,Y) \leq \operatorname{H}(X) + \operatorname{H}(Y)\]

If \(X\) and \(Y\) are independent, we have \(\operatorname{H}(X,Y) = \operatorname{H}(X) + \operatorname{H}(Y)\), since \(p(x,y) = p(x)p(y)\).
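Both Lemma 5 and the additivity claim are easy to check numerically (a minimal sketch; the coin biases are arbitrary illustrative choices):

```python
import math

def H(probs):
    """Entropy in shannons of an iterable of probabilities (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Independent X (fair coin) and Y (biased coin): p(x, y) = p(x) p(y).
px = {0: 0.5, 1: 0.5}
py = {0: 0.9, 1: 0.1}
pxy = {(x, y): px[x] * py[y] for x in px for y in py}

joint = H(pxy.values())                        # H(X, Y)
marginal_sum = H(px.values()) + H(py.values()) # H(X) + H(Y)
```

For this independent pair, `joint` and `marginal_sum` agree (up to floating-point rounding), and both marginal entropies are bounded above by the joint entropy.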

**Definition 6. (Conditional entropy)** Let \(p(x,y)\) be the joint probability of \(X=x\) and \(Y=y\) occurring together. The *conditional entropy* of random variable \(X\) conditioned on \(Y\) is defined as \[\operatorname{H}(X|Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(y)}{p(x,y)}\] where \(0 \log \frac{0}{0} = 0\) and \(0 \log \frac{q}{0} = 0\).

**Lemma 7. (Chain rule for conditional entropy)** \[\operatorname{H}(X|Y) = \operatorname{H}(X,Y) - \operatorname{H}(Y)\]

**Theorem 8. (Bayes’ rule for conditional entropy)** \[\operatorname{H}(X|Y) + \operatorname{H}(Y) = \operatorname{H}(Y|X) + \operatorname{H}(X)\]
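Definition 6, the chain rule, and Bayes' rule can all be checked on a small correlated distribution (a sketch; the probabilities are an arbitrary example):

```python
import math

def H(probs):
    """Entropy in shannons (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A correlated joint distribution over {0, 1} x {0, 1}.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {a: sum(p for (x, _), p in pxy.items() if x == a) for a in (0, 1)}
py = {b: sum(p for (_, y), p in pxy.items() if y == b) for b in (0, 1)}

# H(X|Y) directly from the defining sum in Definition 6 ...
H_x_given_y = sum(p * math.log2(py[y] / p)
                  for (x, y), p in pxy.items() if p > 0)
H_y_given_x = sum(p * math.log2(px[x] / p)
                  for (x, y), p in pxy.items() if p > 0)
# ... agrees with H(X, Y) - H(Y), the chain rule.
chain = H(pxy.values()) - H(py.values())
```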

**Definition 9. (Mutual information)** Let \(p(x,y)\) be the joint probability of \(X=x\) and \(Y=y\) occurring together. The *mutual information* of random variables \(X\) and \(Y\) is defined as \[\operatorname{I}(X;Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\]

Mutual information is a measure of the inherent dependence expressed in the joint distribution of \(X\) and \(Y\), relative to the product of their marginal distributions — the joint distribution they would have if they were independent. Therefore, \(\operatorname{I}(X;Y) = 0\) if and only if \(X\) and \(Y\) are independent.

**Lemma 10.**\[\begin{align*} \operatorname{I}(X;Y) &= \operatorname{H}(X) + \operatorname{H}(Y) - \operatorname{H}(X,Y) \\ &= \operatorname{H}(X) - \operatorname{H}(X|Y) = \operatorname{H}(Y) - \operatorname{H}(Y|X) \\ &= \operatorname{H}(X,Y) - \operatorname{H}(X|Y) - \operatorname{H}(Y|X) \end{align*}\]
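These identities can be verified numerically against the defining sum in Definition 9 (a sketch reusing an arbitrary small joint distribution):

```python
import math

def H(probs):
    """Entropy in shannons (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {a: sum(p for (x, _), p in pxy.items() if x == a) for a in (0, 1)}
py = {b: sum(p for (_, y), p in pxy.items() if y == b) for b in (0, 1)}

# I(X;Y) from Definition 9.
mi = sum(p * math.log2(p / (px[x] * py[y]))
         for (x, y), p in pxy.items() if p > 0)

Hx, Hy, Hxy = H(px.values()), H(py.values()), H(pxy.values())
# mi == Hx + Hy - Hxy == Hx - H(X|Y) == Hy - H(Y|X)
#    == Hxy - H(X|Y) - H(Y|X), with H(X|Y) = Hxy - Hy etc.
```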

**Definition 11. (Relative entropy; discrimination information; information gain; Kullback-Leibler divergence)** Let \(p(x)\) and \(q(x)\) be two probability distributions of a random variable \(X\). The *Kullback-Leibler divergence* (or *relative entropy*) of \(p\) with respect to \(q\) is defined as \[\operatorname{D}_\mathrm{KL}(p\|q)
= \operatorname{E}_p\left[ \log\frac{p(X)}{q(X)} \right]
= \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
\] where \(0 \log \frac{0}{0} = 0\), \(0 \log \frac{0}{q} = 0\) and \(p \log \frac{p}{0} = \infty\).

In words, the Kullback-Leibler divergence is the expectation, taken with respect to \(p\), of the logarithmic difference between the probabilities \(p(x)\) and \(q(x)\). Although it is not symmetric and does not satisfy the triangle inequality (so it is not a true metric), it is a standard way of measuring how far one probability distribution is from another.

Mutual information can also be expressed as a Kullback–Leibler divergence: \[\operatorname{I}(X;Y) = \operatorname{D}_\mathrm{KL}(p(x,y)\|p(x)p(y))\]
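This identity can be checked directly: the KL divergence between the joint distribution and the product of the marginals recovers \(\operatorname{I}(X;Y)\) (a sketch with an arbitrary joint distribution; note that the divergence itself is not symmetric):

```python
import math

def kl(p, q):
    """D_KL(p || q) in shannons, for distributions given as aligned
    lists of probabilities (0 log 0 = 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {a: sum(p for (x, _), p in pxy.items() if x == a) for a in (0, 1)}
py = {b: sum(p for (_, y), p in pxy.items() if y == b) for b in (0, 1)}

keys = sorted(pxy)
joint = [pxy[k] for k in keys]
product = [px[x] * py[y] for x, y in keys]

mi = kl(joint, product)   # I(X;Y) as a KL divergence
# Same quantity from the mutual-information definition, for comparison.
mi_direct = sum(pxy[(x, y)] * math.log2(pxy[(x, y)] / (px[x] * py[y]))
                for x, y in keys)
```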

**Definition 12. (Binary KL divergence)** Let \(p\) and \(q\) be the biases of two Bernoulli random variables. The *binary KL divergence* is given by \[\operatorname{D}_\mathrm{kl}(p\|q) = \operatorname{D}_\mathrm{KL}([1-p,p]\|[1-q,q])
= p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}\]

**Definition 13. (Cross entropy)** Let \(p(x)\) and \(q(x)\) be two probability distributions of a random variable \(X\). The *cross entropy* of \(p\) and \(q\) is defined as \[\operatorname{H}(p,q) = \operatorname{E}_p[-\log q] = \operatorname{H}(p) + \operatorname{D}_\mathrm{KL}(p\|q)\] For discrete \(p\) and \(q\), \[\operatorname{H}(p,q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x)\] (Note that the notation \(\operatorname{H}(p,q)\) overloads that of joint entropy; the intended meaning is usually clear from context.)
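The decomposition \(\operatorname{H}(p,q) = \operatorname{H}(p) + \operatorname{D}_\mathrm{KL}(p\|q)\) is easy to confirm numerically (a minimal sketch with arbitrary example distributions):

```python
import math

def entropy(p):
    """H(p) in shannons (0 log 0 = 0)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(p, q):
    """D_KL(p || q) in shannons."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = E_p[-log q]."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
```

Since \(\operatorname{D}_\mathrm{KL}(p\|q) \geq 0\), the cross entropy is always at least \(\operatorname{H}(p)\), with equality exactly when \(q = p\).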