# Basic Information Theory

Basic information theory:

• Thomas M. Cover and Joy A. Thomas. Elements of Information Theory, 2nd edition.

Definition 1. (Information entropy; Shannon entropy) Let $$p(x)$$ be the probability distribution of a discrete random variable $$X$$ taking values from $$\mathcal{X}$$. The entropy of random variable $$X$$ is defined as $\operatorname{H}(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$ where $$0 \log 0$$ is taken to be $$0$$.

When the logarithms in the formula are taken to the base 2, the unit of information is called the shannon (or more commonly bit, symbolically $$\mathrm{Sh}$$). If a message is made of a sequence of bits, with all possible bit strings equally likely, the message’s information content expressed in shannons is equal to the length of the bit string.

When base-10 logarithms are used, the unit of information is called the hartley (or ban). When natural logarithms are used, the unit of information is called the nat.

$1\,\mathrm{Sh} \approx 0.693\,\mathrm{nat} \approx 0.301\,\mathrm{Hart}$

Definition 2. (Binary entropy function) Let $$p$$ be the bias of a Bernoulli random variable $$X$$. The entropy of the distribution of $$X$$ is given by $\operatorname{H}_b(p) = -p \log p - (1-p) \log (1-p)$ where $$0 \log 0$$ is taken to be $$0$$.

Notice that the binary entropy function $$\operatorname{H}_b(p)$$ takes a real number $$p \in [0,1]$$ instead of a random variable $$X$$ or probability distribution $$p(x)$$ as the parameter.

Lemma 3. $\frac{1}{n+1} e^{n \operatorname{H}(\frac{k}{n})} \leq \binom{n}{k} \leq e^{n \operatorname{H}(\frac{k}{n})}$

Definition 4. (Joint entropy) Let $$p(x,y)$$ be the joint probability of $$X=x$$ and $$Y=y$$ occurring together. The joint entropy of random variables $$X$$ and $$Y$$ is defined as $\operatorname{H}(X,Y) = -\sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)$ where $$0 \log 0$$ is taken to be $$0$$.

Lemma 5. $\max[\operatorname{H}(X), \operatorname{H}(Y)] \leq \operatorname{H}(X,Y) \leq \operatorname{H}(X) + \operatorname{H}(Y)$

If $$X$$ and $$Y$$ are independent, we have $$\operatorname{H}(X,Y) = \operatorname{H}(X) + \operatorname{H}(Y)$$, since $$p(x,y) = p(x)p(y)$$.

Definition 6. (Conditional entropy) Let $$p(x,y)$$ be the joint probability of $$X=x$$ and $$Y=y$$ occurring together. The conditional entropy of random variable $$X$$ conditioned on $$Y$$ is defined as $\operatorname{H}(X|Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(y)}{p(x,y)}$ where $$0 \log \frac{0}{0} = 0$$ and $$0 \log \frac{q}{0} = 0$$.

Lemma 7. (Chain rule for conditional entropy) $\operatorname{H}(X|Y) = \operatorname{H}(X,Y) - \operatorname{H}(Y)$

Theorem 8. (Bayes’ rule for conditional entropy) $\operatorname{H}(X|Y) + \operatorname{H}(Y) = \operatorname{H}(Y|X) + \operatorname{H}(X)$

Definition 9. (Mutual information) Let $$p(x,y)$$ be the joint probability of $$X=x$$ and $$Y=y$$ occurring together. The mutual information of random variables $$X$$ and $$Y$$ is defined as $\operatorname{I}(X;Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$

Mutual information is a measure of the inherent dependence expressed in the joint distribution of $$X$$ and $$Y$$ relative to the joint distribution of $$X$$ and $$Y$$ under the assumption of independence. Therefore, $$\operatorname{I}(X;Y) = 0$$ if and only if $$X$$ and $$Y$$ are independent.

Lemma 10. \begin{align*} \operatorname{I}(X;Y) &= \operatorname{H}(X) + \operatorname{H}(Y) - \operatorname{H}(X,Y) \\ &= \operatorname{H}(X) - \operatorname{H}(X|Y) = \operatorname{H}(Y) - \operatorname{H}(Y|X) \\ &= \operatorname{H}(X,Y) - \operatorname{H}(X|Y) - \operatorname{H}(Y|X) \end{align*}

Definition 11. (Relative entropy; discrimination information; information gain; Kullback-Leibler divergence) Let $$p(x)$$ and $$q(x)$$ be two probability distributions of a random variable $$X$$, the Kullback-Leibler divergence (or relative entropy) of $$p$$ with respect to $$q$$ is defined as $\operatorname{D}_\mathrm{KL}(p\|q) = \operatorname{E}_p\left[ \log\frac{p(X)}{q(X)} \right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$ where $$0 \log \frac{0}{0} = 0$$, $$0 \log \frac{0}{q} = 0$$ and $$p \log \frac{p}{0} = \infty$$.

In words, the Kullback-Leibler divergence is the expectation of the logarithmic difference between the probabilities $$p(x)$$ and $$q(x)$$, where the expectation is taken using the probabilities $$p(x)$$. Although it does not satisfy the triangle inequality, it is the way of measuring distances between two probability distributions.

Mutual information can also be expressed as a Kullback–Leibler divergence: $\operatorname{I}(X;Y) = \operatorname{D}_\mathrm{KL}(p(x,y)\|p(x)p(y))$

Definition 12. (Binary kl-divergence) Let $$p$$ and $$q$$ be biases of two Bernoulli random variables. The binary kl-divergence is given by $\operatorname{D}_\mathrm{kl}(p\|q) = \operatorname{D}_\mathrm{KL}([1-p,p]\|[1-q,q]) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}$

Definition 13. (Cross entropy) Let $$p(x)$$ and $$q(x)$$ be two probability distributions of a random variable $$X$$, the cross entropy of $$p$$ and $$q$$ is defined as $\operatorname{H}(p,q) = \operatorname{E}_p[-\log q] = \operatorname{H}(p) + \operatorname{D}_\mathrm{KL}(p\|q)$ For discrete $$p$$ and $$q$$, $\operatorname{H}(p,q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x)$