| $k$ | Central moment $\mu_k$ | Standardized moment $\hat\mu_k$ |
| :-: | --- | --- |
| 2 | $\mu_2 = \operatorname{E}[(X - \operatorname{E}[X])^2] = \operatorname{Var}(X) = \sigma^2$ (**Variance**) | $\hat\mu_2 = 1$ |
| 3 | $\mu_3 = \operatorname{E}[(X - \operatorname{E}[X])^3]$ | $\hat\mu_3 = \frac{\operatorname{E}[(X-\mu)^3]}{(\operatorname{E}[(X-\mu)^2])^{3/2}} = \operatorname{Skew}(X)$ (**Skewness**) |
| 4 | $\mu_4 = \operatorname{E}[(X - \operatorname{E}[X])^4]$ | $\hat\mu_4 = \frac{\operatorname{E}[(X-\mu)^4]}{(\operatorname{E}[(X-\mu)^2])^2} = \operatorname{Kurt}(X)$ (**Kurtosis**) |

**Remark 2.23.** In the case of a continuous probability distribution with probability density function $f$, where the expectation is calculated as $\operatorname{E}[X] = \int_{-\infty}^\infty x f(x) \, dx$, $\operatorname{E}[X]$ and all other moments can be undefined (example: the Cauchy distribution). For discrete probability distributions with finite support, however, all moments are well-defined.

**Remark 2.24.** In [statistics](/math/statistics/), when the exact expectations $\operatorname{E}[X]$ and $\operatorname{E}[Y]$ are unknown, the *sample covariance* (similarly for *sample variance* and *sample standard deviation*) is given by $$s_{XY} = \frac{1}{n-1} \sum_{i=1}^n (X_i-\bar{X})(Y_i-\bar{Y})$$ Notice that $\bar{X}$ and $\bar{Y}$ are *empirical means* (i.e., sample means) rather than expectations. The use of $\frac{1}{n-1}$ instead of $\frac{1}{n}$ is called *Bessel's correction*, which gives an unbiased estimator of the population covariance (and variance).

# Concentration Inequalities

Concentration inequalities provide bounds on how a random variable deviates from some value (typically one of its moments, e.g., the expectation).

**Theorem 3.1. (Markov's inequality)** Let $X$ be a non-negative random variable and $\varepsilon > 0$. Then $$\Pr[X \geq \varepsilon] \leq \frac{\operatorname{E}[X]}{\varepsilon}$$

**Proof.** Let $X$ take values from the set $\mathcal{X}$.
We have \begin{align*} \operatorname{E}[X] &= \sum_{x \in \mathcal{X}} x \Pr[X=x] \\ &\geq \sum_{x \in \mathcal{X},\, x<\varepsilon} 0 \cdot \Pr[X=x] + \sum_{x \in \mathcal{X},\, x \geq \varepsilon} \varepsilon \cdot \Pr[X=x] \\ &= \varepsilon \cdot \Pr[X \geq \varepsilon] \end{align*} [QED]

Markov's inequality extends to any strictly increasing and non-negative function $\varphi$: $$\Pr[X \geq \varepsilon] = \Pr[\varphi(X) \geq \varphi(\varepsilon)] \leq \frac{\operatorname{E}[\varphi(X)]}{\varphi(\varepsilon)}$$

Markov's inequality is the simplest and weakest upper bound based on the expectation of a random variable. Nevertheless, it forms the basis for many other stronger and more powerful bounds.

**Theorem 3.2. (Chebyshev's inequality)** Let $X$ be a random variable and $\delta > 0$. If $\operatorname{E}[X]$ and $\operatorname{Var}(X)$ are finite, then $$\Pr[|X - \operatorname{E}[X]| \geq \delta] \leq \frac{\operatorname{Var}(X)}{\delta^2}$$

**Proof.** Let $Y=(X-\operatorname{E}[X])^2$, a non-negative random variable. Applying Markov's inequality to $Y$, we have \begin{align*} \Pr[|X - \operatorname{E}[X]| \geq \delta] &= \Pr[(X - \operatorname{E}[X])^2 \geq \delta^2] \\ &\leq \frac{\operatorname{E}[(X - \operatorname{E}[X])^2]}{\delta^2} = \frac{\operatorname{Var}(X)}{\delta^2} \end{align*} [QED]

Chebyshev's inequality provides an upper bound on the probability that a random variable deviates from its expected value. It can be used to prove the (weak) law of large numbers (LLN), assuming that the variance $\operatorname{Var}(X)$ is finite.

**Corollary 3.3.** Let $X_1,\dots,X_n$ be pairwise-independent random variables with the same expectation $\mu$ and variance $\sigma^2$. Then for every $\delta > 0$, $$\Pr\left[\left|\frac{\sum_{i=1}^n X_i}{n} - \mu\right| \geq \delta\right] \leq \frac{\sigma^2}{\delta^2 n}$$

**Proof.** By linearity of expectation, we have $\operatorname{E}\left[\frac{\sum_{i=1}^n X_i}{n}\right] = \mu$.
Applying Chebyshev's inequality to the random variable $\frac{\sum_{i=1}^n X_i}{n}$, we have $$\Pr\left[\left|\frac{\sum_{i=1}^n X_i}{n} - \mu\right| \geq \delta\right] \leq \frac{\operatorname{Var}\left(\frac{\sum_{i=1}^n X_i}{n}\right)}{\delta^2}$$ Using pairwise independence (Lemma 2.19), it follows that $$\operatorname{Var}\left(\frac{\sum_{i=1}^n X_i}{n}\right) = \frac{1}{n^2} \sum_{i=1}^n \operatorname{Var}(X_i) = \frac{1}{n^2} \sum_{i=1}^n \sigma^2 = \frac{\sigma^2}{n}$$ [QED]

**Theorem 3.4. (Chernoff bound)** Let $X$ be a real-valued random variable. Then for every $\varepsilon > 0$ and $t > 0$, $$\Pr[X \geq \varepsilon] \leq \frac{\operatorname{E}[e^{tX}]}{e^{t\varepsilon}}$$ For every $\varepsilon > 0$ and $t < 0$, $$\Pr[X \leq \varepsilon] \leq \frac{\operatorname{E}[e^{tX}]}{e^{t\varepsilon}}$$ Both bounds follow from applying Markov's inequality to the non-negative random variable $e^{tX}$.

**Lemma 3.5. (Hoeffding's lemma)** Let $X$ be a random variable such that $\Pr[a \leq X \leq b] = 1$. Then for any $\lambda \in \mathbb{R}$, $$\operatorname{E}[e^{\lambda X}] \leq e^{\lambda \operatorname{E}[X] + \frac{\lambda^2 (b-a)^2}{8}}$$

The function $f(\lambda) = \operatorname{E}[e^{\lambda X}]$ is known as the *moment generating function (mgf)* of $X$, since $f^{(k)}(0) = \operatorname{E}[X^k]$ is the $k$th raw moment of $X$.

**Theorem 3.6. (Hoeffding's inequality)** Let $X_1, \dots, X_n$ be independent real-valued random variables such that for each $i \in \{1,\dots,n\}$ there exist $a_i \leq b_i$ with $\Pr[a_i \leq X_i \leq b_i] = 1$. Then for every $\varepsilon > 0$, $$\Pr\left[\sum_{i=1}^n X_i - \operatorname{E}\left[\sum_{i=1}^n X_i\right] \geq \varepsilon \right] \leq e^{-2\varepsilon^2 / \sum_{i=1}^n (b_i-a_i)^2}$$ and $$\Pr\left[\sum_{i=1}^n X_i - \operatorname{E}\left[\sum_{i=1}^n X_i\right] \leq -\varepsilon \right] \leq e^{-2\varepsilon^2 / \sum_{i=1}^n (b_i-a_i)^2}$$

**Corollary 3.7.
(Two-sided Hoeffding's inequality)** $$\Pr\left[\left|\sum_{i=1}^n X_i - \operatorname{E}\left[\sum_{i=1}^n X_i\right]\right| \geq \varepsilon \right] \leq 2e^{-2\varepsilon^2 / \sum_{i=1}^n (b_i-a_i)^2}$$

Hoeffding's inequality provides an upper bound on the probability that a sum of random variables deviates from its expected value. It is also useful for analyzing the number of samples needed to obtain a confidence interval.

**Corollary 3.8.** Let $X_1, \dots, X_n$ be independent real-valued random variables such that for each $i \in \{1,\dots,n\}$, $X_i \in [0,1]$ and $\operatorname{E}[X_i] = \mu$. Then for every $\varepsilon > 0$, $$\Pr\left[ \frac{1}{n} \sum_{i=1}^n X_i - \mu \geq \varepsilon \right] \leq e^{-2n\varepsilon^2}$$ and $$\Pr\left[ \mu - \frac{1}{n} \sum_{i=1}^n X_i \geq \varepsilon \right] \leq e^{-2n\varepsilon^2}$$

# Discrete Probability Distributions

A discrete probability distribution is characterized by a probability mass function (pmf).

## Bernoulli Distribution

The *Bernoulli distribution* is the discrete probability distribution of a random variable which takes the value $1$ with success probability $p$ and the value $0$ with failure probability $q=1-p$. A discrete random variable $X$ taking values from $\{0,1\}$ is called a *Bernoulli random variable*. The parameter $p = \Pr[X=1]$ is called the *bias* of $X$.

**pmf.** $$f(k;p) = \begin{cases} p & \text{if } k=1, \\ 1-p & \text{if } k=0.\end{cases}$$ Alternatively, $$f(k;p) = p^k (1-p)^{1-k} \quad \text{for } k\in\{0,1\}$$

**Mean.** $$\operatorname{E}[X] = 0 \cdot (1-p) + 1 \cdot p = p = \Pr[X=1]$$

**Variance.** $$\operatorname{Var}(X) = \operatorname{E}[X^2] - \operatorname{E}[X]^2 = p - p^2 = p(1-p)$$ Since $p \in [0,1]$, we have $\operatorname{Var}(X) \leq \frac{1}{4}$.

**Entropy.
(Binary entropy function)** $$\operatorname{H}(p) = -p \log p - (1-p) \log (1-p)$$

## Binomial Distribution

The *binomial distribution* is the discrete probability distribution of the number of successes in a sequence of $n$ independent and identically distributed Bernoulli trials. A discrete random variable $X$ taking values $k \in \{0,1,\dots,n\}$ is called a *binomial random variable* with parameters $n$ and $p$ if it satisfies the following probability distribution: $$\Pr[X=k] = \binom{n}{k} p^k (1-p)^{n-k}$$

**pmf.** $$f(k;n,p) = \Pr[X=k] = \binom{n}{k} p^k(1-p)^{n-k}$$ $$X \sim \text{B}(n,p)$$

**Mean.** $$\operatorname{E}[X] = np$$

**Variance.** $$\operatorname{Var}(X) = np(1-p)$$

**Entropy.** $$\operatorname{H}(n,p) = \frac{1}{2} \log (2 \pi e np(1-p)) + \mathcal{O}\left(\frac{1}{n}\right)$$

Note that a binomial random variable can be represented as a sum of independent, identically distributed Bernoulli random variables: $$X = \sum_{k=1}^n X_k \sim \text{B}(n,p)$$ Thus, the Bernoulli distribution is the special case $\text{B}(1,p)$ of the binomial distribution.

## Negative Binomial Distribution

The *negative binomial distribution* is the discrete probability distribution of the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified number of failures (denoted $r$) occurs.

**pmf.** $$f(k;r,p) = \Pr[X=k] = \binom{k+r-1}{k} (1-p)^r p^k$$ $$X \sim \text{NB}(r,p)$$

**Mean.** $$\operatorname{E}[X] = \frac{pr}{1-p}$$

**Variance.** $$\operatorname{Var}(X) = \frac{pr}{(1-p)^2}$$

When $r$ is an integer, the negative binomial distribution is also known as the *Pascal distribution*.

## Geometric Distribution

The *geometric distribution* is the discrete probability distribution of the number of failures in a sequence of independent and identically distributed Bernoulli trials before the first success.
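This "failures before the first success" count is easy to simulate. The following Python sketch (the helper name `geometric_sample` is my own) draws Bernoulli trials until a success occurs and compares the empirical mean with the value $\frac{1-p}{p}$ given below:

```python
import random

def geometric_sample(p, rng):
    """Count failures before the first success in i.i.d. Bernoulli(p) trials."""
    failures = 0
    while rng.random() >= p:  # a single trial fails with probability 1 - p
        failures += 1
    return failures

rng = random.Random(0)  # fixed seed for reproducibility
p = 0.3
n = 100_000
samples = [geometric_sample(p, rng) for _ in range(n)]
empirical_mean = sum(samples) / n
print(f"empirical mean = {empirical_mean:.3f}, (1-p)/p = {(1 - p) / p:.3f}")
```

By Corollary 3.3, the empirical mean concentrates around $\frac{1-p}{p}$ as $n$ grows.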
**pmf.** $$f(k;p) = (1-p)^k p$$ $$X \sim \text{Geom}(p)$$

**Mean.** $$\operatorname{E}[X] = \frac{1-p}{p}$$

**Variance.** $$\operatorname{Var}(X) = \frac{1-p}{p^2}$$

**Entropy.** $$\operatorname{H}(p) = \frac{-p \log p - (1-p) \log (1-p)}{p}$$

The geometric distribution $\text{Geom}(p)$ is the special case $\text{NB}(1,1-p)$ of the negative binomial distribution.

## Discrete Uniform Distribution

The *discrete uniform distribution* is the probability distribution whereby a finite number of values are equally likely to be observed.

**pmf.** $$f(k;a,b) = \frac{1}{b-a+1} \quad \text{for } k \in \{a,a+1,\dots,b\}$$ $$X \sim \mathcal{U}\{a,b\}$$

**Mean.** $$\operatorname{E}[X] = \frac{a+b}{2}$$

**Variance.** $$\operatorname{Var}(X) = \frac{(b-a+1)^2-1}{12}$$

**Entropy.** $$\operatorname{H}(a,b) = \log (b-a+1)$$

## Summary

| | With replacement | Without replacement |
| ------------------------ | :---------------: | :-----------------: |
| Given number of draws | *Binomial distribution* (special case when $n=1$: *Bernoulli distribution*) | *Hypergeometric distribution* |
| Given number of failures | *Negative binomial distribution* (special case when $r=1$: *Geometric distribution*) | *Negative hypergeometric distribution* |

# Continuous Probability Distributions

A continuous probability distribution is characterized by a probability density function (pdf).

## [Normal Distribution](distributions/normal/) (Gaussian Distribution)

## [Gamma Distribution](distributions/gamma/) (Special cases: chi-squared distribution, exponential distribution)

---

Related topics:

* [Statistics](/math/statistics/)
* [Information theory](/info/)

Mathematical theory:

* [Measure theory](/math/analysis/measure/)
* Advanced probability theory

Fields of application:

* Algorithms
* Computational biology and biostatistics
* Cryptography and cryptanalysis
* Game theory and operations research
* Machine learning
* Computer vision and image processing
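As a closing sanity check, the discrete-distribution identities above (the binomial mean $np$ and the relation between $\text{Geom}(p)$ and $\text{NB}(1,1-p)$) can be spot-checked numerically. The Python sketch below uses the pmfs exactly as defined in this section; the function names are my own:

```python
from math import comb, isclose

def binomial_pmf(k, n, p):
    # number of successes k in n Bernoulli(p) trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

def neg_binomial_pmf(k, r, p):
    # number of successes k before the r-th failure
    return comb(k + r - 1, k) * (1 - p)**r * p**k

def geometric_pmf(k, p):
    # number of failures k before the first success
    return (1 - p)**k * p

p, n = 0.3, 10

# The binomial pmf sums to 1 and has mean n*p
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
assert isclose(total, 1.0) and isclose(mean, n * p)

# Geom(p) agrees with NB(1, 1-p) term by term
assert all(isclose(geometric_pmf(k, p), neg_binomial_pmf(k, 1, 1 - p))
           for k in range(50))
```

The same pattern works for any of the pmfs in this section: sum $k \cdot f(k)$ and $k^2 \cdot f(k)$ over the support and compare with the stated mean and variance.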