Distributions

Although the space of possible probability distribution functions is infinite, there are a handful of common distributions that can be used to approximately model a wide range of problems. We will describe these distributions here with an emphasis on motivating where they came from and describing their most useful properties and applications.

Bernoulli Distribution

The Bernoulli distribution is possibly the simplest distribution in statistics. It is a 1-parameter distribution whose range consists only of the set {0, 1}. It is defined simply as:

\begin{aligned} \operatorname{ber}(1 \mid p) &= p \\ \operatorname{ber}(0 \mid p) &= 1 - p \end{aligned}

The simplest example is a weighted coin flip where heads is valued at 1 and tails at 0: with probability p, the coin lands on heads and with probability 1-p, the coin lands on tails.

A Bernoulli random variable can represent anything that has a binary outcome, such as the outcome of a sports game, whether you win the lottery, or whether or not an event occurs over a fixed period of time. Because it is so versatile, it forms the basis for many of the distributions that will follow.

The Bernoulli distribution is simple because its distribution is fully specified by its definition. For other distributions, it will take more work to go from their definition (such as "a series of coin flips") to the mathematical function representing their probability distribution.
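
As a quick numerical sketch of this (using numpy; the particular value of p below is arbitrary), one can draw Bernoulli samples and check that the empirical frequency of 1s matches p:

    import numpy as np

    p = 0.3                                  # probability of "heads" (1)
    rng = np.random.default_rng(0)

    # Draw 100,000 Bernoulli samples: 1 with probability p, 0 otherwise
    samples = (rng.random(100_000) < p).astype(int)

    print("ber(1 | p) =", p)
    print("empirical frequency of 1s:", samples.mean())   # should be close to p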

Binomial Distribution

The binomial distribution is an extension of the Bernoulli distribution: it represents an aggregate of multiple Bernoulli random variables (all with the same probability of heads, p). Specifically, the binomial distribution is a 2-parameter distribution that describes the number of Bernoulli random variables, n, whose value is 1 (or "heads" or "up") out of a possible N.

For example, imagine we have 3 Bernoulli variables, b_1, b_2, and b_3 (each with the same probability of heads p), and we draw values from them. Possible outcomes of the Bernoulli variables and the total number of 1 outcomes are:

0,0,0 \rightarrow 0 \\ 0,0,1 \rightarrow 1 \\ 0,1,0 \rightarrow 1 \\ 0,1,1 \rightarrow 2 \\ 1,0,0 \rightarrow 1 \\ 1,0,1 \rightarrow 2 \\ 1,1,0 \rightarrow 2 \\ 1,1,1 \rightarrow 3

The binomial distribution here is the distribution of the total number of heads. In this case, N=3 and the probability of the various outcomes is given by:

\begin{aligned} \operatorname{binom}(0 \mid N=3, p) &= (1-p)(1-p)(1-p) \\ \operatorname{binom}(1 \mid N=3, p) &= (1-p)(1-p)p + (1-p)p(1-p) \\ &\quad + p(1-p)(1-p) \\ \operatorname{binom}(2 \mid N=3, p) &= (1-p)pp + p(1-p)p + pp(1-p) \\ \operatorname{binom}(3 \mid N=3, p) &= ppp \end{aligned}

One can create such a table for any N just by enumerating the possibilities, just as we did above with N=3. Thus, it's simple to calculate the binomial distribution from first principles, starting with the Bernoulli distribution (though it may be quite tedious for large N).

One could program a computer to perform these calculations, thereby fully specifying the binomial distribution. To make the calculations simpler and easier to write down, one can leverage combinatorics. Doing so, the binomial distribution may be written as:

Binom(n | N, p) = \frac{N!}{n!(N-n)!}p^n(1-p)^{(N-n)}
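
One can check this equivalence directly with a short program (plain Python; the values of N and p below are arbitrary) that performs the enumeration described above and compares it to the combinatorial formula:

    from itertools import product
    from math import comb

    N, p = 3, 0.4

    # Enumerate all 2^N sequences of Bernoulli outcomes and accumulate their
    # probabilities, grouped by the total number of 1s ("heads") in the sequence.
    enumerated = {n: 0.0 for n in range(N + 1)}
    for outcome in product([0, 1], repeat=N):
        prob = 1.0
        for b in outcome:
            prob *= p if b == 1 else (1 - p)
        enumerated[sum(outcome)] += prob

    # Compare against the closed-form expression N!/(n!(N-n)!) p^n (1-p)^(N-n)
    for n in range(N + 1):
        closed_form = comb(N, n) * p**n * (1 - p)**(N - n)
        print(n, round(enumerated[n], 6), round(closed_form, 6))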

Beta Distribution

The Beta Distribution is closely related to the Binomial distribution. As described above, the binomial distribution represents the probability of flipping n heads out of a total of N using a weighted coin with probability of heads p:

p(n | N, p) = Binom(n | N, p)

Imagine we have such a coin but don't know the value of its weight factor p. We could try to infer it by flipping the coin N times and counting the number of heads, n, that we get. We can invert this to obtain the probability of p in terms of n and N. Recall that Bayes' theorem tells us that we can invert the probability according to the following:

P(A | B) = \frac{P(A) P(B | A)}{\int P(B | A) P(A) dA}

Applying this to our binomial distribution of coin flips and assuming a flat prior on p such that P(p) \propto 1, we get:

P(p | n, N) = \frac {P(p) P(n | p, N)} {\int P(n | p, N) P(p) dp}

which, when we plug in the definition of the binomial distribution for P(n | p, N), gives us:

P(p | n, N) = \frac {p^n(1-p)^{N-n}} {\int x^n(1-x)^{N-n} dx}

where we have canceled out factors of \frac{N!}{n!(N-n)!}. We then make the following definitions:

\begin{aligned} \alpha &= n + 1 \\ \beta &= N - n + 1 \\ B(\alpha, \beta) &= \int_0^1 x^{\alpha-1} (1-x)^{\beta-1} dx \end{aligned}

which gives us

Beta(p | \alpha, \beta) = \frac {p^{\alpha-1} (1-p)^{\beta-1}} {B(\alpha, \beta)}

This is the definition of the Beta distribution. The interpretation is that we can use it to infer the distribution of the parameter p of a binomial distribution given the counts of heads and tails.

The Beta distribution turns out to be the "conjugate prior" to the binomial distribution. Mathematically, this means that:

\begin{aligned} P(p | \alpha, \beta, n, N) &= \frac{Binom(n | N, p) \, Beta(p | \alpha, \beta)} {\int Binom(n | N, x) \, Beta(x | \alpha, \beta) dx} \\ &= Beta(p | \alpha', \beta') \end{aligned}

In other words, if we have a prior belief about p that is represented by \alpha and \beta, and we measure n heads out of a total of N draws from a binomial distribution, then our posterior belief in the distribution of p is also a beta distribution with new parameters \alpha' and \beta', where

\begin{aligned} \alpha' &= \alpha + n \\ \beta' &= \beta + (N-n) \end{aligned}

This means that if we have a prior belief about p represented by \alpha and \beta and we observe h heads and t tails, then our posterior is represented by \alpha+h and \beta+t. This makes the process of updating extremely easy, which is why the Beta distribution is so often used when performing inference on binomially distributed data.
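
As a small sketch of this updating rule (using scipy.stats; the prior and the observed counts below are arbitrary), the posterior after observing h heads and t tails is obtained simply by shifting the Beta parameters:

    from scipy.stats import beta

    # Prior belief about p, expressed as a Beta distribution
    alpha_prior, beta_prior = 2.0, 2.0

    # Observed coin flips: h heads and t tails
    h, t = 30, 10

    # Conjugate update: alpha' = alpha + h, beta' = beta + t
    posterior = beta(alpha_prior + h, beta_prior + t)

    print("posterior mean of p:   ", posterior.mean())            # close to 30/40
    print("95% credible interval: ", posterior.interval(0.95))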

Poisson Distribution

The Poisson distribution can be thought of as a generalization of the Bernoulli distribution. The Bernoulli distribution represents an event which can take on values of 0 or 1 and has an intrinsic probability of occurrence p. The Poisson distribution, in contrast, represents an event that can happen any integer number of times (0, 1, 2, 3...). The only requirement is that occurrences are independent: the occurrence or non-occurrence of an event cannot make a repeat occurrence more or less likely.

Examples are the number of raindrops hitting your roof in a short period of time, or the number of popcorn kernels that pop in your microwave in a small interval, or the number of people who arrive at a bank between 3:00 and 3:15.

The Poisson distribution is a 1-parameter model, and the single parameter, \lambda, represents the expected number of event occurrences (usually in some unit time interval). The typical setup is to have some ground truth for how many events are expected to occur on average. For example, one could have measured over the past year the average number of people who walk into a bank in a 15 minute interval. Based on that average, and assuming that people's schedules are independent, one can assume that the number of people arriving at the bank in a given time interval is described by the Poisson distribution.

One way to derive the Poisson distribution is to think of a period of time T, during which we expect on average \lambda events to occur, as consisting of many infinitesimal periods of time, \delta t. We assume that the \delta t windows are small enough that the probability of more than 1 event occurring in each window is vanishingly small. We can therefore model each window as a Bernoulli random variable with probability \lambda \delta t / T. The question we are asking is, "How many total events occurred in time T?", which, with these assumptions, becomes, "How many of these individual Bernoulli events resulted in success?"

This is the equivalent of a binomial distribution where we have MANY individual coin flips, each with a very small probability of being heads. If we split T into N windows of width \delta t = T/N, then in each small period of time the probability of an event occurring is \lambda \delta t / T = \lambda (T/N) / T = \lambda / N.

We therefore take the binomial equation binom(n | N, p) and make the replacements:

  • N \rightarrow \infty
  • p = \lambda/N

Making these substitutions and taking the limit, one arrives at the formula for the Poisson distribution:

pois(n | \lambda) = \frac{\lambda^n e^{-\lambda}}{n!}

which describes the probability of n events occurring in a time period during which we expect \lambda events on average.
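
This limiting argument can be checked numerically (a quick sketch using scipy.stats): holding \lambda fixed and letting N grow with p = \lambda/N, the binomial probabilities approach the Poisson probabilities:

    from scipy.stats import binom, poisson

    lam = 3.0          # expected number of events in the time interval
    k = 2              # ask for the probability of seeing exactly k events

    for N in [10, 100, 10_000]:
        p = lam / N    # probability of an event in each of the N small windows
        print(N, binom.pmf(k, N, p), poisson.pmf(k, lam))
    # The binomial column converges to the Poisson value as N grows.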

Gaussian Distribution

Unlike the previous distributions, we will not motivate the Gaussian distribution from a specific scenario (such as flipping coins), but instead will motivate it from a more general fact known as the Central Limit Theorem. The Central Limit Theorem states that the distribution of the sum of MANY independent random variables, each having an arbitrary distribution (with some loose conditions on those distributions), will tend to a Gaussian distribution. For example, if I add together many Bernoulli variables, their sum will tend toward a gaussian distribution (we know that their sum is exactly a Binomial distribution, which also implies that a Binomial distribution tends toward a Gaussian distribution as N \rightarrow \infty).

This is a complicated theorem to prove, so we will here only assert it. However, it is very powerful and implies that Gaussian distributions will arise naturally in many situations. For example, consider a process that has many small errors. Even if we don't know the distribution of individual errors, if there are sufficiently many independent errors, their sum, the total error, will follow a Gaussian distribution. Therefore, it is common to model the errors whose true distribution is unknown as a Gaussian (since the total error will often be the sum of many individual errors).

The Gaussian distribution describes a 1-dimensional continuous variable, x, as a function of two parameters: \mu and \sigma. The distribution is given by:

gauss(x | \mu, \sigma) = \frac{1}{\sqrt{2\sigma^2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

Using the formula above, one can show that the mean of the distribution is \mu and the standard deviation is given by \sigma. In fact, the parameters \mu and \sigma are often referred to as the "mean" and "standard deviation".

Because the gaussian distribution is so ubiquitous, it is worthwhile to spend some time understanding its properties. Imagine we draw n points from a gaussian describing the random variable x and take the mean of those measured points. We define that "sample mean" of x to be \bar{x}, noting that this refers to the mean of a specific realized sample. The quantity \bar{x} is itself a random variable (as it is just a function of the data) and therefore it has a distribution that depends on the model and its parameters. One can mathematically show that the distribution of \bar{x} is itself a gaussian and is given by

p(\bar{x} | \mu, \sigma, n) = Gauss(\mu, \frac{\sigma}{\sqrt{n}})

It is a remarkable mathematical fact that the sample mean follows the same distribution family as the underlying gaussian itself; this is not a property that most other distributions have.

The \frac{1}{\sqrt{n}} part of that formula is important: it means that the standard deviation of the sample mean shrinks as 1/\sqrt{n} as you draw more and more points. If you are trying to measure the value of \mu of a gaussian distribution, you can do so by observing many points drawn from it and calculating the sample mean of those points, \bar{x}. If you observe many points, the distribution of \bar{x} will be tightly centered around its expected value, \mu, and you can use that to infer the true value of \mu (we describe this in a loose sense here and will go into more precise detail in later sections).
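
A short simulation (using numpy; the parameter values are arbitrary) makes the \frac{1}{\sqrt{n}} behavior of the sample mean concrete:

    import numpy as np

    mu, sigma, n = 5.0, 2.0, 100
    rng = np.random.default_rng(1)

    # Draw many independent samples of size n and compute the sample mean of each
    sample_means = rng.normal(mu, sigma, size=(50_000, n)).mean(axis=1)

    print("mean of the sample means:", sample_means.mean())   # close to mu
    print("std of the sample means: ", sample_means.std())    # close to sigma/sqrt(n) = 0.2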

Chi-Squared Distribution

The Chi-Squared distribution is closely related to the gaussian distribution.

Let's say we have a gaussian-distributed random variable x_1 \sim gauss(\mu, \sigma) with known parameters \mu and \sigma. Define another random variable, named z_1^2, by:

z_1^2 = \frac{(x_1 - \mu)^2}{\sigma^2}

(By this, we mean that one produces a draw of the random variable named z_1^2 by drawing an instance of x_1 from the gaussian above, subtracting \mu, dividing by \sigma, and squaring the result).

The distribution of our newly defined variable z_1^2 is known as the Chi-Squared distribution with "1 degree of freedom".

Imagine we instead draw two independent values, x_1 and x_2, from that same gaussian distribution (or from two gaussians with identical parameters) and define the random variable:

z_2^2 = \frac{(x_1 - \mu)^2}{\sigma^2} + \frac{(x_2 - \mu)^2}{\sigma^2}

The distribution of this variable is known as the Chi-Squared distribution with "2 degrees of freedom".

To generalize this, imagine we consider N different gaussian distributions, each with known but possibly different values of \mu_i and \sigma_i, and each statistically independent of the others. We draw a value from each to form the set \{x_i\} and define:

z^2_N = \sum_i \frac{(x_i - \mu_i)^2}{\sigma_i^2}

The distribution of z^2_N is the Chi-Squared distribution with "N degrees of freedom". The mean of the Chi-Squared distribution with N degrees of freedom is N (simply enough). By construction, the distribution of the sum of two independent Chi-Square distributed variables is itself Chi-Square distributed:

\chi^2_n + \chi^2_m \sim \chi^2_{n+m}

The domain of the Chi-Squared distribution is all non-negative real numbers. The formula for the probability distribution function is given by:

\chi^2(x, n) = \frac {x^{\frac{n}{2} - 1}e^{-x/2}} {2^{n/2}\Gamma(\frac{n}{2})}

where \Gamma is the Gamma Function.
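
A simulation sketch of this construction (using numpy and scipy; the means and standard deviations below are arbitrary) shows the sum of squared standardized gaussians behaving like a Chi-Squared with N degrees of freedom:

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(2)
    mus = np.array([0.0, 1.0, -2.0, 5.0])     # mean of each gaussian
    sigmas = np.array([1.0, 0.5, 2.0, 3.0])   # standard deviation of each gaussian
    N = len(mus)

    # Many draws of the set {x_i}, one column per gaussian
    x = rng.normal(mus, sigmas, size=(100_000, N))

    # z^2_N = sum_i (x_i - mu_i)^2 / sigma_i^2 for each draw
    z2 = (((x - mus) / sigmas) ** 2).sum(axis=1)

    print("empirical mean of z^2_N:", z2.mean())     # close to N = 4
    print("chi-squared(N) mean:    ", chi2.mean(N))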

Important in the definition of the Chi-Squared distribution is the requirement that the gaussians all be independent. However, it is common to encounter statistical situations involving the sum of squares of gaussians that are not all independent, but instead depend on each other through the presence of one or more linear constraints on their values. A linear constraint is a fixed linear relationship between the values of these gaussian random variables, which typically takes the form of the sum of two or more of the variables equaling some fixed value.

If I have n gaussian variables, g_1, ..., g_n, each of which would be a standard gaussian (mean 0, variance 1) in the absence of the constraints, and I have m linear constraints on the values of these variables, then one can show that the sum of the squares of these variables is distributed as:

\sum_i g_i^2 \sim \chi^2_{n-m}

In the typical language, the degrees of freedom of the chi-squared is given by the number of independent variables minus the number of constraints applied. This is known as Cochran's theorem, which more generally states that sums of quadratic terms of gaussian variables can be expressed as the sum of terms where each term is distributed as a Chi-Squared, and the number of degrees of freedom of each chi-squared is the number of linearly independent combinations of the x_i variables in that term.

The typical proof of this describes the linear constraint as a projection operator that maps the space of constrained-and-correlated gaussian variables to a subspace in which the gaussians are uncorrelated. It can be shown that this projection preserves the sum of the squares in the original space (essentially because the trace of a matrix is invariant under orthogonal transformations).

Gaussian Distribution, continued

With the Chi-Squared distribution in hand, we can state another property of the gaussian distribution. Imagine that we draw n samples from a gaussian distribution with mean \mu and standard deviation \sigma. We then calculate the quantity:

Z^2 = \frac{1}{n} \sum (x_i - \mu)^2

This is like a sample variance, but we are using the true mean, \mu, and not the sample mean. We can re-write this equation as:

\frac{Z^2 n}{\sigma^2} = \sum \frac{(x_i - \mu)^2}{\sigma^2}

We see that the quantity on the right-hand side of this equation is described by a Chi-Squared distribution with n degrees of freedom (as the x_i are drawn from a gaussian with mean \mu and standard deviation \sigma). We can then directly state that the distribution of Z^2 is given by:

\frac{Z^2}{\sigma^2} \sim \frac{1}{n} \chi^2_{n}

Note that, in the above equation, we are assuming that we know the true standard deviation \sigma. Thus, if we know \mu and \sigma, we can calculate the distribution of Z^2.

More useful, however, is an expression for the distribution of the sample variance, s^2, which we define as

s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}

To obtain this distribution, we start with the quantities

U_i = \frac{x_i - \mu}{\sigma}

and create the following expression:

\sum U_i^2 = \sum \left(\frac{x_i - \mu}{\sigma}\right)^2 = \sum \left(\frac{x_i - \bar{x}}{\sigma}\right)^2 + n \left(\frac{\bar{x} - \mu}{\sigma}\right)^2

This final quantity is the sum of quadratic terms in the x_i, where the x_i are independent. By Cochran's theorem, described above, we can show the following important facts:

  • \sum(x_i - \bar{x})^2 /\sigma^2 is distributed as a Chi-Squared with n-1 degrees of freedom
  • n (\bar{x} - \mu)^2 / \sigma^2 is distributed as a Chi-Squared with 1 degree of freedom
  • These two quantities are independent of each other

Using the first bullet point above and the definition of s^2, we can see that

s^2 \sim \frac{\sigma^2}{(n-1)} \chi^2_{n-1}

And since the term in the second bullet point is a function of the sample mean and is independent of the term in the first bullet point, we learn that s^2 is independent of the sample mean. The fact that the sample mean and sample variance are independent is unique to the gaussian distribution and in fact characterizes it; the independence itself is typically proven using Basu's theorem. These facts will be important when performing inference on the gaussian distribution (trying to infer \sigma and \mu given a sample of gaussian-distributed data). This will be discussed in a later section.

To summarize, if we draw n points from a gaussian distribution, the distributions for the sample mean \bar{x} and the sample variance s^2 are given by:

\begin{aligned} \bar{x} &\sim gauss(\mu, \frac{\sigma}{\sqrt{n}}) \\ s^2 &\sim \frac{\sigma^2}{(n-1)} \chi^2_{n-1} \end{aligned}

and these are independent of each other.
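
The following sketch (using numpy; the parameter values are arbitrary) checks both of these distributional claims and the independence of the sample mean and sample variance:

    import numpy as np

    mu, sigma, n = 3.0, 1.5, 10
    rng = np.random.default_rng(3)

    data = rng.normal(mu, sigma, size=(100_000, n))
    xbar = data.mean(axis=1)
    s2 = data.var(axis=1, ddof=1)            # sample variance with the n-1 denominator

    print("std of sample mean:", xbar.std(), "vs expected", sigma / np.sqrt(n))
    print("mean of s^2:       ", s2.mean(), "vs expected", sigma**2)
    # (sigma^2/(n-1)) * chi^2_{n-1} has variance 2*sigma^4/(n-1)
    print("var of s^2:        ", s2.var(), "vs expected", 2 * sigma**4 / (n - 1))
    print("corr(xbar, s^2):   ", np.corrcoef(xbar, s2)[0, 1])     # close to 0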

Student's t-distribution

Imagine we have a standard-gaussian-distributed variable Z (mean 0 and variance 1) and a Chi-Square distributed variable V with N degrees of freedom, where Z is independent of V. We define the student's t distribution as the distribution of the quantity:

t = \frac{Z}{\sqrt{V/N}}

The canonical example motivating this distribution is similar to the example motivating the Chi-Squared distribution. Imagine that we have a single gaussian distribution with true mean \mu and true standard deviation \sigma and we draw n points from that distribution, x_1, x_2, ..., x_n. From the above section, we know that:

  • The quantity \sqrt{n}\frac{\bar{x} - \mu}{\sigma} is distributed as a standard gaussian
  • The quantity \sum{\frac{(x_i - \bar{x})^2}{\sigma^2}} is Chi-Square distributed with n-1 degrees of freedom
  • These two quantities are independent of each other

Therefore, by the definition of the student's t distribution, we know that the quantity:

t = \frac{\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\sum{\frac{(x_i - \bar{x})^2}{\sigma^2}}/(n-1)}}

follows the student's t distribution. We can then define the sample variance as

s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2

and cancel out the factors of \sigma in the definition of t to obtain:

t = \frac{\bar{x} - \mu}{\sqrt{s^2 / n}}

which, by construction, follows the student's t distribution with n-1 degrees of freedom. The important aspect of this quantity is that it depends on the true mean \mu but does not depend on the true standard deviation \sigma (it canceled out above). We will later show that we can use this test statistic to perform inference on the true mean \mu without knowing or assuming the true standard deviation \sigma (we only need to assume that the underlying distribution is a Gaussian).
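
A simulation of this statistic (a sketch using numpy and scipy; the parameter values, including \sigma, are arbitrary) confirms that it matches the student's t distribution with n-1 degrees of freedom:

    import numpy as np
    from scipy.stats import t as t_dist

    mu, sigma, n = 10.0, 7.0, 8        # sigma cancels out of the statistic
    rng = np.random.default_rng(4)

    data = rng.normal(mu, sigma, size=(200_000, n))
    xbar = data.mean(axis=1)
    s2 = data.var(axis=1, ddof=1)
    t_values = (xbar - mu) / np.sqrt(s2 / n)

    # Compare empirical quantiles with the t distribution with n-1 degrees of freedom
    for q in [0.05, 0.5, 0.95]:
        print(q, np.quantile(t_values, q), t_dist.ppf(q, df=n - 1))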

The probability distribution function for the student's t distribution can be calculated by starting with the PDFs of a gaussian and of a chi-squared and applying the laws of probabilistic transformation, but we will not do so here. A student's t distribution is shaped like a gaussian but has heavier tails. The standard interpretation of the heavier tails is that we are using the sample variance rather than the true variance in the statistic's definition, which adds additional "uncertainty" to the distribution.

F-Distribution

The F-Statistic is a random variable that can be generated from two independent Chi-Squared distributed variables. Given U_1, which follows a Chi-Squared distribution with d_1 degrees of freedom, and U_2 with d_2 degrees of freedom (with the two distributions independent), we define the F-Distribution (with degrees of freedom d_1 and d_2) as the distribution of the random variable F given by:

F = \frac{U_1/d_1}{U_2/d_2}

Equivalently, one can define an F-statistic from Gaussian distributions. Imagine we have a gaussian-distributed variable g_1 with standard deviation \sigma_1 and another independent gaussian-distributed variable g_2 with standard deviation \sigma_2, and we draw n_1 values from g_1 and n_2 values from g_2. Letting s_1^2 be the sample variance of the draws from g_1 and s_2^2 be the sample variance of the draws from g_2, the following quantity follows the F-Distribution (with n_1-1 and n_2-1 degrees of freedom):

F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}
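
The sketch below (using numpy and scipy; the parameter values are arbitrary) builds this ratio from simulated sample variances and compares its moments with those of the F distribution:

    import numpy as np
    from scipy.stats import f as f_dist

    sigma1, n1 = 2.0, 6
    sigma2, n2 = 5.0, 12
    rng = np.random.default_rng(5)

    # Sample variances of repeated draws from each gaussian
    s1_sq = rng.normal(0.0, sigma1, size=(200_000, n1)).var(axis=1, ddof=1)
    s2_sq = rng.normal(0.0, sigma2, size=(200_000, n2)).var(axis=1, ddof=1)
    F = (s1_sq / sigma1**2) / (s2_sq / sigma2**2)

    d1, d2 = n1 - 1, n2 - 1
    print("empirical mean:", F.mean(), "vs theory", f_dist.mean(d1, d2))   # d2/(d2-2)
    print("empirical var: ", F.var(),  "vs theory", f_dist.var(d1, d2))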

Note that the F distribution is NOT symmetric in terms of its parameters d_1 and d_2. The domain of F is from 0 to infinity.

The mean of an F distribution with d_1 and d_2 degrees of freedom is given by (for d_2 > 2):

\bar{F}_{d_1, d_2} = d_2 / ( d_2 - 2 )

and its variance is given by (for d_2 > 4):

var(F_{d_1, d_2}) = \frac{ 2 d_2^2 ( d_1 + d_2 - 2 ) } { d_1 ( d_2 - 2 )^2 ( d_2 - 4 ) }

The most common applications of the F-distribution are determining if adding variables to a regression improves it with statistical significance or determining if data divided into many groups comes from a single distribution across groups (the problem ANOVA is attempting to solve). We will discuss these in a later section.