Confidence Intervals

We previously discussed the concept of estimators as a way to find a value that approximates an unknown parameter. A drawback of using an estimator is that it doesn't provide a level of uncertainty on the estimate. A more general way to estimate the value of a parameter is by calculating a "confidence interval".

A confidence interval answers the question: "For a given measured dataset, what is the set of values of the parameter which are supported by the data (or, conversely, which values are rejected by the data)?" It does this in a mathematically rigorous, but somewhat confusing, way. It's worth pointing out that a confidence interval is only one of many things a person could calculate when attempting to estimate a parameter. It is a procedure, motivated by statistical intuition, that can be a useful tool when thinking about a problem or making a decision. But how one chooses to use data, or what one wants to calculate, depends on one's purposes, the level of rigor necessary, and one's philosophy. For now, let's assume that calculating a confidence interval is what we'd like to do here.


Formally, a confidence interval is a procedure for calculating an interval in parameter space based on observed data that has a fixed probability of containing the "true" value of the parameter. It is important to note that the interval is based on the data, meaning that the interval itself is a random variable (different data draws will produce different intervals). In a frequentist framework, the true value of a parameter is NOT a random variable: it is some fixed (but unknown) value. Therefore, a frequentist does not talk about the probability that the true value is in some fixed interval. Instead, they talk about the probability that an interval contains the true value. For any fixed interval, it either DOES contain the true value or DOES NOT.

The probability that the confidence interval contains the true value of the parameter of interest is known as the "size" of the interval, and is usually chosen in advance of calculating the interval. For example, a 95% confidence interval will contain the true value of a parameter 95% of the time (where the probability is over repeated draws of observed data from the model).

It isn't immediately clear how to translate the definition of a confidence interval into an actual calculation. The definition describes the properties that an interval must have, but gives no prescription for creating an interval with those properties. Let's walk through how to turn this description into a concrete procedure.

Neyman Construction

Calculating a confidence interval means coming up with a procedure that will generate an interval which contains the true value of the parameter of interest some pre-determined percentage of the time. The trouble is that, since we don't KNOW that true value, we must ensure our procedure works for ALL possible values of the parameter, knowing that ANY of them could be the true value. That's the challenge of a confidence interval. While this may initially seem daunting, it turns out that intervals with this property can be created "by construction" using a relatively simple method.

Let's assume we're working in the simple case where we have a single parameter $\theta$ described by a probability distribution function $p$:

$$p = p(\vec{x} \,|\, \theta)$$

The data $x$ may be multi-dimensional. However, we are free to choose a function of the data (and possibly the parameter of interest) whose value is one-dimensional. This is known as a test statistic: $t = t(\vec{x}, \theta)$ with $t \in \mathbb{R}$. We note that the test statistic may be defined in terms of the value of the parameter of interest for a hypothetical model. We will discuss the details of test statistics and how to choose a "good" test statistic later.

The procedure for producing a confidence interval is the following:

  • Choose a "test statistic" $t(x)$
  • For each possible value of the parameter $\theta$, determine the distribution of the test statistic: $p(t(x, \theta) | \theta)$.
  • For every value of $\theta$, using the distributions above, choose a window in the domain of the test statistic ($t \in [a, b]$) such that the total probability of that window is the size of the confidence interval: $\int_a^b p(t | \theta) dt = \alpha$. Record these windows $[a(\theta), b(\theta)]$ for each value of $\theta$.
  • Measure the data $x$ and calculate the measured value of the test statistic $t_0 = t(x)$
  • Find all values of the parameter $\theta$ such that $t_0$ falls in the windows $[a(\theta), b(\theta)]$ that we chose above. In other words, determine the set $\{\theta : a(\theta) < t_0 < b(\theta)\}$.
  • This set of values of $\theta$ forms the confidence interval on the parameter $\theta$ of size $\alpha$.

This procedure is known as the Neyman Construction. The set of all points that fall in the union of each window $[a(\theta), b(\theta)]$ for all values of the parameter $\theta$ is sometimes called the Confidence Band, as it traces out a 2-d sheet in parameter/data space.

Why does an interval, defined in this way, have the properties of the confidence interval that we desire (namely that it contains the true value of $\theta$ with probability $\alpha$)? Well, we start with the fact that there is only one TRUE value of $\theta$: $\theta_{true}$, which is of course unknown to the person performing the experiment. Given that fixed value of $\theta_{true}$, there is a true distribution of t: $p(t) = p(t | \theta_{true})$. In our procedure to generate confidence intervals, we picked windows $[a(\theta), b(\theta)]$ for all possible values of $\theta$ where the probability of t falling into that window is exactly $\alpha$. In particular, this is true of $\theta_{true}$.

The key point is that the set $\{\theta : a(\theta) < t_0 < b(\theta)\}$ will contain $\theta_{true}$ if and only if the measured value $t_0$ falls in the window $[a(\theta_{true}), b(\theta_{true})]$. But, by construction, $t_0$ falls into this window a fraction $\alpha$ of the time (because we defined the window to have total probability $\alpha$). Therefore, the confidence interval defined by $\{\theta : a(\theta) < t_0 < b(\theta)\}$ will contain the true value of the parameter $\theta_{true}$ a fraction of the time equal to $\alpha$. This is exactly the property we were looking to show, which proves that our procedure produces valid confidence intervals.
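The coverage property is easy to check empirically. Below is a minimal sketch (assuming, for concreteness, a gaussian model with known $\sigma$, the test statistic $t = x$, and a central ordering rule): we repeatedly draw data for some fixed true value, build the interval, and count how often it contains the truth. The helper name `covers` is purely illustrative.

```python
import random
from statistics import NormalDist

def covers(theta_true, alpha=0.95, sigma=1.0, rng=None):
    """Draw one dataset x ~ N(theta_true, sigma), form the central interval
    [x - z*sigma, x + z*sigma], and check whether it contains theta_true."""
    z = NormalDist().inv_cdf((1 + alpha) / 2)  # half-width in units of sigma
    x = rng.gauss(theta_true, sigma)
    return x - z * sigma <= theta_true <= x + z * sigma

rng = random.Random(0)
n_trials = 2000
# Whatever fixed true value we pick, ~95% of the random intervals cover it:
coverage = sum(covers(3.7, rng=rng) for _ in range(n_trials)) / n_trials
```

Changing the fixed true value (3.7 above) leaves the observed coverage near $\alpha$, which is precisely the "by construction" guarantee.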

There is one loose end remaining. When picking the window $[a, b]$ for a given value of $\theta$ (knowing the distribution $p(t | \theta)$), we required that it must have a total probability of $\alpha$. However, there are many windows $[a, b]$ that contain that total probability (infinitely many for a continuous test statistic). Any such choice will produce valid confidence intervals, as our proof above didn't depend on any particular choice of $[a, b]$. One could then wonder: given a distribution, which particular range $[a, b]$ should one choose (given that it must contain a total probability of $\alpha$)?

To help inform the answer, recall that the goal of a confidence interval is to give bounds that are useful in considering the true value of a parameter. Intuitively, the smaller those bounds, the more useful they are in constraining that parameter. One possible criterion, then, is to prefer the smallest intervals. It is relatively straightforward to show that the intervals containing the highest values of $p(t | \theta)$ will be the shortest (the higher the values of $p(t | \theta)$ you include, the fewer values of $t$ you must integrate over to reach $\int_a^b p(t | \theta) dt = \alpha$). As an alternative, one may choose intervals that are symmetric around some value (though this will not always be possible). One may also choose to start from $-\infty$ and integrate until the distribution reaches $\alpha$. There are many choices, and any particular choice should be considered in the context of the problem you are solving or the type of statistical statement you are trying to make. The choice that one makes here is known as an "ordering rule", since it determines the order in which you move through the space of data while accumulating points to determine the interval $[a, b]$. Different choices will produce different confidence intervals with different properties. We will discuss the details of this later.

The Neyman Construction, as described above, is in many ways an extension of the Neyman-Pearson approach to the Simple Hypothesis problem, discussed earlier. One can define properties of confidence intervals similar to those of hypothesis tests:

  • A "Type 1" error in the context of a confidence interval is when the confidence interval does not contain the true parameter. From the definition of the confidence interval, it follows that the probability of a "Type 1" error is one minus the "size" of the confidence interval, or $1 - \alpha$.
  • Power, however, doesn't have a direct analog in the context of Confidence Intervals, unless one fully specifies how a confidence interval is to be used to perform a hypothesis test. This is because there is no single alternate hypothesis, but many different alternate models (one for every possible value of our parameter, $\theta$). Any confidence interval will contain an infinite number of parameter values $\theta$ that are not equal to $\theta_{\text{true}}$. So, in one sense, every confidence interval makes a "Type 2" error by including incorrect parameter values. But, intuitively, confidence intervals that are thinner for many possible values of $\theta_{\text{true}}$ are "better", and one can interpret the width of the confidence interval as a loose analog of "power".

Continuing with comparison to hypothesis tests, one can interpret the Neyman procedure for producing confidence intervals as finding the set of all parameter points that are not rejected when performing a p-value test on the model associated with that parameter point (given the observed data). In this sense, a confidence interval can be thought of as an "inverted hypothesis test".

Let's show that this gives us the same procedure as described above. Imagine that we measure a value of our test statistic $t_0$ and we want to perform a p-value test on the model defined by $\theta$ using a threshold of $1-\alpha$. We would reject the parameter point $\theta$ if the integral of the probability of $t$ over points more extreme than $t_0$ is less than $1-\alpha$. This is equivalent to asking, "Does the measured value $t_0$ fall inside a window of total probability $\alpha$?" (where that window corresponds to the region of data that is not considered extreme). This region will correspond to a slice of the Confidence Band if the band is defined with an ordering rule that matches the definition of "extreme" used in the p-value test.

For example, creating one-sided confidence intervals (starting from $-\infty$) is equivalent to performing a one-sided p-value test on all possible values of the parameter given some measured dataset.

Calculating Intervals

The above procedure relied on being able to do the following:

  • Determine the distribution $p(t | \theta)$ for all values of $\theta$
  • Integrate that distribution and find values $[a, b]$ such that the integral over that window equals $\alpha$

If $\theta$ is a continuous parameter, this may seem to be a daunting task. One could calculate this in a "brute-force" way by picking discrete values of $\theta$ and, for each value in this grid, generating values of x using the model $p(x | \theta)$ and then building up a distribution for $p(t | \theta)$. One could then use computational techniques to integrate this distribution and obtain windows $[a, b]$ for each value of $\theta$, which would then be inverted to form confidence intervals.
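As a sketch of this grid-and-invert logic, here is a minimal version for a gaussian model with known $\sigma$ (simplified by using the known analytic form of $p(t \,|\, \theta)$ and a central ordering rule, rather than generating samples; the helper name `neyman_interval` is illustrative):

```python
from statistics import NormalDist

def neyman_interval(t0, alpha=0.95, sigma=1.0):
    """Brute-force Neyman construction for the mean of x ~ N(theta, sigma).

    For each theta on a grid, build the central window [a(theta), b(theta)]
    containing probability alpha, then keep every theta whose window
    contains the observed test statistic t0."""
    theta_grid = [i * 0.001 for i in range(-10000, 10001)]  # theta in [-10, 10]
    z = NormalDist().inv_cdf((1 + alpha) / 2)  # central window half-width / sigma
    accepted = [th for th in theta_grid
                if th - z * sigma <= t0 <= th + z * sigma]
    return min(accepted), max(accepted)

# By construction this reproduces the familiar central interval t0 +/- 1.96*sigma:
lo, hi = neyman_interval(t0=0.0, alpha=0.95)
```

In a real brute-force application, the analytic window $[\theta - z\sigma, \theta + z\sigma]$ would be replaced by windows estimated from Monte Carlo samples of $t$ at each grid point.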

Thankfully, there are a number of examples with properties that allow one to calculate a confidence interval without resorting to a brute-force procedure. For example, if one can choose a test statistic $t(x, \theta)$ such that the distribution $p(t(x, \theta) | \theta)$ is independent of $\theta$, then one does not need to pick discrete points in $\theta$ and generate intervals $[a, b]$ separately for each point. One can do this once, since the distribution $p(t(x, \theta) | \theta)$, and therefore the intervals $[a, b]$, are independent of $\theta$. Further, if one can analytically integrate $p(t(x, \theta) | \theta)$ (or at least leverage well-studied lookup tables), then one doesn't need to perform numerical integrals oneself. Models that satisfy these criteria are therefore much simpler to work with. In fact, MOST examples of confidence intervals that appear in textbooks have this property, which is nice, but I find that it hides the more general procedure that one needs to follow to calculate confidence intervals for arbitrary models.

Gaussian Example

As a simple but useful example, let's calculate confidence intervals for the parameter $\mu$ of a 1-d gaussian distribution (assuming that $\sigma$ is known and fixed).

To start, we need to determine our test statistic and our ordering rule. For the simple 1-d case, it makes sense simply to take the value of $x$ as our test statistic: $t(x) = x$. And, since this pdf is symmetric around $\mu$, it seems reasonable to choose windows that are symmetric around $x = \mu$. This is also equivalent to picking intervals of minimum width (by including the highest values of $p(x | \mu)$ first).

We then have to find a window $[a, b]$ in $x$, for every value of $\mu$, such that:

$$\int_a^b gauss(x, \mu, \sigma) \, dx = \alpha$$

Since we want the interval $[a, b]$ to be symmetric about $x=\mu$, we can re-write this as the following:

$$\int_{\mu - \delta}^{\mu + \delta} gauss(x, \mu, \sigma) \, dx = \alpha$$

where we assume the general case that we will need a different value of $\delta$ for every possible value of $\mu$. The integral of a gaussian is a well-studied quantity, and we can express it in terms of the well-known error function. Let's assume that:

$$\delta = \Delta \cdot \sigma$$

with $\Delta$ an unknown constant, and perform a change of variables:

$$z = \frac{x - \mu}{\sigma}$$

giving us

$$\int_{-\Delta}^{\Delta} gauss(z, 0, 1) \, dz = \alpha$$

The dependence on $\mu$ has dropped out, leaving us with the equation

$$\text{erf}\left(\frac{\Delta}{\sqrt{2}}\right) = \alpha$$

which we can invert to get:

$$\Delta = \sqrt{2} \, \text{erf}^{-1}(\alpha)$$

Our confidence interval is therefore:

$$\mu \in [x - \Delta\sigma, \; x + \Delta\sigma]$$

The key to being able to calculate this analytically was that the dependence on $\mu$ dropped out after a well-chosen change of variables. In fact, we could have saved ourselves a lot of trouble had we defined our test statistic from the beginning as:

$$t(x, \mu) = \frac{x - \mu}{\sigma}$$

since we already know that the distribution of this variable is independent of both $\mu$ and $\sigma$. This emphasizes an important point: choosing the right test statistic can make a confidence interval calculation much easier.
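A minimal numerical version of this result, using the fact that $\sqrt{2}\,\text{erf}^{-1}(\alpha)$ is the $(1+\alpha)/2$ quantile of a standard normal (the function name is illustrative):

```python
from statistics import NormalDist

def gaussian_interval(x, sigma, alpha):
    """Central confidence interval for the mean mu of a gaussian, known sigma.

    Inverting erf(Delta / sqrt(2)) = alpha gives Delta = sqrt(2) * erfinv(alpha),
    which is the (1 + alpha)/2 quantile of a standard normal distribution."""
    delta = NormalDist().inv_cdf((1 + alpha) / 2)
    return x - delta * sigma, x + delta * sigma

# alpha = 0.6827 gives Delta ~ 1, i.e. the familiar "1 sigma" interval [3, 7]:
lo, hi = gaussian_interval(x=5.0, sigma=2.0, alpha=0.6827)
```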

Test Statistics

As seen above, an important step in building confidence intervals is the choice of the test statistic. A test statistic is a value that can be calculated as a function of a given dataset and a given model (or model parameters) associated with that dataset. In particular, a test statistic may be defined in terms of assumed values of parameters. As we saw in the Gaussian example, defining it in terms of the parameter of interest can give it useful properties, specifically that its distribution is independent of the parameter of interest. While some references present the definition of a test statistic as a function of the data only, doing so seems to contradict the clearly useful technique of including assumed values of parameters of interest, so we will not present the definition so strictly here.

A test statistic is useful for inferring the value of a parameter of interest if:

  • It can be easily calculated
  • Its distribution (given a fixed model or model parameters) is known
  • It or its distribution depends on the parameter of interest
  • Its distribution does not depend on any unknown (nuisance) parameters

For a given model, there are many possible test statistics. So how should one go about picking the "best" test statistic, and what does it even mean for a test statistic to be good? Intuitively, a test statistic is good if its distribution varies dramatically under different values of the parameter(s) of interest. This allows one to use the test statistic to determine which model is most likely the "true" model (again, this loose language will be made more formal later).

There are a number of types of test statistics which have useful properties for inference.

A "pivotal quantity" is a test statistic whose distribution does not depend on the parameter of interest. As noted above, the most common example is for a gaussian distribution, where the distribution of the quantity:

$$q = \frac{x - \mu}{\sigma}$$

is $gauss(q, 0, 1)$ and does not depend on $\mu$. As exemplified above, if one can define a pivotal quantity in terms of a parameter of interest (so that the quantity itself depends on the parameter, but its distribution does not), then calculating confidence intervals is easy.

A "sufficient statistic" is a test statistic that, for the purpose of inferring a given parameter, contains as much information as the raw data. Specifically, a statistic $t$ is sufficient if, given fixed values of $t$ and the parameter of interest $\mu$, the distribution of the data $x$ does not depend on $\mu$. A statistic is sufficient if and only if the likelihood can be factorized as:

$$L(x \,|\, \mu) = h(x) \, g(t(x), \mu)$$

where x is the data, $\mu$ is the parameter of interest, and $t(x)$ is our test statistic. This means that the part of the likelihood that depends on the parameter of interest $\mu$ only depends on x via $t(x)$. Intuitively, the $h(x)$ term gives us no information about $\mu$, so we can ignore it. In other words, for the purpose of performing inference on the $\mu$, one does not lose any information by summarizing the data with the value of the test statistic $t$.

As an example, consider the gaussian distribution for a set of iid measurements $x_1 \ldots x_N$. We can write the likelihood as:

$$L(x_1 \ldots x_N \,|\, \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

but we can also write it as:

$$L(x_1 \ldots x_N \,|\, \mu, \sigma) = \left(\frac{1}{\sqrt{2\pi}\sigma}\right)^N \exp\left(-\frac{N(\bar{x} - \mu)^2}{2\sigma^2}\right) \exp\left(-\frac{\sum_i (x_i - \bar{x})^2}{2\sigma^2}\right)$$

If our parameter of interest is $\mu$, then we can see that only the first exponential depends on $\mu$, and it depends on the data $x$ only via $t = \bar{x}$. Therefore, if we are performing inference on $\mu$, it is sufficient to know only $\bar{x}$.
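The sufficiency of $\bar{x}$ can be checked numerically: for two datasets with the same size and the same mean, log likelihood *differences* across values of $\mu$ are identical, because the $\mu$-dependent factor depends on the data only through $\bar{x}$ (a quick sketch; the helper `gaussian_loglike` is hypothetical):

```python
from math import log, pi

def gaussian_loglike(data, mu, sigma=1.0):
    """Log likelihood of iid gaussian data with mean mu and known sigma."""
    n = len(data)
    return (-n / 2 * log(2 * pi * sigma**2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma**2))

d1 = [1.0, 2.0, 3.0]   # mean = 2.0
d2 = [0.0, 2.0, 4.0]   # same mean and size, different spread
# The spread only shifts the likelihood by a mu-independent constant, so
# differences across mu agree exactly for the two datasets:
diff1 = gaussian_loglike(d1, 2.5) - gaussian_loglike(d1, 1.0)
diff2 = gaussian_loglike(d2, 2.5) - gaussian_loglike(d2, 1.0)
```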

More generally, when inferring parameters $\mu_1 \ldots \mu_i$, if I can write the likelihood as:

$$L(x \,|\, \mu_1 \ldots \mu_i) = h(x) \, g(t_1(x) \ldots t_i(x), \; \mu_1 \ldots \mu_i)$$

then the set $\{t_1 \ldots t_i\}$ are sufficient statistics for inferring $\mu_1 \ldots \mu_i$.

In contrast, an "ancillary statistic" when considering a parameter $\mu$ is a function of the data whose distribution does not depend on the parameter $\mu$. In other words, measuring the value of an ancillary statistic doesn't give you any power to constrain the true value of the parameter $\mu$.

Approximate Confidence Intervals

As demonstrated above, confidence intervals for some models can be calculated exactly, since the model's pdf can be inverted analytically (or, at least, the inversion can be approximated arbitrarily well by a computer or a table). However, this is not the case for many real-world problems. Fortunately, there are useful approximations that allow one to calculate approximate confidence intervals for a wide range of problems.

Approximate Confidence Intervals of Maximum Likelihood Estimators (Wald Tests)

When creating a confidence interval for a maximum likelihood estimator, one can obtain an approximate interval by using the asymptotic distribution of MLEs.

Recall that, for a pdf representing iid data, the MLE of a parameter, for large N, is approximately gaussian distributed:

$$p(\hat{\theta}) \approx gauss(\hat{\theta}, \theta, \sigma_\theta)$$

where $\sigma_\theta$ is related to the Fisher Information of the true distribution:

$$\sigma_\theta = \frac{1}{\sqrt{N \cdot FI(\theta)}}$$

If we knew $FI(\theta)$, then we could use our knowledge of the gaussian distribution to create asymptotically accurate confidence intervals. However, in most cases, one doesn't know the true distribution, and therefore one cannot exactly calculate $FI(\theta)$. One can instead estimate the Fisher Information by calculating it using estimates of unknown parameters obtained from the measured data: $FI(\hat{\theta})$. This process of estimating unknown parameters of a distribution from an observed dataset and using that distribution to perform inference is known as performing a "Wald Test".

In this case, one can use the measured Fisher Information to calculate the approximate shape of the distribution of the $\hat{\theta} - \theta$, which we know will follow a gaussian distribution asymptotically. One can then use the known properties of the gaussian distribution to perform inference.

One often performs a Wald Test by considering the normalized squared quantity: $(\hat{\theta} - \theta)^2 / \sigma_\theta^2$. Since $\hat{\theta} - \theta$ approximately follows a gaussian, $(\hat{\theta} - \theta)^2 / \sigma_\theta^2$ will follow a Chi-Squared distribution with 1 degree of freedom. One can use the known properties of the Chi-Squared distribution to perform inference.

In general, the Wald test uses a rough approximation and often yields inaccurate results. In particular, if one uses an approximation to the variance of the MLE estimator, the rate of convergence of the approximation will decrease, yielding a less accurate result. However, the simplicity of the test, and the fact that it can be quickly calculated, makes it commonly used.

An important point to note is that these intervals will be symmetric around the MLE, as they assume that the distribution of the MLE is a gaussian, which is a symmetric distribution.
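As a concrete sketch, here is a Wald interval for a Bernoulli success probability $p$, where the per-observation Fisher Information is $FI(p) = 1/(p(1-p))$ and the unknown true $p$ is replaced by its MLE (the function name is illustrative):

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(k, n, alpha=0.95):
    """Wald interval for a Bernoulli success probability p from k successes in n trials.

    Per-observation Fisher information is FI(p) = 1 / (p(1-p)), so the
    asymptotic standard deviation of the MLE p_hat = k/n is
    sigma = 1 / sqrt(n * FI(p)).  The Wald test plugs the estimate p_hat
    into FI, then treats p_hat as gaussian-distributed."""
    p_hat = k / n
    sigma = sqrt(p_hat * (1 - p_hat) / n)   # = 1 / sqrt(n * FI(p_hat))
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    return p_hat - z * sigma, p_hat + z * sigma

# Symmetric around p_hat = 0.4, as Wald intervals always are:
lo, hi = wald_interval(k=40, n=100, alpha=0.95)
```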

Approximation via Wilks Theorem

Another approximation that is commonly used to obtain confidence intervals is a result known as Wilks' Theorem, which gives an asymptotic distribution for a quantity known as the likelihood ratio. Statistical tests using this result are often referred to as Likelihood Ratio tests.

Consider the likelihood of a single parameter model, $L(x | \theta)$ (the data $x$ may be multi-dimensional or arbitrarily complex). As above, we define $\hat{\theta}$ as the value of $\theta$ that maximizes the likelihood function. We now define the likelihood ratio as the following function:

$$lr(x, \theta) = \frac{L(x \,|\, \theta)}{L(x \,|\, \hat{\theta})}$$

It is a function of the parameter $\theta$ and can be thought of as the ratio between the likelihood's value at any point $\theta$ and its maximum value. It is important to note that the likelihood ratio $lr(\theta)$ is itself a random variable: for a fixed model and a fixed choice of $\theta$, the value of the likelihood ratio will vary across different datasets $x$ (as will the best fit $\hat{\theta}$ in $L(x | \hat{\theta})$).

We will also define the negative log likelihood ratio as:

$$nllr(x, \theta) = -\log\left(lr(x, \theta)\right) = \log L(x \,|\, \hat{\theta}) - \log L(x \,|\, \theta)$$

In the case of a likelihood describing N iid samples (with large N), a result known as Wilks' Theorem tells us that twice the $nllr$, evaluated at the true (unknown) value $\theta=\theta_{true}$ (and assuming that the model associated with $\theta_{true}$ is true), follows a Chi-Squared distribution:

$$2 \cdot nllr(x, \theta_{true}) \sim \chi^2_m$$

where $m$, the number of free parameters, is here equal to 1 because $\theta$ is the only unknown parameter. In other words, assuming $\theta_{true}$ is the true value of $\theta$, if you generate datasets using the distribution $p(x | \theta_{true})$ and calculate the $nllr$ at $\theta_{true}$ on each of those datasets, the values $\{2 \cdot nllr(x_1, \theta_{true}), 2 \cdot nllr(x_2, \theta_{true}), \ldots\}$ will be distributed as $\chi^2_1$. Note that this Chi-Squared distribution does NOT depend on the value of $\theta_{true}$ that we chose (this remarkable result is why Wilks' theorem is useful).

Thus, we can use the function $t = 2 \cdot nllr(x, \theta)$ as a test statistic to generate confidence intervals on $\theta$. The fact that the distribution of the test statistic under the assumption $\theta=\theta_{true}$ doesn't depend on the value of $\theta_{true}$ makes $nllr(x, \theta)$ a pivotal quantity (asymptotically). Therefore, we can use the $\chi^2$ distribution directly to look up confidence intervals using the following procedure:

  • Pick a size $\alpha$
  • Using the $\chi^2_1$ distribution, find the value $t_{\text{critical}}$ such that $\int_0^{t_{\text{critical}}} \chi^2_1(t) \, dt = \alpha$, and set $nllr_{\text{critical}} = t_{\text{critical}} / 2$
  • Invert the condition $nllr(\theta) < nllr_{\text{critical}}$ to find all values of $\theta$ that are not rejected

These values of $\theta$ form the confidence region. This works if one can invert the $nllr$ to find the values of the parameter $\theta$ such that $nllr(\theta) = nllr_{\text{critical}}$. One can do this inversion analytically (if one knows the functional form of $nllr(\theta)$). Or, if one does not have access to the analytic form of $nllr$, or if it is not readily invertible, one can simply scan over values of $\theta$ to find a $\theta_{\text{critical}}$ such that $nllr(\theta_{\text{critical}}) = nllr_{\text{critical}}$.

This procedure leads to a nice trick. Because $\int_0^1 \chi^2_1(t) \, dt = 0.683$, we know the critical value of the test statistic $t = 2 \cdot nllr$ for a 68.3% confidence interval is 1.0. Therefore, the bounds of the confidence region are those values of $\theta$ for which

$$nllr(\theta) = \log L(x \,|\, \hat{\theta}) - \log L(x \,|\, \theta) = \frac{1}{2}$$

Using the above, we can directly see that the boundaries of the 68.3% confidence interval are the values of $\theta$ at which the log likelihood has dropped by $\frac{1}{2}$ from its maximum: $\log L(x | \hat{\theta}) - \log L(x | \theta_{\text{critical}}) = \frac{1}{2}$. Thus, to find a 68.3% confidence interval for the parameter $\theta$, simply find the value of $\theta$ that maximizes the likelihood (where the $nllr$ is 0), and scan along values of $\theta$ in each direction until the log likelihood drops by $\frac{1}{2}$. Those values bound the confidence interval.
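This scan is easy to implement. Below is a sketch for the simple case of a gaussian mean with known $\sigma$, where the $nllr$ has the closed form $N(\bar{x} - \mu)^2 / (2\sigma^2)$ and the scan should recover the analytic answer $\bar{x} \pm \sigma/\sqrt{N}$ (function names are illustrative):

```python
from math import sqrt

def nllr(mu, data, sigma=1.0):
    """Negative log likelihood ratio for the mean of a gaussian (known sigma).
    For iid data this is N * (xbar - mu)^2 / (2 sigma^2), zero at mu_hat = xbar."""
    n = len(data)
    xbar = sum(data) / n
    return n * (xbar - mu) ** 2 / (2 * sigma**2)

def scan_interval(data, sigma=1.0, threshold=0.5, step=1e-4):
    """Scan outward from the MLE until the nllr crosses the threshold
    (0.5 for a 68.3% interval, since 2 * nllr ~ chi-squared with 1 dof)."""
    xbar = sum(data) / len(data)
    hi = xbar
    while nllr(hi, data, sigma) < threshold:
        hi += step
    lo = xbar
    while nllr(lo, data, sigma) < threshold:
        lo -= step
    return lo, hi

data = [1.0, 2.0, 3.0, 4.0, 5.0]
lo, hi = scan_interval(data)  # analytically, xbar +/- sigma / sqrt(N)
```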

Approximations using the Profile Likelihood Ratio

Above, we used Wilks' Theorem to generate confidence intervals using an asymptotic distribution of the likelihood ratio. We did so in the case of a single parameter of interest $\theta$. Often, however, one has a model with many parameters, all of which are unknown, but one may only be interested in obtaining confidence intervals for one of those parameters. We refer to the single parameter that we'd like to perform inference on as the Parameter of Interest, and all other unknown parameters are called Nuisance Parameters. We can write our likelihood in terms of these:

$$L = L(x \,|\, \theta, \vec{\nu})$$

where $x$ is our data, $\theta$ is our parameter of interest, and $\vec{\nu}$ represents our nuisance parameters. We cannot directly use Wilks theorem above to obtain confidence intervals on $\theta$ if we don't know or assume the values of the nuisance parameters $\vec{\nu}$.

The maximum likelihood estimators for this likelihood are denoted $\hat{\theta}$ and $\hat{\vec{\nu}}$. These represent the values of $\theta$ and $\vec{\nu}$ that simultaneously and unconditionally maximize $L$. Imagine, instead, that we were to fix $\theta$ to some value $\theta_0$ and then maximize the likelihood over $\vec{\nu}$ conditional on $\theta=\theta_0$. The values of the nuisance parameters that maximize this conditional likelihood may differ from the values that maximize the unconditional likelihood. We denote them $\hat{\hat{\nu}}(\theta_0)$, noting that they depend on the chosen value $\theta_0$.

Given that definition, we define a quantity known as the profile likelihood ratio as the following:

$$\lambda(\theta) = \frac{L(x \,|\, \theta, \hat{\hat{\nu}}(\theta))}{L(x \,|\, \hat{\theta}, \hat{\vec{\nu}})}$$

In other words, we do the following:

  • Find the global maximum of the likelihood across all parameters
  • Fix $\theta$ to be some test value and find the values of $\vec{\nu}$ that maximize the likelihood given that test point
  • Calculate the likelihood for the test $\theta$ and $\hat{\hat{\nu}}$, the values that maximize the likelihood over the nuisance parameters at the test point $\theta$
  • Take the ratio of these two quantities.

When there are no nuisance parameters $\nu$, this simply reduces to the standard likelihood ratio.
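For a gaussian with mean $\mu$ (parameter of interest) and unknown $\sigma$ (nuisance parameter), the profiling can be done in closed form, which makes for a compact sketch of the steps above (the helper name is illustrative):

```python
from math import log

def profile_nll2(mu, data):
    """-2 log(lambda(mu)) for gaussian data with mean mu (parameter of
    interest) and unknown sigma (nuisance parameter).

    The conditional MLE at fixed mu is sigma_hat_hat(mu)^2 = mean((x - mu)^2),
    and the ratio collapses to n * log(S(mu) / S(mu_hat)),
    where S(mu) = sum((x - mu)^2)."""
    n = len(data)
    xbar = sum(data) / n                          # unconditional MLE of mu
    s_mu = sum((x - mu) ** 2 for x in data)       # n * sigma_hat_hat(mu)^2
    s_hat = sum((x - xbar) ** 2 for x in data)    # n * sigma_hat^2
    return n * log(s_mu / s_hat)

data = [1.0, 2.0, 3.0, 4.0, 5.0]
at_mle = profile_nll2(3.0, data)   # zero at the unconditional MLE
away = profile_nll2(4.0, data)     # positive away from it
```

Scanning `profile_nll2` for the values of $\mu$ where it crosses the $\chi^2_1$ critical value then gives the confidence interval, exactly as in the single-parameter likelihood ratio scan.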

An extension of Wilks' theorem states that the distribution of $-2 \log(\lambda(\theta))$, given that $\theta$ is the true value that generated the data in question, asymptotically converges to a Chi-Squared distribution:

$$-2 \log(\lambda(\theta)) \sim \chi^2_s$$

or simply

$$-2 \log(\lambda(\theta)) \sim \chi^2_1$$

where $s$ is the number of parameters of interest, here only 1 (corresponding to $\theta$). The important point is that the distribution of $\lambda$ is independent of the true values of the nuisance parameters. Thus, the profile likelihood ratio is a pivotal quantity whose distribution we know (asymptotically). Therefore, we can invert it to create confidence intervals on the parameter of interest.

This is a pretty remarkable result. The likelihood in question could be arbitrarily complicated and contain many nuisance parameters, and yet (given enough data) we can find a distribution that depends only on the parameter of interest and that has a simple, well-studied analytic form.

In contrast to the Wald intervals above, the intervals obtained from the profile likelihood ratio may be asymmetric, depending on the actual form of the likelihood. More generally, they will better respect the actual bounds of the likelihood (since they are not making a local-gaussian assumption).