Kalman Filter
A Kalman Filter is a valuable tool for many real-world problems, including modeling time-series data and predicting the movement of physical objects or agents in an environment. The goal of this post is to derive the logic and mathematics behind the Kalman Filter from first principles, using only elementary statistics and linear algebra.
At its heart, like many data tools, a Kalman Filter is nothing more than a statistical model that relates observed data to unknown parameters that are to be estimated. However, unlike simpler models that are built around a single data measurement, a Kalman Filter is designed to be run multiple times using a stream of data measurements, each obtained at a different time.
Simple Example
To understand the logic of a Kalman filter, we will start by deriving a simple version from a basic premise. Imagine we have a car that is confined to a one-dimensional track (it may be on rails, for example). The car moves frictionlessly with a constant (but initially unknown) velocity along the rails. At specific time intervals, we'd like to know the position of the car. To allow us to do so, the car has a radar system which we can query. The radar system at time $t_i$ returns an estimate of the car's position at that time, but that position estimate has an uncertainty associated with it (which is fixed and known).
We would like to be able to predict the car's position and velocity at any given time using the measurements we've obtained up through that time. As new measurements come in, we build up a better understanding of where the car is, where it was, and where it is going. Naively, if the measurements were exact, we would only need two of them to pin down the car's position and velocity. But since each measurement has uncertainty, we can't learn the car's true position or velocity from a single measurement (or, indeed, from any number of measurements). Instead, by gathering more measurements, we can get better and better estimates of its true position and velocity through time.
The key point, the one that allows us to improve our estimates over time with subsequent measurements, is the fact that we know the dynamics of the car. We know that the car moves with constant velocity, and knowing this allows us to relate previous measurements (at earlier positions) to measurements of our current position. This means that we can use all of our past measurements to understand the current position of the car, which means our understanding of the position will become more accurate over time (as we accrue more and more measurements). If we didn't have a model for how the car moves, previous measurements and the estimates produced from them wouldn't be useful for estimating the position at future times.
To make this concrete, let's introduce some specific variables and notation to tackle this problem. There are only two variables that describe the state of the car: its position at time $t_i$, which we will denote $d_i$, and its velocity at the same time, which we will denote $v_i$. Collectively, these two data points represent the "state" of the car, which we will denote as:
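$$\vec{x_i} = \begin{bmatrix} d_i \\ v_i \end{bmatrix}$$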
(Note that we use "x" to represent the full state, not just the position).
All the measurements that we're going to make are "noisy", which means we'll never be able to say with certainty what the exact state of the car is. Instead, the best we can do is to build up a distribution of the car's state (and hope that the distribution becomes more accurate as we make further measurements). Thus, our procedure boils down to:
- Start with some initial distribution representing our knowledge of the state of the car
- Make a measurement
- Update our distribution for the car's current state
- Infer how that state evolves over the time before the next measurement
- Repeat iteratively at each time step
Let us denote our understanding of the state of the car with a probability distribution $p_i(\vec{x_i})$. In other words, at any time $t_i$, we have some probability distribution function $p_i$ that describes what we think the distribution of the car's state, $\vec{x_i}$, at that time may be. Note that this is a 2-d distribution that describes both the position and velocity of the car.
Here we will make our first simplifying assumption: we will assume that our knowledge of the car's state is initially a gaussian distribution:
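$$p_0(\vec{x_0}) = \mathcal{N}\left(\vec{\mu_0}, \overleftrightarrow{\Sigma_0}\right)$$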
This should be thought of as a 2-dimensional gaussian (or a multi-dimensional gaussian in the general case) with means $\vec{\mu_0}$ and uncertainties/correlations $\overleftrightarrow{\Sigma_0}$. (Note that $\overleftrightarrow{\Sigma_0}$ is a matrix and therefore may include off-diagonal terms, which represent correlations across state variables.) We will later show that, after we make measurements, the distribution of the car's state will remain a gaussian, which we will write as:
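$$p_i(\vec{x_i}) = \mathcal{N}\left(\vec{\mu_i}, \overleftrightarrow{\Sigma_i}\right)$$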
Because the distribution remains a gaussian after measurements, we can fully represent our knowledge of the state of the car using only the vector $\vec{\mu_i}$ and the matrix $\overleftrightarrow{\Sigma_i}$. Thus, we can reduce the entire problem to determining how to update $\vec{\mu_i}$ and $\overleftrightarrow{\Sigma_i}$ at new times and with new measurements. The question then becomes, "If we think the state of the car at time $t_i$ is a gaussian distribution described by $\vec{\mu_i}$ and $\overleftrightarrow{\Sigma_i}$ and we make a new measurement at time $t_{i+1}$, what are the updated values $\vec{\mu}_{i+1}$ and $\overleftrightarrow{\Sigma}_{i+1}$ that represent the new distribution of the state of the car?"
There are two ways that the car's state will change between time $t_i$ and $t_{i+1}$:
- The car will move based on the known dynamics of the system (drifting along the track with constant velocity)
- The new measurement at time $t_{i+1}$ will provide additional information about the car's position
We must account for both of these when updating our knowledge of the state's distribution. For the update due to the car's dynamics, we know the state evolves in a deterministic way:
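$$d_{i+1} = d_i + v_i \, (t_{i+1} - t_i)$$
$$v_{i+1} = v_i$$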
which, using our state-matrix notation, we can represent as a transition matrix:
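$$\vec{x}_{i+1} = \overleftrightarrow{T} \vec{x_i}, \qquad \overleftrightarrow{T} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}$$
where $\Delta t = t_{i+1} - t_i$ is the time between measurements.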
This transition matrix tells us how to update a single state $\vec{x}_i$, but we can also use it to update a state vector distribution. Remember, since we are assuming that our state distribution is a gaussian, we only need to figure out how to update the means and covariance matrix of the gaussian based on the car's dynamics (as described by the transition matrix $\overleftrightarrow{T}$). The update rule follows from the standard result for linear transformations of gaussian random variables and is given by:
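$$\vec{\mu}_{i+1} = \overleftrightarrow{T} \vec{\mu_i}, \qquad \overleftrightarrow{\Sigma}_{i+1} = \overleftrightarrow{T} \, \overleftrightarrow{\Sigma_i} \, \overleftrightarrow{T}^T$$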
After taking into account how the system evolves based on its underlying dynamics, we then make a measurement and have to update the system using that new information. We will make another key assumption here: that the uncertainties on the measurement are gaussian (with a mean of 0 and a known covariance matrix). Specifically, let's represent the measured state at time $t_{i+1}$ as $\tilde{\vec{x}}$ (the tilde represents a measured value), and let's assume that this measurement has an uncertainty/covariance given by the constant (known) matrix $\overleftrightarrow{\sigma}$. We assume that, given a true underlying state $\vec{x}$, the measured state follows the distribution:
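$$p(\tilde{\vec{x}} \,|\, \vec{x}) = \mathcal{N}\left(\vec{x}, \overleftrightarrow{\sigma}\right)$$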
In other words, the measured state values, $\tilde{\vec{x}}$, are a gaussian around the true (but unknown) state values $\vec{x}$. So, given this measurement, how do we update our knowledge of the state? We can do so using Bayes' theorem, which tells us that:
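$$p(A \,|\, B) = \frac{p(B \,|\, A) \, p(A)}{p(B)}$$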
or
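$$p(A \,|\, B) \propto p(B \,|\, A) \, p(A)$$
since the denominator $p(B)$ is just a normalization constant that does not depend on $A$.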
Applied to our specific problem, Bayes' theorem says that we update the state of the system as:
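$$p(\vec{x} \,|\, \tilde{\vec{x}}) \propto p(\tilde{\vec{x}} \,|\, \vec{x}) \, p(\vec{x})$$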
where $p(\tilde{\vec{x}} | \vec{x})$ is the probability of measuring the state $\tilde{\vec{x}}$ given the true state is $\vec{x}$ (which is a gaussian distribution based on our assumption above) and $p(\vec{x})$ is the "prior" probability of the state $\vec{x}$. This prior probability is given by the distribution of the car's state at time $t_i$, updated using the dynamics of the car. This is the key point: when incorporating the measurement into our knowledge of the system at time $t_{i+1}$, we take our previous best knowledge of the system at time $t_i$, update it using the dynamics that happen between $t_i$ and $t_{i+1}$, and then update THAT distribution using our measurement.
Because both the distribution of the state and the measurement errors are gaussian, we can represent this two-step update process by its effect on the mean and covariance matrix of our state distribution. We showed above how to update the system's state distribution based on its underlying dynamics between timesteps. The second step, incorporating the measurement using Bayes' theorem, can also be done analytically using the gaussian assumptions.
A standard result for multivariate gaussians tells us that the product of two gaussian distributions is proportional to a single gaussian distribution. In other words, if:
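$$p_1(\vec{x}) = \mathcal{N}\left(\vec{\mu_1}, \overleftrightarrow{\Sigma_1}\right)$$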
and
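$$p_2(\vec{x}) = \mathcal{N}\left(\vec{\mu_2}, \overleftrightarrow{\Sigma_2}\right)$$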
then
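$$p_1(\vec{x}) \, p_2(\vec{x}) \propto \mathcal{N}\left(\vec{\mu_3}, \overleftrightarrow{\Sigma_3}\right)$$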
with
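$$\overleftrightarrow{\Sigma_3} = \left(\overleftrightarrow{\Sigma_1}^{-1} + \overleftrightarrow{\Sigma_2}^{-1}\right)^{-1}$$
$$\vec{\mu_3} = \overleftrightarrow{\Sigma_3} \left(\overleftrightarrow{\Sigma_1}^{-1} \vec{\mu_1} + \overleftrightarrow{\Sigma_2}^{-1} \vec{\mu_2}\right)$$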
Putting it all together with our dynamics update, our two-step update rule is given by (dropping over-arrows for clarity):
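$$\mu_{pred} = T \mu_i$$
$$\Sigma_{pred} = T \Sigma_i T^T$$
$$\Sigma_{i+1} = \left(\Sigma_{pred}^{-1} + \sigma^{-1}\right)^{-1}$$
$$\mu_{i+1} = \Sigma_{i+1} \left(\Sigma_{pred}^{-1} \mu_{pred} + \sigma^{-1} \tilde{x}_{i+1}\right)$$
Here $\mu_{pred}$ and $\Sigma_{pred}$ denote the intermediate mean and covariance after the dynamics step, and the last two lines are the product-of-gaussians result above, applied with the dynamics-updated distribution as one gaussian and the measurement distribution as the other.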
The whole update process can be written in terms of matrix arithmetic.
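To make the matrix arithmetic concrete, here is a minimal sketch of this update loop in Python with numpy. The timestep, measurement covariance, initial distribution, and measurement values below are all illustrative placeholders, and (as in the derivation above) the filter is assumed to measure the full state vector:

```python
import numpy as np

dt = 1.0                                   # illustrative time between measurements
T = np.array([[1.0, dt],                   # transition matrix: d' = d + v*dt, v' = v
              [0.0, 1.0]])
sigma = np.array([[4.0, 0.0],              # assumed (known) measurement covariance
                  [0.0, 1.0]])

mu = np.array([0.0, 0.0])                  # initial state mean: at origin, at rest
Sigma = 100.0 * np.eye(2)                  # large covariance = weak initial knowledge

def kalman_step(mu, Sigma, x_measured):
    """One update: dynamics step, then Bayesian measurement step."""
    # 1. Dynamics update: push the gaussian through the transition matrix
    mu_pred = T @ mu
    Sigma_pred = T @ Sigma @ T.T
    # 2. Measurement update: multiply the predicted gaussian by the
    #    measurement gaussian, using the inverse-covariance (precision) form
    Sigma_new = np.linalg.inv(np.linalg.inv(Sigma_pred) + np.linalg.inv(sigma))
    mu_new = Sigma_new @ (np.linalg.inv(Sigma_pred) @ mu_pred
                          + np.linalg.inv(sigma) @ x_measured)
    return mu_new, Sigma_new

# Feed in a stream of made-up noisy measurements of [position, velocity]
for x_measured in [np.array([1.2, 0.9]),
                   np.array([2.1, 1.1]),
                   np.array([2.9, 1.0])]:
    mu, Sigma = kalman_step(mu, Sigma, x_measured)
    print(mu, np.diag(Sigma))
```

Note how the covariance shrinks with each measurement: each update multiplies in another gaussian, tightening our estimate of the state.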
More Complicated Example
The previous section demonstrated a simple example of a Kalman Filter: we built a gaussian model of a state vector, updated it with fixed dynamics between time steps, and incorporated measurements with gaussian errors. We shall now walk through a model that is made more complicated by making the dynamics non-deterministic and by allowing the dynamics to vary between each time step.
To make things more complicated, instead of just writing:
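$$\vec{x}_{i+1} = \overleftrightarrow{T} \vec{x_i}$$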
we will include:
- A term that represents some time-varying influence on the system outside of the constant dynamics
- Noise associated with the system's dynamics
To include a time-varying influence on the system, we will add a term that can vary at each time step, known as the "influence" term. We will denote it as $\overleftrightarrow{B}\vec{u_i}$, where $\vec{u_i}$ is a vector representing some external influence and $\overleftrightarrow{B}$ is a matrix that transforms the influence into its effect on the state of the system. (In principle, we could just include this as a single vector at each time $t_i$ and not include the $B$ matrix, but we add it here to adhere to convention and also to convey that the source of the influence is different from its effect on the state vector.)
So, our dynamics now becomes:
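$$\vec{x}_{i+1} = \overleftrightarrow{T} \vec{x_i} + \overleftrightarrow{B} \vec{u_i}$$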
with $\overleftrightarrow{T}$ and $\overleftrightarrow{B}$ constant but allowing $\vec{u_i}$ to vary per time step. If $B$ and each $\vec{u_i}$ are known, then this doesn't actually add much complexity to our model; it is merely a known translation of our state vector at each time step, and we can easily include it in the update step of our distribution (or, specifically, the gaussian parameters describing our distribution):
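$$\vec{\mu}_{i+1} = \overleftrightarrow{T} \vec{\mu_i} + \overleftrightarrow{B} \vec{u_i}, \qquad \overleftrightarrow{\Sigma}_{i+1} = \overleftrightarrow{T} \, \overleftrightarrow{\Sigma_i} \, \overleftrightarrow{T}^T$$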
Note that adding this additional vector doesn't change the covariance of the state, it only shifts its mean.
By contrast, adding noise to the dynamics will change the covariance matrix of the state's distribution. For a fixed state vector, one can think of the noise in the following way: first, make a deterministic dynamic update to the state vector, as above. Then, randomly draw from a gaussian noise distribution centered around that newly updated vector, which then becomes our updated state vector.
For a distribution of state vectors, $p_i(\vec{x_i})$, one includes the noise by performing a convolution between the initial state distribution and the noise distribution. From the laws of probability, we know:
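$$p(\vec{a}) = \int p(\vec{a} \,|\, \vec{b}) \, p(\vec{b}) \, d\vec{b}$$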
This allows us to write:
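$$p(\vec{x}_{noise}) = \int p(\vec{x}_{noise} \,|\, \vec{x}) \, p(\vec{x}) \, d\vec{x}$$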
where the first term in the integral, $p(\vec{x}_{noise} | \vec{x})$, is the noise distribution and the second term, $p(\vec{x})$, is the distribution of the state before adding noise (but after doing the deterministic part of the dynamic update). We will assume that this added noise is a gaussian with a mean of 0 and a known covariance matrix $\overleftrightarrow{Q}$. The state vector distribution and the noise are therefore both gaussians, and because the convolution of two gaussians is itself a gaussian, we can write the update step due to zero-mean noise as:
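$$\vec{\mu} \rightarrow \vec{\mu}, \qquad \overleftrightarrow{\Sigma} \rightarrow \overleftrightarrow{\Sigma} + \overleftrightarrow{Q}$$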
In other words, the noise only changes the distribution by adding to the covariance matrix of our distribution of the state vector.
Thus, we can write down the update rule for our more complicated version of the Kalman Filter as:
- Update the state vector mean and covariance matrix deterministically using $\overleftrightarrow{T}$ and $\overleftrightarrow{B}\vec{u_i}$
- Add the noise covariance matrix $\overleftrightarrow{Q}$ to the state vector covariance matrix
- Make a measurement and update the mean and covariance matrices using a Bayesian update
Adding this to our previous rules, we get the full Kalman update procedure as
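$$\vec{\mu}_{pred} = \overleftrightarrow{T} \vec{\mu_i} + \overleftrightarrow{B} \vec{u_i}$$
$$\overleftrightarrow{\Sigma}_{pred} = \overleftrightarrow{T} \, \overleftrightarrow{\Sigma_i} \, \overleftrightarrow{T}^T + \overleftrightarrow{Q}$$
$$\overleftrightarrow{\Sigma}_{i+1} = \left(\overleftrightarrow{\Sigma}_{pred}^{-1} + \overleftrightarrow{\sigma}^{-1}\right)^{-1}$$
$$\vec{\mu}_{i+1} = \overleftrightarrow{\Sigma}_{i+1} \left(\overleftrightarrow{\Sigma}_{pred}^{-1} \vec{\mu}_{pred} + \overleftrightarrow{\sigma}^{-1} \tilde{\vec{x}}_{i+1}\right)$$

In code, this is a small extension of the sketch from the simple example. As before, this is a minimal illustration rather than a production implementation: it assumes full-state measurements, and $\overleftrightarrow{B}$, $\vec{u_i}$, and $\overleftrightarrow{Q}$ are treated as known inputs whose values would come from the specifics of the system being modeled:

```python
import numpy as np

def kalman_step_full(mu, Sigma, x_measured, T, B, u_i, Q, sigma):
    """One full Kalman update: dynamics + influence, dynamics noise, measurement."""
    # 1. Deterministic dynamics update, including the external influence term
    mu_pred = T @ mu + B @ u_i
    # 2. Propagate the covariance and add the dynamics-noise covariance Q
    Sigma_pred = T @ Sigma @ T.T + Q
    # 3. Bayesian measurement update (product of gaussians, precision form)
    Sigma_new = np.linalg.inv(np.linalg.inv(Sigma_pred) + np.linalg.inv(sigma))
    mu_new = Sigma_new @ (np.linalg.inv(Sigma_pred) @ mu_pred
                          + np.linalg.inv(sigma) @ x_measured)
    return mu_new, Sigma_new

# Example call, reusing T, sigma, mu, and Sigma from the earlier sketch:
# mu, Sigma = kalman_step_full(mu, Sigma, np.array([1.2, 0.9]), T,
#                              np.eye(2), np.zeros(2), 0.01 * np.eye(2), sigma)
```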
Conclusion
We have shown that a Kalman Filter is nothing more than a series of Bayesian updates applied to a time series of data measurements. With a model as defined above, these updates can be written in matrix form and performed using linear algebra.
Knowing this general structure, one could apply it to many dynamical systems (not just the ones described above). And while the math may be more difficult for different model assumptions, the overall logic of the technique remains valid.