## Monte Carlo sampling for likelihood approximation in state space models

In a previous post I have set the problem of estimating parameters for a state-space model (SSM). That post is a required read, also because I set some notation. My final goal is to show how to construct exact Bayesian inference  for the vector of parameters $\theta$. To achieve this task, I need to introduce special Monte Carlo methods for multi-dimensional integration. This post rapidly screens some notions of Monte Carlo, importance and sequential importance sampling methods for SSM.

We have already learned that SSM have an intractable likelihood. This means that the analytic expression of the likelihood function for the vector of parameters $\theta$ is not known. We can also say that a likelihood function is intractable when it is difficult to approximate, though this concept is kind of vague. What is considered to be “difficult” is relative: let’s say that the integrals involved in said likelihood cannot be solved using standard numerical methods such as quadrature.

As a reminder, for given data $y_{1:T}$ we want to approximate the likelihood function

$p(y_{1:T}|\theta)=p(y_1|\theta)\prod_{t=2}^T p(y_t|y_{1:t-1};\theta)$

and I have shown that it is possible to write

\begin{aligned} p(y_{1:T}|\theta)&=\int p(y_{1:T},x_{0:T}|\theta)dx_{0:T} = \int p(y_{1:T}|x_{0:T},\theta)\times p(x_{0:T}|\theta)dx_{0:T}\\ &=\int \prod_{t=1}^T p(y_{t}|x_{t},\theta)\times \biggl\{p(x_0|\theta)\prod_{t=1}^T p(x_{t}|x_{t-1},\theta)\biggr\} dx_{0:T}. \end{aligned}

So the question is how to find an approximation for this $T+1$-dimensional integral. We can approach the problem from different angles, all interconnected.

## Standard Monte Carlo integration

We can write $p(y_{t}|y_{1:t-1},\theta)$ as

$p(y_{t}|y_{1:t-1},\theta) = \int p(y_t|x_t,\theta)p(x_t|y_{1:t-1},\theta)dx_t = \mathbb{E}(p(y_t|x_t,\theta)).$

That is the integration problem can be interpreted as taking the expectation of the conditional density $p(y_t|x_t,\theta)$  with respect to the law $p(x_t|y_{1:t-1},\theta)$. This means writing $\mathbb{E}(p(y_t|x_t,\theta))\equiv \mathbb{E}_{Y_t|X_t}(p(y_t|x_t,\theta))$. The interpretation of an integral as a probabilistic expectation is at the core of Monte Carlo methods.

It is natural to approximate an expectation using an empirical mean, as follows: using a computer program, generate independently $N$ draws from $p(x_t|y_{1:t-1},\theta)$, then invoke the law of large numbers.

1. Produce $N$ independent draws $x_t^i\sim p(x_t|y_{1:t-1},\theta)$, $\quad i=1,...,N$.
2. For each $x_t^i$ compute $p(y_t|x_t^i,\theta)$. Notice the conditioning on the sampled $x_t^i$.
3. By the law of large numbers, we have $\frac{1}{N}\sum_{i=1}^N p(y_t|x_t^i,\theta)\approx \mathbb{E}(p(y_t|x_t,\theta))$ and the approximation improves for increasing $N$. In fact we can write $\frac{1}{N}\sum_{i=1}^N p(y_t|x_t^i,\theta)\rightarrow \mathbb{E}(p(y_t|x_t,\theta)), \qquad N\rightarrow\infty$.

The error term for the approximation is $O(N^{-1/2})$ regardless the dimension of $x_t$, which is remarkable. Notice, the convergence property 3 implies that the Monte Carlo estimate is consistent. Another important property, that will be discussed in a  future post, is that the estimate is also unbiased.

The issue here is, how to generate “good” draws (hereafter particles) $x_t^i\sim p(x_t|y_{1:t-1},\theta)$? Here “good” means that we want particles such that the values of $p(y_t|x_t^i,\theta)$ are not negligible. Since these particles are the points where the integrand function $p(y_t|x_t;\theta)$ gets evaluated, we want those particles that are “important” for the evaluation of the integral, and we wish most of the $\{x_t^i\}_{i=1:N}$ to end up in a region of high probability for $p(y_t|x_t;\theta)$.

It turns out that for SSM sequential Monte Carlo (SMC, or “particle filters”) is the winning strategy.

I will not give a thorough introduction to SMC methods. I will only consider a few notions useful to solve our parameter inference problem. I first consider importance sampling as a useful introduction to SMC.

## Importance Sampling

For ease of reading I remove the dependence of quantities on $\theta$. This is a static parameter (i.e. it does not vary with time), and we can consider as implicit the dependence of all probability densities on $\theta$.

Consider the following:

\begin{aligned} p(y_t|y_{1:t-1})&=\int p(y_t|x_{0:t})p(x_{0:t}|y_{1:t-1})dx_{0:t} =\int p(y_t|x_{0:t})p(x_{0:t-1},x_t|y_{1:t-1})dx_{0:t}\\ &=\int p(y_t|x_{0:t})p(x_t|x_{0:t-1},y_{1:t-1})p(x_{0:t-1}|y_{1:t-1})dx_{0:t} = \int p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})dx_{0:t}\\ &= \int \frac{p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})}{h(x_{0:t}|y_{1:t})}h(x_{0:t}|y_{1:t})dx_{0:t} \end{aligned}

where $h(x_{0:t}|y_{1:t})$ is an arbitrary density function called “importance density” that is non-zero whenever $p(x_{0:t}|y_{1:t})$ is non-zero (the support of $h(x_{0:t}|y_{1:t})$ needs to be greater or equal to the support of $p(x_{0:t}|y_{1:t})$). The purpose of introducing the importance density is to use it as a sampler of random numbers, assuming we don’t know how to sample from the $p(x_{0:t-1}|y_{1:t-1})$ appearing in the penultimate equality. Therefore we should choose a $h(\cdot)$ easy to simulate from.
Notice the two crucial simplifications occurring in the fourth equality: we have $p(x_t|x_{0:t-1},y_{1:t-1})\equiv p(x_t|x_{t-1})$ and $p(y_t|x_{0:t})\equiv p(y_t|x_t)$. The first result is from the Markov property defined in the previous post, and the second one states the independence of $y_t$ from the “history of $\{X_s\}$ prior to time $t$when conditioning on $x_t$ (see the graph in the previous post). Conditional independence was also stated in the previous post.

In conclusion, the importance sampling approach approximates the integral using the following Monte Carlo procedure:

• simulate $N$ samples $\quad x_{0:t}^i\sim h(x_{0:t}|y_{1:t})$ independently, $\qquad i=1,...,N$.
• Construct “importance weights” $w_t^i=\frac{ p(y_t|x_t^i)p(x_t^i|x_{t-1}^i)p(x_{0:t-1}^i|y_{1:t-1})}{h(x_{0:t}^i|y_{1:t})}$.
• $p(y_t|y_{1:t-1})=\mathbb{E}(p(y_t|x_t))\approx \frac{1}{N}\sum_{i=1}^N w_t^i$.

Notice that here we wish to simulate $N$ sequences $x_{0:t}^i$ with $x_{0:t}^i=\{x_0^i,...,x_t^i\}$. Clearly, generating at each time $t$ a “cloud of particles” $\{x_{0:t}^i\}_{i=1:N}$ spanning the entire interval $[0,t]$ is not really computationally appealing, and it is not even clear how to construct the importance density. We need some strategy enabling a sort of sequential mechanism. Moreover, even if we are able to simulate the particles, are we able to evaluate all the densities appearing in $w_t^i$? For some complex models we might not know the analytic expressions for $p(x_{0:t-1}|y_{1:t-1})$ and $p(x_t|x_{t-1})$. This is addressed in next post.

## Sequential importance sampling

When the importance density $h(\cdot)$ is chosen intelligently, an important property is the one that allows a sequential update of the importance weights. That is, for an appropriate $h(\cdot)$ we can write

$w_t^i \propto \frac{ p(y_t|x_t^i)p(x_t^i|x_{t-1}^i)}{h(x_{t}^i|x_{0:t-1}^i,y_{1:t})}w_{t-1}^i$

and this sequential update is a key result to make the computation of the $(T+1)$-dimensional integral bearable. Here I am going to show how to obtain the weights update above.

First of all, consider that  it is up to the analyst to design a density $h(\cdot)$ that is appropriate for the model at hand (easy to evaluate, easy to sample from, generating “important” particles). For example, let’s write an importance density as the product of two densities $h_1(\cdot)$ and $h_2(\cdot)$:

$h(x_{0:t}|y_{1:t})=h_1(x_{t}|x_{0:t-1},y_{1:t})h_2(x_{0:t-1}|y_{1:t-1}).$

Notice that while $h_1$ depends on measurements up to time $t$, instead $h_2$ depends on measurements up to $t-1$. This is because we have freedom in designing $h(\cdot)$, as $h(\cdot)$ is not subject to the constraints imposed on the state space model (Markovianity, conditional independence). For example $h_1(x_{t}|x_{0:t-1},y_{1:t})$ does not have to result in $h_1(x_{t}|x_{t-1})$, though this choice is allowed. For simplicity, in the following I remove subscripts for $h_1$ and $h_2$ and just write $h(x_{0:t}|y_{1:t})=h(x_{t}|x_{0:t-1},y_{1:t})h(x_{0:t-1}|y_{1:t-1})$.

I use the decomposition for the importance density to show the sequential update of the importance weights $w_t^i$, though for ease of writing I remove the superscript $i$. We have:

$w_t=\frac{ p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})}{h(x_{0:t}|y_{1:t})} =\frac{ p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})}{h(x_{t}|x_{0:t-1},y_{1:t})h(x_{0:t-1}|y_{1:t-1})}.$

Now, note that we can use the Bayes theorem to write the following

\begin{aligned} p(x_{0:t}|y_{1:t}) &= \frac{p(x_{0:t},y_{1:t})}{p(y_{1:t})}\propto p(y_{1:t-1},y_t|x_{0:t})p(x_{0:t})=p(y_t|x_{0:t},y_{1:t-1})p(x_{0:t}|y_{1:t-1})\\ &=p(y_t|x_{t})p(x_{0:t}|y_{1:t-1})=p(y_t|x_{t})p(x_{t}|x_{0:t-1},y_{1:t-1})p(x_{0:t-1}|y_{1:t-1})\\ &=p(y_t|x_{t})p(x_{t}|x_{t-1})p(x_{0:t-1}|y_{1:t-1}) \end{aligned}

and in the derivation I have used both the conditional independence of observations and the Markovianity of the latent state (see the previous post).

We can use this derivation to write $p(x_{0:t-1}|y_{1:t-1})$, so we can express the weights as

$w_t\propto \frac{ p(y_t|x_t)p(x_t|x_{t-1})p(y_{t-1}|x_{t-1})p(x_{t-1}|x_{t-2})p(x_{0:t-2}|y_{1:t-2})}{h(x_{t}|x_{0:t-1},y_{1:t})h(x_{0:t-1}|y_{1:t-1})}=\frac{p(y_t|x_t)p(x_t|x_{t-1})}{h(x_{t}|x_{0:t-1},y_{1:t})}w_{t-1}$

which is the sequential update of the importance weights we wanted to prove. The sequential update is convenient, as when the time index $t$ advances in the simulation we do not need to produce particles always starting at time $0$, instead we can perform computations online, by only retaining the weights computed at time $t-1$.

Notice I used the proportionality symbol $\propto$ as we do not need to carry the constant term $p(y_{1:t-1})$ resulting from the denominator of the Bayes theorem. This term is independent of $\theta$, hence is not going to be relevant for parameters optimization (maximum likelihood approach) nor in Bayesian sampling from the posterior distribution of $\theta$.

Once particles are simulated, we have  the following approximations (notation-wise, I now recover the parameter $\theta$ I previously removed):

\begin{aligned} \hat{p}(y_t|y_{1:t-1};\theta)&=\frac{1}{N}\sum_{i=1}^N w_t^i\\ \hat{p}(y_{1:T}|\theta)&= \prod_{t=1}^T\biggl (\frac{1}{N}\sum_{i=1}^N w_t^i\biggr ). \end{aligned}

Adding back $\theta$ in the notation is important not to make confusion between the likelihood function ${p}(y_{1:T}|\theta)$ and the “evidence” $p(y_{1:T})$ ,where the latter is really independent of $\theta$ (it’s the denominator of the Bayes theorem).

In conclusion, I have outlined some approaches to approximate the likelihood function, by assuming the ability to sample from some importance density $h(\cdot)$. However, I have skipped discussing how to construct such density (indeed, this is a major and problem dependent issue), and a simple possibility is covered in the next post.

It is fair to say that, even though we managed to derive the relations above with success, in practice the computation of the required quantities does not always end up with a good likelihood approximation. This is investigated in the next post, together with strategies to overcome numerical issues.

## Summary

I have started a quick excursus into Monte Carlo methods for the approximation of the likelihood function for state space models (SSM). I covered a naïve Monte Carlo approach, then importance sampling and finally sequential importance sampling. Each of these topics has been investigated at great length in available literature (there would be so much more to consider, for example quasi-Monte Carlo). However, my focus is to give an idea of possibilities, while quickly moving forward towards introducing the simplest sequential Monte Carlo strategy, which is based on sequential importance sampling. Once sequential Monte Carlo is introduced, I will move to Bayesian parameter estimation for SSM.