## Bayes Nordics, an email list

This is an interlude to the series of posts on inference for state space models, which I will restart soon.

I have created an email-list called Bayes Nordics. The goal is to disseminate news on events related to Bayesian analysis, in the Nordic countries.

Workshops, conferences, job openings, courses, even local seminars and meetups. Any Bayes-event happening in the Nordics is welcome.

It is up to the users to provide the content.

Basically, it should become something analogous to what’s being done in the excellent Allstat list (though at a way smaller scale). The intention is to keep the “noise” low. So Bayes Nordics is not a “questions & answers” forum; there are already many good places out there to post questions.

### Why Bayes Nordics

I think we all experienced the frustration of missing an excellent seminar happening in a neighbour University department, which we could have attended if we only knew. Or missed registering for an introductory course in Bayesian statistics just because, well, it’s not always easy to find these information. Many events are simply not widely advertised. Say you don’t use meetup (meetup.com), so you don’t know that there is an afterwork meetup where someone talks about probabilistic machine learning and variational Bayes.

You get the point, the list of examples could go on and on.

So, if you get to know of some interesting event/opportunity, just post it by email at Bayes Nordics. Registration is free and the platform is ads-free (as long as Google keeps it this way).

## Sequential Monte Carlo and the bootstrap filter

In the previous post I have outlined three possibilities for approximating the likelihood function of a state space model (SSM). That post is a required read to follow the topics treated here. I concluded with sequential importance sampling (SIS), which provides a convenient online strategy. However, in some cases SIS suffers from some numerical problems that I will address here. The key concept is the introduction of a resampling step. Then I introduce the simplest example of a sequential Monte Carlo (SMC) algorithm, the bootstrap filter. Recommendations on how to make best use of resampling are discussed.

### Sequential Importance Sampling with Resampling

An important result derived in the previous post is the sequential update of the importance weights. That is, by denoting with $w_t^i$ the importance weight for particle $x_t^i$ we have

$w_t^i\propto \frac{p(y_t|x_t^i)p(x_t^i|x_{t-1}^i)}{h(x_{t}^i|x_{0:t-1}^i,y_{1:t})} w_{t-1}^i \qquad i=1,...,N.$

The weights update allows for an online approximation of the likelihood function of parameters $\theta$, that we can write as

\begin{aligned} \hat{p}(y_t|y_{1:t-1};\theta)&=\frac{1}{N}\sum_{i=1}^N w_t^i\\ \hat{p}(y_{1:T}|\theta)&= \prod_{t=1}^T\biggl (\frac{1}{N}\sum_{i=1}^N w_t^i\biggr ). \end{aligned}

At this point, an approximate maximum likelihood estimate can be obtained by searching for the following (typically using a numerical optimizer)

$\hat{\theta} = \arg\max_\theta \hat{p}(y_{1:T}|\theta)$

or $\hat{p}(y_{1:T}|\theta)$ could be plugged into a Bayesian procedure for sampling from the posterior distribution $\pi(\theta|y_{1:T})$ (more on this in future posts).

Actually, while the procedure above can be successful, in general depending on the type of data and the adequacy of the model we employ to “fit” the data, some crucial problems can occur.

Particle degeneracy: the convenient representation of $w_t^i$ as a function of $w_{t-1}^i$ is problematic when particle $x_{t-1}^i$ produces a very small weight $w_{t-1}^i$. With “very small” I mean that  its numeric value as returned by a computer (floating point representation) is zero, even though its mathematical value is actually positive. This is because computers have a limited ability to represent very small and very large values: if a value is “too small” it will be represented as exactly zero on a computer, and if it is “too large” it will be represented as infinity. The former case, when $w_t^i$ is too small, is said to produce an underflow while a too large weight might produce an overflow. We focus on the underflow, as it is more common. Consider for example the plot below, where at time 44 there is a clear outlier, that is an observation with an unusual value compared to the rest of the data.

This outlier could have been produced by an unusual statistical fluctuation, and at time $t=44$ the datum $y_{t}$ might be a realization from a tail of the distribution with  density $p(y_{t}|x_{t})$. While this is legitimate, numerically it can create issues for the stability of our calculations. Say that $p(y_{t}|x_{t})=N(y_t;x_t,0.5^2)$, that is a Gaussian distribution with mean $x_t$ and standard deviation 0.5. Let’s see what happens if I anticipate some topic treated later on.

Say that I choose as importance density the transition density of the latent process, that is I set $h(x_{t}^i|x_{0:t-1}^i,y_{1:t}) \equiv p(x_{t}^i|x_{t-1}^i)$, then the weight becomes

$w_t^i\propto N(y_t;x_t^i,0.5^2)w_{t-1}^i \qquad i=1,...,N$

with some abuse of notation. Now, for the given choice of proposal density (and if I have been employing a model appropriate for the data) then at the previous time instant 43, most particles will have values somewhere around  30 (see the y-axis in the figure), and we could expect that if $x_{43}^i\approx 30$ then most of the particles $x_{44}^i$ generated from $p(x_{44}^i|x_{43}^i)$ should take values not too far from 30. And here is a problem: from the figure we have $y_{44}\approx 4$ and the computer implementation for the density function of $N(y_{44};x_{44}^i,0.5^2)$ might underflow and return zero for many particles (I will describe some tricks mitigating this problem in a future post). Clearly, all those particles having zero weight doom the values of the descendant particles, because each weight depends on the previous one. A similar scenario could happen even without outliers, if we simulate from a model not appropriate for the data. Clearly, as the time horizon $T$ increases, it is not unlikely to incur in such a problem.

The phenomenon were most particles have very small weights is called particle degeneracy and for many years has prevented these Monte Carlo strategies from being effective. When particle degeneracy occurs the likelihood approximation is poor with a large variance, because the numerical calculation of the underlying integrals is essentially performed by the few particles with non-zero weight.

The striking idea that changed everything is the introduction of a resampling step, whose importance in sequential Monte Carlo was first studied in Gordon et al (1993), based on ideas from Rubin (1987). The idea is simple and ingenious and has revolutionised the practical application of sequential methods. The resulting algorithm is the “sequential importance sampling with resampling” (SISR) but we will just call it a sequential Monte Carlo algorithm.

### Resampling

The resampling idea is to get rid in a principled way of the particles with small weight and multiply the particles with large weight. Recall that generating particles is about exploring regions of the space where the integral has most of its mass. Therefore, we want to focus the computational effort on the “promising” parts of the space. This is easily accomplished if we “propagate forward” from the promising particles, which are those with a non-negligible weight $w_t^i$. We proceed as follows:

• Normalize the weights to sum to 1, that is compute $\tilde{w}_t^i=w_t^i/\sum_{i=1}^N w_t^i$.
• interpret $\tilde{w}_t^i$ as the probability associated to $x_t^i$ in the weighted set $\{x_t^i,\tilde{w}_t^i,i=1,...,N\}$. If this can help, imagine the particles to be balls contained in an urn: some balls are large (large $\tilde{w}_t^i$) others are small.
• Resampling: sample $N$ times with replacement from the weighted set, to generate a new sample of $N$ particles. This means that we put a hand in the urn, extract a ball and record its index $i$ then put the ball back in the urn and repeat the extraction and recording for $N$ times (more on how to implement resampling at the end of this post). Clearly it is more likely to extract large balls than small ones.
• Replace the old particles with the new ones $\{\tilde{x}_t^1,...,\tilde{x}_t^N\}$. Basically, we empty the urn then fill it up again with copies of the balls having the recorded indices. Say that we have extracted index $i=3$ five times, we put in the urn five copies of the ball with index $i=3$.
• Reset all the unnormalised weights to $w_t^i=1/N$ (the resampling has destroyed the information on “how” we reached time $t$).

Since resampling is done with replacement, a particle with a large weight is likely to be drawn multiple times. Particles with very small weights are not likely to be drawn at all. Nice!

The reason why performing a resampling step is not only a numerically convenient trick to overcome (sometimes!) particle degeneracy, but is also probabilistically allowed (i.e. it preserves our goal to obtain an approximation targeting $p(y_t|y_{1:t-1};\theta)$) is illustrated in a great post by Darren Wilkinson.

### Forward propagation

We now move the particles to the next time point $t+1$, that is we propagate forward the state of the system by simulating particles $x_{t+1}^i$. And how do we perform this step? We take advantage of the important particles we have just resampled, using the importance density to compute the move $\tilde{x}^i_t \rightarrow x_{t+1}^i.$
The important fact is that we often propagate from important particles, since these are appearing several times in the urn, because of the resampling step with replacement. Therefore several of the propagated particles $x_{t+1}^i$ might originate from a common “parent” $\tilde{x}_t^i$. For illustration see the picture below.

Read the picture from top to bottom: we start with particles having all the same size, which means they have equal weight $w_t^i=1/N$. Particles are then evaluated on the density $p(y_t|x_t^i)$ depicted with a curve. The particles weight is computed and you can see that some particles have a larger size (large weight), others are smaller and some have “disappeared”, which means that their weight is negligible. At this point the particles are resampled: notice at the resampling stage the largest particle in the figure happens to be resampled three times while others fewer times. Once resampling is performed, all the resampled particles get the same size because all weights are reset to $1/N$ as described above. Now it is time to propagate the resampled particles forward to time $t+1$: we create a new generation of particles by moving forward only the ones we have resampled. This means that some of the particles having very small weight at time $t$ will not be propagated to $t+1$  (notice in the figure, some particles do not have arrows departing from them), and the one that has been resampled three times generates three new particles. This implies that at each generation we still have $N$ particles at our disposal, even though some from the previous generation have “died”. I illustrate the use of the resampling step in the bootstrap filter.

### The bootstrap filter

The bootstrap filter is the simplest example of a sequential importance sampling with resampling (SISR) algorithm. It is the “simplest” application of SISR because it assumes $h(x_{t}^i|x_{0:t-1}^i,y_{1:t}) \equiv p(x_{t}^i|x_{t-1}^i)$, that is the law of the Markov process is used as importance sampler to propagate particles forward. This implies the already mentioned simplification for the computation of the weights, and we have

$w_t^i= \frac{1}{N} p(y_t|x_t^i;\theta) \qquad i=1,...,N.$

Notice that in the equation above we have an equality, instead of a proportionality, since after resampling we set weights to be all equal to $1/N$, hence after resampling $w_{t-1}^i=1/N$.

This approach is certainly very practical and appealing, but it comes at a cost. Generating particles from $p(x_{t}^i|x_{t-1}^i)$ means that these are “blind” to data, since this importance density is unconditional to $y_{1:t}$. Hence the propagation step does not take advantage of any information carried by the data. In some scenario this produces an inefficient sampling that may result, again, in particle degeneracy. A popular alternative is the auxiliary particle filter. I do not go further into the possible improvements over the bootstrap filter however some literature is given at the end of this post.

So here is the bootstrap filter in detail.

1. At $t=0$ (initialize) $x^i_0\sim p(x_0)$ and assign $\tilde{w}_0^i=1/N$, for all $i=1,...,N$.
2. At the current $t$ assume to have the weighted particles $\{x_t^i,\tilde{w}_t^i\}$.
3. From the current sample of particles, resample with replacement $N$ times to obtain $\{\tilde{x}_t^i,i=1,...,N\}$.
4. Propagate forward $x_{t+1}^i\sim p(x_{t+1}|\tilde{x}_{t}^i)$, for all $i=1,...,N$.
5. Compute $w_{t+1}^i=p(y_{t+1}|x_{t+1}^i)$ and normalise weights $\tilde{w}_{t+1}^i={w}_{t+1}^i/\sum_{i=1}^N {w}_{t+1}^i$.
6. Set $t:=t+1$ and if $t go to step 2 otherwise go to 7.
7. Return $\hat{p}(y_{1:T}|\theta)= \prod_{t=1}^T\biggl (\frac{1}{N}\sum_{i=1}^N w_t^i\biggr )$.

Recall that $p(x_0)$ is the unconditional density of an arbitrary initial state $x_0$; this density is set by the modeller (alternatively, $x_0$ can be fixed deterministically and then all particles will propagate from a common $x_0$). Notice in step 5 I wrote $w_{t+1}^i=p(y_{t+1}|x_{t+1}^i)$ instead of $w_{t+1}^i=p(y_{t+1}|x_{t+1}^i)/N$. The two formulations are completely equivalent as they only differ by a constant which is irrelevant for the purpose of assigning a weight to a particle. Also, since weights are going to be normalised, the $1/N$ is not really necessary. However if for some reason it is relevant to have a pointwise estimate of the likelihood (as opposed to e.g. optimizing it over $\theta$), then it is important to recover the constant, and write $\hat{p}(y_t|y_{1:t-1};\theta)=\frac{1}{N}\sum_{i=1}^N w_t^i$. In step 7 I have explicitly reported the likelihood approximation, even though for parameter inference the product of the normalizing constants $1/N^T$ can be disregarded.

The seven steps above are the simplest version of a bootstrap filter, where the resampling is performed at every $t$. However, the resampling adds unwanted variability to the estimate of the likelihood function. This extra variability is not really welcome as it makes the estimate more imprecise (and can affect the performance of the pseudo-marginal algorithms I will describe in future posts).

A standard way to proceed is to resample only when necessary, as given by a measure of potential degeneracy of the sequential Monte Carlo approximation, such as the effective sample size (Liu and Chen, 1998). The effective sample size takes values between 1 and $N$ and at time $t$ is approximated via $ESS = 1/\sum_{i=1}^N (\tilde{w}_t^i)^2$. If the degeneracy at $t$ is too high, i.e. the ESS is below a specified threshold (say below $N/2$) then resampling is performed, otherwise no resampling is performed at time $t$.

Finally notice that SISR (and the bootstrap filter) returns an unbiased estimate of the likelihood function. This is completely unintuitive and not trivial to prove. I will go back to this point in a future post.

### Implement resampling

Coding your own version of a resampling scheme should not be necessary: popular statistical software will probably have built-in functions implementing several resampling algorithms. To my knowledge, the four most popular resampling schemes are: residual resampling, stratified resampling, systematic resampling and multinomial resampling. I mentioned above that resampling adds additional unwanted variability to the likelihood approximation. Moreover, different schemes produce different variability. Multinomial resampling is the one that gives the worse performance in terms of added variance, while residual, stratified and systematic resampling are about equivalent, though systematic resampling is often preferred because easier to implement and fast. See Douc et al. 2005 and Hol et al. 2006.

### Summary

I have addressed the problem known as “particles degeneracy” affecting sequential importance sampling algorithms. I have introduced the concept of resampling, and when to perform said resampling. This produces a sequential importance sampling resampling (SISR) algorithm. Then I have introduced the simplest example of SISR, the bootstrap filter. Finally, I have briefly mentioned some results pertaining resampling schemes.

We now have a partial (though useful starting point for further self study) introduction to particle filters / sequential Monte Carlo methods for approximating the likelihood function of a state space model. We are now ready to consider Bayesian parameter inference, including practical examples.

## Monte Carlo sampling for likelihood approximation in state space models

In a previous post I have set the problem of estimating parameters for a state-space model (SSM). That post is a required read, also because I set some notation. My final goal is to show how to construct exact Bayesian inference  for the vector of parameters $\theta$. To achieve this task, I need to introduce special Monte Carlo methods for multi-dimensional integration. This post rapidly screens some notions of Monte Carlo, importance and sequential importance sampling methods for SSM.

We have already learned that SSM have an intractable likelihood. This means that the analytic expression of the likelihood function for the vector of parameters $\theta$ is not known. We can also say that a likelihood function is intractable when it is difficult to approximate, though this concept is kind of vague. What is considered to be “difficult” is relative: let’s say that the integrals involved in said likelihood cannot be solved using standard numerical methods such as quadrature.

As a reminder, for given data $y_{1:T}$ we want to approximate the likelihood function

$p(y_{1:T}|\theta)=p(y_1|\theta)\prod_{t=2}^T p(y_t|y_{1:t-1};\theta)$

and I have shown that it is possible to write

\begin{aligned} p(y_{1:T}|\theta)&=\int p(y_{1:T},x_{0:T}|\theta)dx_{0:T} = \int p(y_{1:T}|x_{0:T},\theta)\times p(x_{0:T}|\theta)dx_{0:T}\\ &=\int \prod_{t=1}^T p(y_{t}|x_{t},\theta)\times \biggl\{p(x_0|\theta)\prod_{t=1}^T p(x_{t}|x_{t-1},\theta)\biggr\} dx_{0:T}. \end{aligned}

So the question is how to find an approximation for this $T+1$-dimensional integral. We can approach the problem from different angles, all interconnected.

## Standard Monte Carlo integration

We can write $p(y_{t}|y_{1:t-1},\theta)$ as

$p(y_{t}|y_{1:t-1},\theta) = \int p(y_t|x_t,\theta)p(x_t|y_{1:t-1},\theta)dx_t = \mathbb{E}(p(y_t|x_t,\theta)).$

That is the integration problem can be interpreted as taking the expectation of the conditional density $p(y_t|x_t,\theta)$  with respect to the law $p(x_t|y_{1:t-1},\theta)$. This means writing $\mathbb{E}(p(y_t|x_t,\theta))\equiv \mathbb{E}_{Y_t|X_t}(p(y_t|x_t,\theta))$. The interpretation of an integral as a probabilistic expectation is at the core of Monte Carlo methods.

It is natural to approximate an expectation using an empirical mean, as follows: using a computer program, generate independently $N$ draws from $p(x_t|y_{1:t-1},\theta)$, then invoke the law of large numbers.

1. Produce $N$ independent draws $x_t^i\sim p(x_t|y_{1:t-1},\theta)$, $\quad i=1,...,N$.
2. For each $x_t^i$ compute $p(y_t|x_t^i,\theta)$. Notice the conditioning on the sampled $x_t^i$.
3. By the law of large numbers, we have $\frac{1}{N}\sum_{i=1}^N p(y_t|x_t^i,\theta)\approx \mathbb{E}(p(y_t|x_t,\theta))$ and the approximation improves for increasing $N$. In fact we can write $\frac{1}{N}\sum_{i=1}^N p(y_t|x_t^i,\theta)\rightarrow \mathbb{E}(p(y_t|x_t,\theta)), \qquad N\rightarrow\infty$.

The error term for the approximation is $O(N^{-1/2})$ regardless the dimension of $x_t$, which is remarkable. Notice, the convergence property 3 implies that the Monte Carlo estimate is consistent. Another important property, that will be discussed in a  future post, is that the estimate is also unbiased.

The issue here is, how to generate “good” draws (hereafter particles) $x_t^i\sim p(x_t|y_{1:t-1},\theta)$? Here “good” means that we want particles such that the values of $p(y_t|x_t^i,\theta)$ are not negligible. Since these particles are the points where the integrand function $p(y_t|x_t;\theta)$ gets evaluated, we want those particles that are “important” for the evaluation of the integral, and we wish most of the $\{x_t^i\}_{i=1:N}$ to end up in a region of high probability for $p(y_t|x_t;\theta)$.

It turns out that for SSM sequential Monte Carlo (SMC, or “particle filters”) is the winning strategy.

I will not give a thorough introduction to SMC methods. I will only consider a few notions useful to solve our parameter inference problem. I first consider importance sampling as a useful introduction to SMC.

## Importance Sampling

For ease of reading I remove the dependence of quantities on $\theta$. This is a static parameter (i.e. it does not vary with time), and we can consider as implicit the dependence of all probability densities on $\theta$.

Consider the following:

\begin{aligned} p(y_t|y_{1:t-1})&=\int p(y_t|x_{0:t})p(x_{0:t}|y_{1:t-1})dx_{0:t} =\int p(y_t|x_{0:t})p(x_{0:t-1},x_t|y_{1:t-1})dx_{0:t}\\ &=\int p(y_t|x_{0:t})p(x_t|x_{0:t-1},y_{1:t-1})p(x_{0:t-1}|y_{1:t-1})dx_{0:t} = \int p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})dx_{0:t}\\ &= \int \frac{p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})}{h(x_{0:t}|y_{1:t})}h(x_{0:t}|y_{1:t})dx_{0:t} \end{aligned}

where $h(x_{0:t}|y_{1:t})$ is an arbitrary density function called “importance density” that is non-zero whenever $p(x_{0:t}|y_{1:t})$ is non-zero (the support of $h(x_{0:t}|y_{1:t})$ needs to be greater or equal to the support of $p(x_{0:t}|y_{1:t})$). The purpose of introducing the importance density is to use it as a sampler of random numbers, assuming we don’t know how to sample from the $p(x_{0:t-1}|y_{1:t-1})$ appearing in the penultimate equality. Therefore we should choose a $h(\cdot)$ easy to simulate from.
Notice the two crucial simplifications occurring in the fourth equality: we have $p(x_t|x_{0:t-1},y_{1:t-1})\equiv p(x_t|x_{t-1})$ and $p(y_t|x_{0:t})\equiv p(y_t|x_t)$. The first result is from the Markov property defined in the previous post, and the second one states the independence of $y_t$ from the “history of $\{X_s\}$ prior to time $t$when conditioning on $x_t$ (see the graph in the previous post). Conditional independence was also stated in the previous post.

In conclusion, the importance sampling approach approximates the integral using the following Monte Carlo procedure:

• simulate $N$ samples $\quad x_{0:t}^i\sim h(x_{0:t}|y_{1:t})$ independently, $\qquad i=1,...,N$.
• Construct “importance weights” $w_t^i=\frac{ p(y_t|x_t^i)p(x_t^i|x_{t-1}^i)p(x_{0:t-1}^i|y_{1:t-1})}{h(x_{0:t}^i|y_{1:t})}$.
• $p(y_t|y_{1:t-1})=\mathbb{E}(p(y_t|x_t))\approx \frac{1}{N}\sum_{i=1}^N w_t^i$.

Notice that here we wish to simulate $N$ sequences $x_{0:t}^i$ with $x_{0:t}^i=\{x_0^i,...,x_t^i\}$. Clearly, generating at each time $t$ a “cloud of particles” $\{x_{0:t}^i\}_{i=1:N}$ spanning the entire interval $[0,t]$ is not really computationally appealing, and it is not even clear how to construct the importance density. We need some strategy enabling a sort of sequential mechanism. Moreover, even if we are able to simulate the particles, are we able to evaluate all the densities appearing in $w_t^i$? For some complex models we might not know the analytic expressions for $p(x_{0:t-1}|y_{1:t-1})$ and $p(x_t|x_{t-1})$. This is addressed in next post.

## Sequential importance sampling

When the importance density $h(\cdot)$ is chosen intelligently, an important property is the one that allows a sequential update of the importance weights. That is, for an appropriate $h(\cdot)$ we can write

$w_t^i \propto \frac{ p(y_t|x_t^i)p(x_t^i|x_{t-1}^i)}{h(x_{t}^i|x_{0:t-1}^i,y_{1:t})}w_{t-1}^i$

and this sequential update is a key result to make the computation of the $(T+1)$-dimensional integral bearable. Here I am going to show how to obtain the weights update above.

First of all, consider that  it is up to the analyst to design a density $h(\cdot)$ that is appropriate for the model at hand (easy to evaluate, easy to sample from, generating “important” particles). For example, let’s write an importance density as the product of two densities $h_1(\cdot)$ and $h_2(\cdot)$:

$h(x_{0:t}|y_{1:t})=h_1(x_{t}|x_{0:t-1},y_{1:t})h_2(x_{0:t-1}|y_{1:t-1}).$

Notice that while $h_1$ depends on measurements up to time $t$, instead $h_2$ depends on measurements up to $t-1$. This is because we have freedom in designing $h(\cdot)$, as $h(\cdot)$ is not subject to the constraints imposed on the state space model (Markovianity, conditional independence). For example $h_1(x_{t}|x_{0:t-1},y_{1:t})$ does not have to result in $h_1(x_{t}|x_{t-1})$, though this choice is allowed. For simplicity, in the following I remove subscripts for $h_1$ and $h_2$ and just write $h(x_{0:t}|y_{1:t})=h(x_{t}|x_{0:t-1},y_{1:t})h(x_{0:t-1}|y_{1:t-1})$.

I use the decomposition for the importance density to show the sequential update of the importance weights $w_t^i$, though for ease of writing I remove the superscript $i$. We have:

$w_t=\frac{ p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})}{h(x_{0:t}|y_{1:t})} =\frac{ p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})}{h(x_{t}|x_{0:t-1},y_{1:t})h(x_{0:t-1}|y_{1:t-1})}.$

Now, note that we can use the Bayes theorem to write the following

\begin{aligned} p(x_{0:t}|y_{1:t}) &= \frac{p(x_{0:t},y_{1:t})}{p(y_{1:t})}\propto p(y_{1:t-1},y_t|x_{0:t})p(x_{0:t})=p(y_t|x_{0:t},y_{1:t-1})p(x_{0:t}|y_{1:t-1})\\ &=p(y_t|x_{t})p(x_{0:t}|y_{1:t-1})=p(y_t|x_{t})p(x_{t}|x_{0:t-1},y_{1:t-1})p(x_{0:t-1}|y_{1:t-1})\\ &=p(y_t|x_{t})p(x_{t}|x_{t-1})p(x_{0:t-1}|y_{1:t-1}) \end{aligned}

and in the derivation I have used both the conditional independence of observations and the Markovianity of the latent state (see the previous post).

We can use this derivation to write $p(x_{0:t-1}|y_{1:t-1})$, so we can express the weights as

$w_t\propto \frac{ p(y_t|x_t)p(x_t|x_{t-1})p(y_{t-1}|x_{t-1})p(x_{t-1}|x_{t-2})p(x_{0:t-2}|y_{1:t-2})}{h(x_{t}|x_{0:t-1},y_{1:t})h(x_{0:t-1}|y_{1:t-1})}=\frac{p(y_t|x_t)p(x_t|x_{t-1})}{h(x_{t}|x_{0:t-1},y_{1:t})}w_{t-1}$

which is the sequential update of the importance weights we wanted to prove. The sequential update is convenient, as when the time index $t$ advances in the simulation we do not need to produce particles always starting at time $0$, instead we can perform computations online, by only retaining the weights computed at time $t-1$.

Notice I used the proportionality symbol $\propto$ as we do not need to carry the constant term $p(y_{1:t-1})$ resulting from the denominator of the Bayes theorem. This term is independent of $\theta$, hence is not going to be relevant for parameters optimization (maximum likelihood approach) nor in Bayesian sampling from the posterior distribution of $\theta$.

Once particles are simulated, we have  the following approximations (notation-wise, I now recover the parameter $\theta$ I previously removed):

\begin{aligned} \hat{p}(y_t|y_{1:t-1};\theta)&=\frac{1}{N}\sum_{i=1}^N w_t^i\\ \hat{p}(y_{1:T}|\theta)&= \prod_{t=1}^T\biggl (\frac{1}{N}\sum_{i=1}^N w_t^i\biggr ). \end{aligned}

Adding back $\theta$ in the notation is important not to make confusion between the likelihood function ${p}(y_{1:T}|\theta)$ and the “evidence” $p(y_{1:T})$ ,where the latter is really independent of $\theta$ (it’s the denominator of the Bayes theorem).

In conclusion, I have outlined some approaches to approximate the likelihood function, by assuming the ability to sample from some importance density $h(\cdot)$. However, I have skipped discussing how to construct such density (indeed, this is a major and problem dependent issue), and a simple possibility is covered in the next post.

It is fair to say that, even though we managed to derive the relations above with success, in practice the computation of the required quantities does not always end up with a good likelihood approximation. This is investigated in the next post, together with strategies to overcome numerical issues.

## Summary

I have started a quick excursus into Monte Carlo methods for the approximation of the likelihood function for state space models (SSM). I covered a naïve Monte Carlo approach, then importance sampling and finally sequential importance sampling. Each of these topics has been investigated at great length in available literature (there would be so much more to consider, for example quasi-Monte Carlo). However, my focus is to give an idea of possibilities, while quickly moving forward towards introducing the simplest sequential Monte Carlo strategy, which is based on sequential importance sampling. Once sequential Monte Carlo is introduced, I will move to Bayesian parameter estimation for SSM.

Possibilities for self study are endless. Here are a couple of excellent and accessible resources:

## State space models and intractable likelihoods

In this first series of posts, I introduce important tools to construct inference methods for the estimation of parameters $\theta$ in stochastic models. Stochastic models are characterized by randomness in their mathematical nature, and since at first I focus on models having dynamic features, these models are defined by stochastic processes.

I will start by introducing a class of dynamic models known as state space models.

For general (non-linear, non-Gaussian) state space models it is only relatively recently that a class of algorithms for exact parameter inference has been devised, in the Bayesian framework.  In a series of 4-5 posts I will construct the simplest example of this class of pseudo-marginal algorithms, now considered the state-of-art tool for parameter estimation in nonlinear state space models. Pseudo-marginal methods are not exclusively targeting state space models, but are able to produce exact Bayesian inference whenever a positive and unbiased approximation of the likelihood function is available, no matter the underlying model.

I will first define a state space model, then introduce its likelihood function, which turns out to be intractable. I postpone to the next post the construction of Monte Carlo methods for approximating the likelihood function.

### State space models

A very important class of models for engineering applications, signal processing, biomathematics, systems biology, ecology etc., is the class of state-space models (SSM). [In some literature the terms SSM and hidden Markov model (HMM) have been used interchangeably, though some sources make the explicit distinction that in HMM states are defined over a discrete space while in SSM states vary over a continuous space.]

Suppose we are interested in modelling a system represented by a (possibly multidimensional) continuous-time stochastic process $\{X_t\}_{t\geq 0}$, where $X_t$ denotes the state of the system at a time $t\geq 0$. The notation $\{X_t\}_{t\geq 0}$ denotes the ensemble of possible values taken by the system for a continuous time $t\geq 0$.

However, in many experimental situations the experimenter does not have access to  measurements from  $\{X_t\}_{t\geq 0}$ but rather to noisy versions corrupted with “measurement error”. In other words the true state of the system is unknown, because $\{X_t\}_{t\geq 0}$ is latent (unobservable), and we can only get to know something about the system via some noisy measurements. I denote the available measurements (data) with $y_1,y_2,...,y_T$ and use $\{Y_t\}_{t\geq 1}$ to denote the process producing the actual observations at $T$ discrete time points. For simplicity of notation I assume that measurements are obtained at integer observational times $\{1,2...,T\}$. Each $y_t$ can be multidimensional ($t=1,...,T$) but it does not need to have the same dimension of the corresponding $X_t$, for example some coordinate of $\{X_t\}_{t\geq 0}$ might be unobserved. Therefore, the only available information we get from our system is rather partial: (i) the system of interest is continuous in time but measurements are obtained at discrete times and (ii) measurements do not reflect the true state of the system $\{X_t\}_{t\geq 0}$, because the $y_1,...,y_T$ are affected with some measurement noise. For example we could have $y_t=X_t + \varepsilon_t$, with $\varepsilon_t$ some random noise.

In general $\{X_t\}_{t\geq 0}$ and $\{Y_t\}_{t\geq 1}$  can be either continuous– or discrete–valued stochastic processes. However in the following I assume both processes to be defined on continuous spaces.

I use the notation $z_{1:T}$ to denote a sequence $\{z_1,...,z_T\}$. Therefore, data can be written $y_{1:T}$. For the continuous time process $\{X_t\}_{t\geq 0}$ I use $X_{1:T}=\{X_1,...,X_T\}$  to denote realizations of the process at times $\{1,2...,T\}$. Clearly, none of the $X_t$ values is known.

Assume that the dynamics of $\{X_t\}_{t\geq 0}$  are parametrized by a model having a (vector) parameter $\theta$. The value of $\theta$ is unknown and our goal is to learn something about $\theta$ using available data. That is to say, we wish to produce inference about $\theta$. I could write $\{X_t(\theta)\}_{t\geq 0}$  in place of $\{X_t\}_{t\geq 0}$, but this makes the notation a bit heavy.

State space models are characterized by two properties: Markovianity of the latent states and conditional independence of measurements.

Markovianity: $\{X_t\}_{t\geq 0}$ is assumed a Markov stochastic process, with transition density $p(x_t|x_s)$, for $s. That is, “given the present state, the past is independent of the future”, so if time $s$ is the “present”, then $p(x_t|\text{history of } \{X_u\}_{u\leq s}, y_{1:t-1})\equiv p(x_t|x_s)$. Also in this case, for simplicity we assume implicit the conditioning on $\theta$, instead of writing $p(x_t|x_s;\theta)$. Specifically for our inference goals, we are interested in transitions between states corresponding to contiguous (integer) observational times, that is $p(x_t|x_{t-1})$. Also “the past is independent of the future, given the present”, meaning that $p(x_{t-1}|x_{t:T},y_{t:T})=p(x_{t-1}|x_t)$.

Conditional independence: measurements are assumed conditionally independent given a realization of the corresponding latent state $\{X_t\}$. That is, $p(y_t|x_{0:t},y_{1:t-1})=p(y_t|x_t)$.

Markovianity and conditional independence can be represented graphically:

Notice each white node $x_t$ is only able to directly influence the next white node $x_{t+1}$ and $y_t$. Also, each grey node $y_t$ is unable to influence other measurements directly. [This does not mean observations are independent: for example $p(y_2|y_1)=\frac{1}{p(y_1)}\int p(y_1,y_2,x_{1:2})dx_{1:2 }=\frac{1}{p(y_1)}\int p(y_1,y_2|x_{1},x_2)p(x_{1},x_2)dx_{1:2 }=\frac{1}{p(y_1)}\int p(y_2|x_2)p(y_1|x_1)p(x_2|x_1)p(x_1)dx_{1:2}$ and evidently it results in $p(y_2|y_1) \neq p(y_2)$. If equality did hold $y_2$ would be independent of $y_1$.]

Notice $x_0$ is the initial state of the system at some arbitrary time prior to the first observational time $t=1$. By convention we can set this arbitrary time to be $t=0$.

In summary, a compact, probabilistic representation of a state-space model is given by the conditional distribution of the observable variable at $t$ given the latent state, and the distribution of the evolution of the (Markovian) state, that is the transition density. Optionally, the initial state $x_0$ can be a fixed deterministic value or have its own (unconditional) distribution $p(x_0)$ which might depend or not on $\theta$.

\begin{aligned} Y_t | X_t = x_t & \sim p(y_t | x_t;\theta) &\mbox{(Observation density)} \\ X_{t } | X_{t-1} = x_{t-1} & \sim p(x_{t } | x_{t-1};\theta) &\mbox{(Transition density)} \\ X_0 & \sim p(x_0) &\mbox{(Initial distribution)}. \end{aligned}

Example: Gaussian random walk

A trivial example of a (linear, Gaussian) state space model is

\begin{aligned} y_t &= x_{t}+e_{t}, \quad e_{t}\sim_{iid} N(0,r^2)\\ x_t &= x_{t-1}+\xi_{t}, \quad \xi_{t}{\sim}_{iid} N(0,q^2),\quad x_0=0. \end{aligned}

with $\theta=(q,r)$. Therefore

\begin{aligned} p(y_t|x_{t})&=N(y_t;x_{t},r^2)\\ p(x_t|x_{t-1})&=N(x_t;x_{t-1},q^2). \end{aligned}

### Parameter inference and the likelihood function for SSMs

As anticipated, I intend to cover tools for statistical inference for the vector of parameters $\theta$, and in particular discuss Bayesian inference methods for SSM.

This requires introducing some quantities:

• $p(y_{1:T}|\theta)$ is the likelihood function of $\theta$ based on measurements $y_{1:T}$.
• In the Bayesian framework $\theta$ is a random quantity and $\pi(\theta)$ is its prior density (I always assume continuous-valued parameters). It encodes our knowledge about $\theta$ before we “see” the current data $y_{1:T}$.
• The Bayes theorem gives the posterior distribution of $\theta$, enclosing uncertainty about $\theta$ for given data:
$\pi(\theta|y_{1:T})=\frac{p(y_{1:T}|\theta)\pi(\theta)}{p(y_{1:T})}$.
• $p(y_{1:T})$ is the marginal likelihood (evidence), given by $p(y_{1:T})= \int p(y_{1:T}|\theta)\pi(\theta)d\theta$.
• inference based on $\pi(\theta|y_{1:T})$ is called Bayesian inference.

Goal: we wish to produce Bayesian inference on $\theta$. In principle this involves  writing down the analytic expression of $\pi(\theta|y_{1:T})$ and study its properties. However, for models of realistic complexity, what is typically performed is some kind of Monte Carlo sampling of pseudo-random draws from the posterior $\pi(\theta|y_{1:T})$. Then we can have a finite-samples approximation  of the marginal posteriors $\pi_j(\theta|y_{1:T})\equiv \pi(\theta_j|y_{1:T})$ ($j=1,...,\dim(\theta)$) compute the sample means of the marginals, quantiles etc. This way we perform uncertainty quantification for all components of $\theta$, for a given model and available data.

Now, the main problem preventing a straightforward sampling from the posterior, is that for nonlinear, non-Gaussian SSM the likelihood function is not available in closed form nor it is possible to derive a closed-form approximation. It is intractable. Let’s see why.

In a SSM data $y_1,...,y_T$ are not independent, they are only conditionally independent. This means that we cannot write $p(y_1,...,y_T|\theta)$ as a product of unconditional densities: instead we have

$p(y_{1:T}|\theta)=p(y_1|\theta)\prod_{t=2}^Tp(y_{t}|y_{1:t-1},\theta)$

with the convention $p(y_1|y_{1:0},\theta)\equiv p(y_1|\theta)$.

The problem is that all densities in the expression above are unavailable in closed form, hence unknown. If these were available we could either use an algorithm to find a (local) maximizer $\hat{\theta}$ to $p(y_{1:T}|\theta)$ (the maximum likelihood estimate of $\theta$), or plug the likelihood into the posterior $\pi(\theta|y_{1:T})$ and perform Bayesian inference.

The reason for the unavailability of a closed-form expression for the likelihood is the latency of process $\{X_t\}_{t\geq 0}$, on which data depend. In fact we have:

\begin{aligned} p(y_{1:T}|\theta)&=\int p(y_{1:T},x_{0:T}|\theta)dx_{0:T} = \int p(y_{1:T}|x_{0:T},\theta)\times p(x_{0:T}|\theta)dx_{0:T}\\ &=\int \prod_{t=1}^T p(y_{t}|x_{t},\theta)\times \biggl\{p(x_0|\theta)\prod_{t=1}^T p(x_{t}|x_{t-1},\theta)\biggr\} dx_{0:T}. \end{aligned}

The expression above is intractable for two reasons:

• it is a $(T+1)$-dimensional integral, and
• for most (nontrivial) models, the transition densities $p(x_{t}|x_{t-1})$ are unknown.

Basically the only way to solve the integral when $T$ gets large is via Monte Carlo methods. A special case amenable for closed-form computation is the linear SSM with Gaussian noise (see the Gaussian random walk example): for this case the Kalman filter can be employed to return exact maximum likelihood inference. In the SSM literature important (Gaussian) approximations are given by the extended and unscented Kalman filters.

However, for nonlinear and non-Gaussian SSM, sequential Monte Carlo methods (or “particle filters”) are presently the state-of-art methodology. Sequential Monte Carlo (SMC) methods will be considered in a future post. SMC is a convenient tool to implement the pseudo-marginal methodology for exact Bayesian inference that I intend to outline in this first series of posts.

### Summary

I have introduced minimal notions to set the inference problem for parameters of state space models (SSM). The final goal is to summarize a methodology for exact Bayesian inference, the pseudo-marginal method. This will be outlined in future posts, with focus on SSM. I have also stated some of the computational issues preventing a straightforward implementation of likelihood based methods for the parameters of SSM. In the next two posts I consider some Monte Carlo strategies for approximating the likelihood function.