Why and how pseudo-marginal MCMC work for exact Bayesian inference

I will describe a breakthrough strategy for “exact-approximate” Bayesian inference. The apparent contradiction in the terminology is due to the surprising result in Beaumont (2003) and Andrieu and Roberts (2009) where it is shown that plugging a non-negative and unbiased stochastic approximation of the likelihood function into an MCMC sampler is enough to obtain exact Bayesian inference for model parameters.

The pseudo-marginal approach

I have decided to write about pseudo-marginal MCMC methods since the original literature introducing these methods can be fairly intimidating. Though this need not be. I will show how it is very simple to prove the validity of the approach. Theoretical difficulties may arise depending on the type of pseudo-marginal method we choose to consider for a specific case study. However the basic idea is fairly simple to describe and I proceed at a slow step-by-step pace. Other accessible intro to the topic are in the Resources section at the end of this post.

Suppose we wish to use Markov chain Monte Carlo (MCMC) to sample from a distribution having probability density (probability mass) function \pi(x). Suppose that \pi(x) is not available in closed form and that its pointwise evaluation is not possible at any given x. Finally, suppose that a non-negative and “unbiased” random estimator of \pi(x) is available (we define what we mean with unbiased estimator further below), and we denote this estimator with \hat{\pi}(x). Then an intuitive thing to do is to use \hat{\pi}(x) in place of {\pi}(x) into a Metropolis-Hastings algorithm. This way the (approximate) Metropolis-Hastings ratio for a proposal x'\sim q(x'|x) becomes (see this introductory post to Metropolis-Hastings if necessary)

\hat{r}:=\frac{\hat{\pi}(x')}{\hat{\pi}(x)}\frac{q(x|x')}{q(x'|x)}

and x' is accepted with probability \min(1,\hat{r}). Although it is not evident yet, what we have just written is an example of a pseudo-marginal method, but I will clarify this point below. Therefore in practice applying the method is relatively simple.

The astonishing result provided by a pseudo-marginal method is that the Markov chain constructed as above — where proposals x' are accepted according to \hat{r} instead of the true and unavailable acceptance ratio r (where r uses the exact \pi(\cdot)) — has \pi(x) as stationary distribution, the “exact” target. Therefore, we can use the approximate \hat{\pi}(x) into an MCMC sampler, while still obtaining exact Monte Carlo sampling from \pi(x).

In this post whenever we write “exact sampling” we mean “exact up to Monte Carlo error”, where the latter can be arbitrarily decreased by increasing the number of MCMC iterations.

A simple construction

We want to show that the Markov chain consisting in proposals accepted with probability \min(1,\hat{r}) has stationary distribution \pi(x), that is as R\rightarrow \infty the generated chain is sampled from \pi(x), with R the number of MCMC iterations.

First of all we need to obtain a non-negative and unbiased random estimator of \pi(x). We now take the opportunity to define what we mean with “unbiased estimator” in this context.

Unbiasedness

An unbiased estimator \hat{\pi}(x) of \pi(x) should be such that \mathbb{E}(\hat{\pi}(x))=\pi(x) for every x, where \mathbb{E} denotes  expectation. However which probability measure do we consider for this expectation? Answering this question shows why the topic of this post is particularly relevant for sampling from intractable distributions or, in Bayesian inference, sampling from posterior distributions having intractable likelihoods.

Before trying to find unbiased estimators \hat{\pi}(x) let’s first see what unbiasedness mean when using Monte Carlo sampling.

It is often the case that while \pi(x) is unknown in closed form, it is instead easy to write the joint distribution of x and some other random quantity \xi we are not really interested in (we call it a “disturbance parameter” or an “auxiliary variable”). Let’s \pi(x,\xi) be their joint distribution. Assume the domain of the marginalized \xi to be continuous (without loss of generality), then marginalization implies integration and

\pi(x)=\int \pi(x,\xi)d\xi

where basically the “unwanted” source of variability \xi is integrated out. But then again we assume that analytic integration is out of reach.

A possibility to approximate the integral above is given by Monte Carlo methods, resulting into a stochastic approximation that we show to be unbiased. We simulate N iid samples \xi^1,...,\xi^N from the distribution of \xi which we denote with p(\xi)

{\pi}(x)=\int {\pi}(x,\xi)d\xi =\int {\pi}(x|\xi)p(\xi)d\xi  \approx \frac{1}{N}\sum_{n=1}^N {\pi}(x|\xi^n), \quad \xi^n\sim_{\mathrm{iid}} p(\xi),\quad n=1,...,N

The second identity rewrites the joint distribution using both conditional probabilities and p(\xi). Finally we  carry out Monte Carlo integration. Notice, we do not need to know the expression of p(\xi), but we need to be able to simulate from it. Clearly, the above assumes that we are able to evaluate \pi(x|\xi) at every x (this construction naturally applies to state-space models, when we consider x as “data” and the \xi^1,...,\xi^N as “particles”).

From the second identity it should be clear that \pi(x) can be considered as the expectation of \pi(x|\xi)  taken with respect to p(\xi), as we have

{\pi}(x)=\int {\pi}(x|\xi)p(\xi)d\xi = \mathbb{E}_\xi [{\pi}(x|\xi)] and the notation \mathbb{E}_\xi emphasizes this fact.

Finally, it is easy to show that the Monte Carlo approximation gives an unbiased estimator of \pi(x), as for any simulated \xi^n we have shown that \mathbb{E}_\xi [{\pi}(x|\xi^n)] = \pi(x) and therefore by denoting with \hat{\pi}(x) the approximated density returned by Monte Carlo we define

\hat{\pi}(x):=\frac{1}{N}\sum_{n=1}^N {\pi}(x|\xi^n) and

\mathbb{E}_\xi [\hat{\pi}(x)]=\mathbb{E}_\xi [\frac{1}{N}\sum_{n=1}^N {\pi}(x|\xi^n)]=\frac{1}{N}\sum_{n=1}^N \mathbb{E}_\xi [{\pi}(x|\xi^n)]=\pi(x)

this holding for any x and for any number of samples N\geq 1. Also, the estimate is clearly non-negative.

Having discussed unbiasedness for simple Monte Carlo it is now easy to show how to use this fact for statistical inference.

Pseudo-marginal MCMC for exact Bayesian inference with intractable likelihoods

In the previous section we have considered a generic \pi(x). Here we use Metropolis-Hastings in a Bayesian context to simulate from a specific distribution, the posterior distribution \pi(\theta|y) for a parameter \theta\in\Theta, given data y. From Bayes’ theorem we have  \pi(\theta|y)=\pi(\theta)p(y|\theta)/\pi(y), with \pi(\theta) the prior of \theta, p(y|\theta) the likelihood function and p(y) the “evidence” (if you are interested in state-space models you might want to check an earlier post).

In most realistic modelling scenario, the likelihood function is unavailable in closed form and we say that the statistical model we are considering has an “intractable likelihood”. However it might be possible to obtain an unbiased non-negative approximation of the likelihood \hat{p}(y|\theta). We show that in this case it is also possible to obtain R samples \theta_1,...,\theta_R from \pi(\theta|y), that is from the exact posterior of \theta.

The previous section used Monte Carlo integration, by independently sampling from a distribution employing \xi\in \mathcal{A}, an “auxiliary” variable we do not care for. Also in this case we consider sampling from an “augmented” space, which is the space \Theta\times \mathcal{A} over which the pair (\theta,\xi) is defined.

Therefore we now artifically sample from the augmented approximate posterior (it is approximate because we have an approximate likelihood) \hat{\pi}(\theta,\xi|y) even if we are ultimately interested in \pi(\theta|y). In this context, in order to write the joint (augmented) posterior we need to introduce the prior for \xi, which we call p(\xi). Recall we are not interested in conducting inference for \xi, therefore the definition of its prior is only guided by convenience as we are required to be able to sample from p(\xi). Also, we assume \theta and \xi a-priori independent, that is their joint prior can be written as \pi(\theta)p(\xi).

As mentioned above, we assume the availability of a non-negative unbiased approximation to the likelihood which we denote \hat{p}(y|\theta). By introducing the usual auxiliary variable, this approximation can be written as

\hat{p}(y|\theta)=\int \hat{p}(y,\xi|\theta)d\xi=\int \hat{p}(y|\xi,\theta)p(\xi|\theta)d\xi=\int \hat{p}(y|\xi,\theta)p(\xi)d\xi.

Notice we took p(\xi|\theta)\equiv p(\xi) bacause of the assumed a-priori independence between \theta and \xi.

Then we require the key assumption of unbiasedness. Again we consider \hat{p}(y|\theta) as an \mathbb{E}_\xi-expectation and we want that

\hat{p}(y|\theta)= \mathbb{E}_\xi(\hat{p}(y|\theta,\xi))=\int \hat{p}(y|\theta,\xi)p(\xi)d\xi=p(y|\theta).

Now that we have set these requirements on the likelihood function, we proceed to write the approximate augmented posterior:

\hat{\pi}(\theta,\xi|y)=\frac{\hat{p}(y|\xi,\theta)p(\xi)\pi(\theta)}{p(y)}

[Notice we put p(y) not \hat{p}(y) at the denominator: this follows from the unbiasedness assumption as we have \int\int \hat{p}(y|\theta,\xi)p(\xi)\pi(\theta)d\xi d\theta = \int \pi(\theta)\{\int \hat{p}(y|\theta,\xi)p(\xi)d\xi\}d\theta=\int \pi(\theta)p(y|\theta)d\theta and the latter is by definition the evidence p(y).]

We know the exact (unavailable) posterior of \theta is

{\pi}(\theta|y)=\frac{{p}(y|\theta)\pi(\theta)}{p(y)}

therefore the marginal likelihood (evidence) is

p(y)=\frac{p(y|\theta)\pi(\theta)}{\pi(\theta|y)} and then we plug the evidence into \hat{\pi}(\theta,\xi|y):

\hat{\pi}(\theta,\xi|y)=\frac{\hat{p}(y|\theta,\xi)p(\xi)\pi(\theta)}{p(y)}=\frac{\pi(\theta|y)\hat{p}(y|\theta,\xi)p(\xi)\pi(\theta)}{p(y|\theta)\pi(\theta)}=\frac{\pi(\theta|y)\hat{p}(y|\theta,\xi)p(\xi)}{p(y|\theta)}.

Now, we know that applying an MCMC targeting \hat{\pi}(\theta,\xi|y) then discarding the output pertaining to \xi corresponds to integrating-out \xi from the posterior, that is marginalize \xi out. Therefore, assuming we use MCMC to perform such marginalization, the draws \theta_1,...,\theta_R we are returned with have the following stationary distribution

\int\hat{\pi}(\theta,\xi|y)d\xi = \frac{\pi(\theta|y)}{p(y|\theta)}\int \hat{p}(y|\theta,\xi)p(\xi)d\xi=\frac{\pi(\theta|y)}{p(y|\theta)}p(y|\theta)=\pi(\theta|y).

Notice we have recovered the exact posterior of \theta.

We have therefore performed a pseudo-marginal approach: “marginal” because we disregard \xi; pseudo because we use \hat{p}(\cdot) not p(\cdot).

In conclusion we have shown that using MCMC on an (artificially) augmented posterior, then discard from the output all the auxiliary variates we generated during the entire MCMC procedure, implies that the sampled parameter draws \theta_1,...,\theta_R are the product of an exact Bayesian inference procedure for \theta.

Besides the theoretical construction we have described,  in practice all that is needed is running a regular MCMC for \theta, without caring of storing the generated samples for \xi.

Follow up

Pseudo-marginal MCMC methods are due to Beaumont (2003) and Andrieu and Roberts (2009) and have revolutionised the application of Bayesian inference for models having intractable likelihoods. In my opinion they constitute one of the most important statistical advancements of the last thirty years.

In particular, the methodology has found an enormous success in inference for state-space models, as sequential Monte Carlo produces unbiased approximations of the likelihood for any number of particles N (though unbiasedness is not trivial to prove in this case, but see Pitt et al and Proposition 9.4.1, page 301 of Del Moral (2004)). It is even possible to obtain exact inference simultaneously for the states and the parameters of state-space models.

However, as shown above, the generality of the approach makes pseudo-marginal methods appealing for a wide range of scenarios, including approximate Bayesian computation and other methods for likelihood-free inference such as synthetic likelihoods.

Of extreme importance is the fact that the results presented above hold irrespectively of how we have obtained the unbiased approximation of the likelihood. If \hat{p}(y|\theta) has been approximated via Monte Carlo using a “large” number N of auxiliary variates or a “small” N, the theoretical result still hold and we sample exactly from \pi(\theta|y) regardless of N. However, for practical applications tuning N appropriately is important, as its value influence the precision of the likelihood approximation which in turn has an effect on the mixing of the chain. See this post and references therein.

Some resources

 

Posted in Uncategorized | Tagged , , , , , , , , , , , , , , | 3 Comments

Tips for coding a Metropolis-Hastings sampler

I will suggest several tips, and discuss common beginner’s mistakes occuring when coding from scratch a Metropolis-Hastings algorithm. I am making this list from the top of my mind, so feel free to propose suggestions by commenting to this post. Most items are related to coding practice rather than actual statistical methodology, and are often not mentioned in widely available literature. If you are an instructor, you might find it useful to mention these issues in your class, to avoid students getting puzzled when coding mathematically flawless algorithms having a misbehaving implementation.

Notice, this post does not discuss the design of a Metropolis-Hastings sampler: that is important issues affecting MCMC, such as bad mixing of the chain, multimodality, or the construction of “ad hoc” proposal distributions are not considered.

I am assuming the reader already knows what a Metropolis-Hastings algorithm is, what it is useful for, and that she has already seen (and perhaps tried coding) some implementation of this algorithm. Even given this background knowledge, there are some subtleties that are good to know and that I will discuss below.

Briefly, what we wish to accomplish with Markov chain Monte Carlo (MCMC) is to sample (approximately) from a distribution having density or probability mass function \pi(x). Here we focus on the Metropolis-Hastings algorithm (MH). In order to do so, assume that our Markov chain is currently in “state” x, and we wish to propose a move x\rightarrow x'. This move is generated from a proposal kernel q(\cdot|x). We write the proposed value x' as x'\sim q(x'|x). Then x' is accepted with probability \alpha or rejected with probability 1- \alpha where

\displaystyle  \alpha=\min\biggl\{1,\frac{\pi(x')}{\pi(x)}\frac{q(x|x')}{q(x'|x)}\biggr\} .

Then \alpha is called acceptance probability.

A very simplified pseudo-algorithm for a single step of MH looks like:

propose x'\sim q(x'|x)
accept x' with probability \alpha
otherwise reject x'

However, since we wish to sample say R times from \pi(x), we better store the result of our simulations into a matrix \texttt{draws} and write (assuming an arbitrary initial state x_0)

for r = 1 to R
   propose x'\sim q(x'|x_{r-1}) 
   compute \alpha=\min\biggl\{1,\frac{\pi(x')}{\pi(x_{r-1})}\frac{q(x_{r-1}|x')}{q(x'|x_{r-1})}\biggr\} 
   accept x' with probability \alpha 
          then store it into draws[r]:= x'
          set \pi(x_r):=\pi(x'), set x_r:=x' 
   otherwise store draws[r]:= x_{r-1}
end

Here \texttt{:=} means “copy the value at the right-hand-side into the left-hand-side”. After a few (sometimes many) iterations the chain  reaches stationarity, and its stationary distribution happens to have  \pi(x) as probability density (or probability mass function).  In the end we will explore the content of \texttt{draws} and disregard some of the initial values, to wash-away the influence of the arbitrary x_0.

1- You are given the pseudo-code above but it’s still unclear how to use it for sampling from \pi(x)

The very first time I heard of MH (self study and no guidance available) was in 2008. So I thought “ok let’s google some article!” and I recall that all resources I happened to screen were simply stating (just as I wrote above) “accept the move with probability \alpha and reject otherwise”. Now, if you are new to stochastic simulation the quoted sentence looks quite mysterious. The detail that was missing in my literature search was the logical link between the concept of acceptance probability and that of sampling.

Here is the link: you have to randomly decide between two possible outcomes of a random decision: accept or reject a move, and each of these two outcomes has its own probability (\alpha for accepting and 1-\alpha for rejecting).  So  our decision is a random variable taking two possible values, hence is a Bernoulli random variable with “parameter” \alpha. And here is the catch: to sample from a discrete distribution (like a Bernoulli distribution) you can use the inverse transform method, which in the special case of a binary decision translates to  (i) sample a pseudo-random number u from a uniform distribution on the interval (0,1), that is u\sim U(0,1) which can be obtained in R via \texttt{runif(1)}, and (ii) if u<\alpha accept x' otherwise reject. That’s it (*).

(*) Well, yes that’s it. However the coding of the implementation can be simplified to be just: if u<\frac{\pi(x')}{\pi(x)}\frac{q(x|x')}{q(x'|x)} then accept x' otherwise reject it. This is because u\in (0,1) hence whenever \pi(x')q(x|x')/\pi(x)q(x'|x) is larger than 1, then certainly u< \pi(x')q(x|x')/\pi(x)q(x'|x).

In conclusion for a single MH step we substitute (and for simplicity here we remove the dependency on the iteration index r we introduced in the previous code)

compute \alpha=\min\biggl\{1,\frac{\pi(x')}{\pi(x)}\frac{q(x|x')}{q(x'|x)}\biggr\} 
accept x' with probability \alpha

with

simulate u\sim U(0,1)
if u<\frac{\pi(x')}{\pi(x)}\frac{q(x|x')}{q(x'|x)}
   then \pi(x):=\pi(x') and x:=x'

2- Do not keep only the accepted draws!

A common mistake is not advancing the iteration counter until we have an accept. This is wrong (hence if you ever coded a MH algorithm using a \texttt{while} loop it is fairly likely you may have acted the wrong way).  In the implementation given in the introduction to this post (the one using the \texttt{for} loop) you may notice that when a proposal is rejected I store the last accepted value as \texttt{draws[r]}:= x_{r-1}.

3- Save and reuse the values at the denominator

If you compare the pseudo-codes suggested above you may have noticed that when a proposal is accepted I “save” \pi(x):=\pi(x'), in addition to storing the accepted proposal x'. While it is not necessary to do the former (and in fact most descriptions of MH in the literature do not mention this part) it is a waste of computer time not to keep in memory the accepted \pi(x'), so that we do not need to recompute it at the next iteration (since it is reused at the denominator of \alpha).

This is especially relevant when MH is used for Bayesian inference, where \pi(x) is a posterior distribution \pi(x|\mathrm{data}) such that \pi(x|\mathrm{data})\propto p(\mathrm{data}|x)p(x) for some “prior” p(x) and “likelihood function” p(\mathrm{data}|x). In fact, it is often the case that the pointwise evaluation of the likelihood is far from being instantaneous, e.g. because the data has thousands of values, or the evaluation of the likelihood itself requires some numerical approximation.

4- Code your calculations on the log-domain.

Depending on cases, \pi(x) may sometimes take very large or very small values. Such extremely large/small values may cause a numeric overflow or an underflow, that is your computing environment will be unable to represent these numbers correctly. For example a standard Gaussian density is strictly positive for all real x, but if you evaluate it at x=40 your software will likely return zero (this is an underflow).

For example, in R type \texttt{dnorm(40)} and you’ll see it returns zero .

Therefore instead of using

if u<\frac{\pi(x')}{\pi(x)}\frac{q(x|x')}{q(x'|x)} accept x' 
   
   otherwise reject

use

if \log(u)<\log\pi(x')-\log \pi(x)+\log q(x|x')-\log q(x'|x) accept x' 
   
   otherwise reject

Anyhow, while this recommendation is valid, it might not be enough without further precautions, as described in the next point.

5- Do as many analytic calculations as feasible in order to prevent overflow/underflow.

Strongly connected to the previous point is coding \log(\pi(x)) without doing your calculations analytically first. That is, I see beginners trying to use the log-domain suggestion from point 4 to evaluate \texttt{log(dnorm(40))}. This will not help because the log will be applied to a function \texttt{dnorm(40)} that has already underflown, and the logarithm of zero is taken.

In R you can use \texttt{dnorm(40,log=TRUE)} to compute \mathrm{logpdf}=-0.5\log(2\pi)-0.5x^2 which is the safe way to evaluate the (normalised) standard Gaussian log-density. For other languages you may need to code the log-density yourself.

However, since we are interested in using MH, we may even just code the unnormalised log-density

\mathrm{logpdf}=-0.5x^2.

where I have omitted the irrelevant (for MH) normalization constant -0.5\log(2\pi). This brings us to the next point.

6- Remove proportionality constants.

An important feature of MH is that it can sample from distributions that are known up to proportionality constants (these constants are terms independent of x). This means that \pi(x) can be an unnormalised density (or an unnormalised probability mass function). As mentioned in the previous point, we do not strictly need to write   \texttt{logpdf}=-0.5\log(2\pi)-0.5x^2 but can get rid of the normalization constant -0.5\log(2\pi) that anyhow simplifies out when taking the ratio \pi(x')/\pi(x).

The simplification of proportionality constants is not just a matter of coding elegance, but can affect the performance of the algorithm. For example if \pi(x) is a Gamma distribution, such as x\sim \mathrm{Gamma}(a,b) then \pi(x) contains Gamma functions \Gamma(a) as part of the proportionality constant. When a is not an integer but is a positive real number the evaluation of \Gamma(a) involves the numerical approximation of an integral. Now, this approximation is not honerous if executed only a few times. However, often MCMC algorithms are run for millions of iterations, and in such cases we can save computational time by removing constants we don’t need anyway. In conclusion, you can safely avoid coding \Gamma(a)  (or its logarithm) as it is independent of x.

7- Simplify out symmetric proposal functions

A symmetric proposal function is such that q(x'|x)\equiv q(x|x') for every x and x'. Therefore symmetric proposals do not need to be coded when evaluating the ratio in \alpha as they simplify out.

For example, assume a multivariate Gaussian proposal function written as (with some abuse of notation) q(x'|x)=\mathcal{N}(x,\Sigma), that is a Gaussian density with mean x and covariance matrix \Sigma. Then we have

q(x'|x)=\frac{1}{(2\pi)^{1/2}|\Sigma|^{1/2}}\exp(-\frac{1}{2}(x'-x)^T\Sigma^{-1}(x'-x))

where |\cdot| denotes is the determinant of a matrix and T denotes transposition. Clearly this q(\cdot) is symmetric because of the quadratic form at the exponent.

8a- Remember the transformation Jacobians!

Most previous recommendations are related to numerical issues in the implementation of MH. However this one alerts you of an actual statistical mistake. I introduce the problem using a specific example.

A Gaussian proposal function q(\cdot|x) is a simple choice that often works well in problems where x is low dimensional (e.g. x is a vector of length up to, say, five elements) and the geometry of \pi(x) is not too complicated. The fact that a Gaussian proposal has support on the entire real space makes it a useful default choice. However, when some elements of x are constrained to “live” in a specific subset of the real space, we should be aware that a significant proportion of proposed values will be rejected (unless we “design” q(\cdot) to intelligently explore only a promising region of the space). Wasting computational time is not something we want right?

For the sake of illustration, let’s assume that x is a positive scalar random variable (see point 8b below for a more general discussion). We could code a Metropolis random walk sampler with Gaussian innovations as

x'=x+\delta \cdot \xi

where the “step length” \delta>0 is chosen by the experimenter and \xi\sim \mathcal{N}(0,1) is sampled from a standard Gaussian. Therefore we have that q(x'|x)= \mathcal{N}(x,\delta^2). Clearly x' will certainly be rejected whenever x'\leq 0 since we are assuming that the support of the targeted \pi(x) is positive.

A simple workaround is to propose \log x' instead of x', that is code

\log x'=\log x+\tilde{\delta}\cdot \xi

for some other appropriate \tilde{\delta}>0. Then we evaluate \pi(x') as \pi(\exp(\log x')), that is we exponentiate the proposal \log x' before “submitting” it to \pi(\cdot). Now if this is all we do, we have a problem.

Thing is we proposed from a Gaussian, and then we thought that exponentiating the proposal is just fine. It is almost fine, but we must realise that the exponential of a Gaussian draw is log-normal distributed and we must account for this fact in the acceptance ratio! In other words we sampled \log x' using a proposal function that here I denote q_1(\cdot) (which is Gaussian) and then by exponentiating we effectively sampled x' from another proposal q_2(\cdot) (which is log-normal).

Therefore, from the definition of log-normal density \mathcal{LN} we have (recall in this example x is scalar)

q_2(x'|x)=\frac{1}{x'{\delta}(2\pi)^{1/2}}\exp(-\frac{1}{2{\delta}^2}(\log x'-\log x)^2)

or equivalently x'\sim \mathcal{LN}(\log x,\delta^2).

Then we notice that this proposal function is not symmetric (see point 7) and \frac{q_2(x|x')}{q_2(x'|x)} = \frac{x'}{x}. Hence the correct sampler does the following: simulate a u\sim U(0,1) and using all the recommendations above we

\text{accept } x' \text{ if } \log u< (\log\pi(x')-\log \pi(x))+(\log x'-\log x)

and reject x' otherwise.

But how would we proceed if we were not given the information that the exponential of a Gaussian variable is a log-normal variable? A simple result on the transformation of random variables will assist us, when the transformation g(\cdot) is an invertible function. Say that for a generic x\sim \mathcal{N}(\mu_x,\sigma_x) we seek for the distribution of y=g(x)=\exp(x), and using the transformation rule linked above we have that y has density function p(y)=\frac{1}{\sigma_x(2\pi)^{1/2}}\exp(-\frac{1}{2\sigma_x^2}(\log y-\mu_x)^2)\times \frac{1}{y}. Notice here 1/y is the “Jacobian of the transformation”.

Therefore, we should not forget the “Jacobians” when transforming the proposed draw. In our example above y is x' and the ratio \frac{x'}{x} in the acceptance probability is the ratio of Jacobians.

8b- Recall the tranformation Jacobian! Part 2

The discussion in 8a above treats the specific case of using a random walk proposal to target a distribution with positive support. How about different type of constraints? My colleague at Lund University Johan Lindström has written a compact and useful reference table, listing common transformations of constrained proposals generated via random walk.

See also an excellent and detailed post by Darren Wilkinson.

9- Check the starting value of \pi(x)

Generally it is not really necessary to place checks for infelicities produced by the MCMC sampler in your code. That is, it is not necessary for each proposed x' to check for complex values or check that \pi(x') does not result into a \texttt{NaN} (not-a-number) or that you do not get, say, \pi(x')=-\infty (yes these things can happen if you code \pi(x) yourself without caution  or when this is the result of some approximation, e.g. the expression for \pi(x) is not analytically known).

In fact in these cases the uniform draw  u in point 1 above will not be considered as smaller than \alpha. So the comparison u < \alpha will evaluate to a logical \texttt{false} statement and the proposed x' will be rejected, as it should be.

However an exception to the above is when we evaluate \pi(x_0) at the starting value x_0. In this case we do want to check that \pi(x_0) is not something unwanted. For example, say that \pi(x_0)=+\infty (or -\infty) and we just put this value at the denominator of the ratio in the definition of \alpha (since this is the starting value it automatically goes into the denominator). Then at the first proposed x' we have (assume \pi(x') is finite)

\displaystyle   \alpha=\min\biggl\{1,\frac{\pi(x')}{+\infty}\frac{q(x|x')}{q(x'|x)}\biggr\} =0 \qquad (\text{and the same for } -\infty \text{ at the denominator}).

Clearly no x' will ever be accepted, therefore the value of \pi(\cdot) at the denominator will stay \infty for all iterations and the sampler will never move away from x_0.

Posted in Uncategorized | Tagged , , , , , , , , , | 1 Comment

Bayes Nordics, an email list

This is an interlude to the series of posts on inference for state space models, which I will restart soon.

I have created an email-list called Bayes Nordics. The goal is to disseminate news on events related to Bayesian analysis, in the Nordic countries.

Workshops, conferences, job openings, courses, even local seminars and meetups. Any Bayes-event happening in the Nordics is welcome.

It is up to the users to provide the content.

Basically, it should become something analogous to what’s being done in the excellent Allstat list (though at a way smaller scale). The intention is to keep the “noise” low. So Bayes Nordics is not a “questions & answers” forum; there are already many good places out there to post questions.

Why Bayes Nordics

I think we all experienced the frustration of missing an excellent seminar happening in a neighbour University department, which we could have attended if we only knew. Or missed registering for an introductory course in Bayesian statistics just because, well, it’s not always easy to find these information. Many events are simply not widely advertised. Say you don’t use meetup (meetup.com), so you don’t know that there is an afterwork meetup where someone talks about probabilistic machine learning and variational Bayes.

You get the point, the list of examples could go on and on.

So, if you get to know of some interesting event/opportunity, just post it by email at Bayes Nordics. Registration is free and the platform is ads-free (as long as Google keeps it this way).

Posted in Uncategorized | Tagged , , , , , , , , , , , , , , , , , , , , , | Leave a comment

Sequential Monte Carlo and the bootstrap filter

In the previous post I have outlined three possibilities for approximating the likelihood function of a state space model (SSM). That post is a required read to follow the topics treated here. I concluded with sequential importance sampling (SIS), which provides a convenient online strategy. However, in some cases SIS suffers from some numerical problems that I will address here. The key concept is the introduction of a resampling step. Then I introduce the simplest example of a sequential Monte Carlo (SMC) algorithm, the bootstrap filter. Recommendations on how to make best use of resampling are discussed.

Sequential Importance Sampling with Resampling

An important result derived in the previous post is the sequential update of the importance weights. That is, by denoting with w_t^i the importance weight for particle x_t^i we have

w_t^i\propto \frac{p(y_t|x_t^i)p(x_t^i|x_{t-1}^i)}{h(x_{t}^i|x_{0:t-1}^i,y_{1:t})} w_{t-1}^i \qquad i=1,...,N.

The weights update allows for an online approximation of the likelihood function of parameters \theta, that we can write as

\begin{aligned}  \hat{p}(y_t|y_{1:t-1};\theta)&=\frac{1}{N}\sum_{i=1}^N w_t^i\\  \hat{p}(y_{1:T}|\theta)&= \prod_{t=1}^T\biggl (\frac{1}{N}\sum_{i=1}^N w_t^i\biggr ).  \end{aligned}

At this point, an approximate maximum likelihood estimate can be obtained by searching for the following (typically using a numerical optimizer)

\hat{\theta} = \arg\max_\theta \hat{p}(y_{1:T}|\theta)

or \hat{p}(y_{1:T}|\theta) could be plugged into a Bayesian procedure for sampling from the posterior distribution \pi(\theta|y_{1:T}) (more on this in future posts).

Actually, while the procedure above can be successful, in general depending on the type of data and the adequacy of the model we employ to “fit” the data, some crucial problems can occur.

Particle degeneracy: the convenient representation of w_t^i as a function of w_{t-1}^i is problematic when particle x_{t-1}^i produces a very small weight w_{t-1}^i. With “very small” I mean that  its numeric value as returned by a computer (floating point representation) is zero, even though its mathematical value is actually positive. This is because computers have a limited ability to represent very small and very large values: if a value is “too small” it will be represented as exactly zero on a computer, and if it is “too large” it will be represented as infinity. The former case, when w_t^i is too small, is said to produce an underflow while a too large weight might produce an overflow. We focus on the underflow, as it is more common. Consider for example the plot below, where at time 44 there is a clear outlier, that is an observation with an unusual value compared to the rest of the data.

tsplot_weight_with_outlier_single_y_simple_4col-xml_graph_cmd1o1

This outlier could have been produced by an unusual statistical fluctuation, and at time t=44 the datum y_{t} might be a realization from a tail of the distribution with  density p(y_{t}|x_{t}). While this is legitimate, numerically it can create issues for the stability of our calculations. Say that p(y_{t}|x_{t})=N(y_t;x_t,0.5^2), that is a Gaussian distribution with mean x_t and standard deviation 0.5. Let’s see what happens if I anticipate some topic treated later on.

Say that I choose as importance density the transition density of the latent process, that is I set h(x_{t}^i|x_{0:t-1}^i,y_{1:t}) \equiv p(x_{t}^i|x_{t-1}^i), then the weight becomes

w_t^i\propto N(y_t;x_t^i,0.5^2)w_{t-1}^i \qquad i=1,...,N

with some abuse of notation. Now, for the given choice of proposal density (and if I have been employing a model appropriate for the data) then at the previous time instant 43, most particles will have values somewhere around  30 (see the y-axis in the figure), and we could expect that if x_{43}^i\approx 30 then most of the particles x_{44}^i generated from p(x_{44}^i|x_{43}^i) should take values not too far from 30. And here is a problem: from the figure we have y_{44}\approx 4 and the computer implementation for the density function of N(y_{44};x_{44}^i,0.5^2) might underflow and return zero for many particles (I will describe some tricks mitigating this problem in a future post). Clearly, all those particles having zero weight doom the values of the descendant particles, because each weight depends on the previous one. A similar scenario could happen even without outliers, if we simulate from a model not appropriate for the data. Clearly, as the time horizon T increases, it is not unlikely to incur in such a problem.

The phenomenon were most particles have very small weights is called particle degeneracy and for many years has prevented these Monte Carlo strategies from being effective. When particle degeneracy occurs the likelihood approximation is poor with a large variance, because the numerical calculation of the underlying integrals is essentially performed by the few particles with non-zero weight.

The striking idea that changed everything is the introduction of a resampling step, whose importance in sequential Monte Carlo was first studied in Gordon et al (1993), based on ideas from Rubin (1987). The idea is simple and ingenious and has revolutionised the practical application of sequential methods. The resulting algorithm is the “sequential importance sampling with resampling” (SISR) but we will just call it a sequential Monte Carlo algorithm.

Resampling

The resampling idea is to get rid in a principled way of the particles with small weight and multiply the particles with large weight. Recall that generating particles is about exploring regions of the space where the integral has most of its mass. Therefore, we want to focus the computational effort on the “promising” parts of the space. This is easily accomplished if we “propagate forward” from the promising particles, which are those with a non-negligible weight w_t^i. We proceed as follows:

  • Normalize the weights to sum to 1, that is compute \tilde{w}_t^i=w_t^i/\sum_{i=1}^N w_t^i.
  • interpret \tilde{w}_t^i as the probability associated to x_t^i in the weighted set \{x_t^i,\tilde{w}_t^i,i=1,...,N\}. If this can help, imagine the particles to be balls contained in an urn: some balls are large (large \tilde{w}_t^i) others are small.
  • Resampling: sample N times with replacement from the weighted set, to generate a new sample of N particles. This means that we put a hand in the urn, extract a ball and record its index i then put the ball back in the urn and repeat the extraction and recording for N times (more on how to implement resampling at the end of this post). Clearly it is more likely to extract large balls than small ones.
  • Replace the old particles with the new ones \{\tilde{x}_t^1,...,\tilde{x}_t^N\}. Basically, we empty the urn then fill it up again with copies of the balls having the recorded indices. Say that we have extracted index i=3 five times, we put in the urn five copies of the ball with index i=3.
  • Reset all the unnormalised weights to w_t^i=1/N (the resampling has destroyed the information on “how” we reached time t).

Since resampling is done with replacement, a particle with a large weight is likely to be drawn multiple times. Particles with very small weights are not likely to be drawn at all. Nice!

The reason why performing a resampling step is not only a numerically convenient trick to overcome (sometimes!) particle degeneracy, but is also probabilistically allowed (i.e. it preserves our goal to obtain an approximation targeting p(y_t|y_{1:t-1};\theta)) is illustrated in a great post by Darren Wilkinson.

Forward propagation

We now move the particles to the next time point t+1, that is we propagate forward the state of the system by simulating particles x_{t+1}^i. And how do we perform this step? We take advantage of the important particles we have just resampled, using the importance density to compute the move  \tilde{x}^i_t \rightarrow x_{t+1}^i.
The important fact is that we often propagate from important particles, since these are appearing several times in the urn, because of the resampling step with replacement. Therefore several of the propagated particles x_{t+1}^i might originate from a common “parent” \tilde{x}_t^i. For illustration see the picture below.
particlefilter

Read the picture from top to bottom: we start with particles having all the same size, which means they have equal weight w_t^i=1/N. Particles are then evaluated on the density p(y_t|x_t^i) depicted with a curve. The particles weight is computed and you can see that some particles have a larger size (large weight), others are smaller and some have “disappeared”, which means that their weight is negligible. At this point the particles are resampled: notice at the resampling stage the largest particle in the figure happens to be resampled three times while others fewer times. Once resampling is performed, all the resampled particles get the same size because all weights are reset to 1/N as described above. Now it is time to propagate the resampled particles forward to time t+1: we create a new generation of particles by moving forward only the ones we have resampled. This means that some of the particles having very small weight at time t will not be propagated to t+1  (notice in the figure, some particles do not have arrows departing from them), and the one that has been resampled three times generates three new particles. This implies that at each generation we still have N particles at our disposal, even though some from the previous generation have “died”. I illustrate the use of the resampling step in the bootstrap filter.

The bootstrap filter

The bootstrap filter is the simplest example of a sequential importance sampling with resampling (SISR) algorithm. It is the “simplest” application of SISR because it assumes h(x_{t}^i|x_{0:t-1}^i,y_{1:t}) \equiv p(x_{t}^i|x_{t-1}^i), that is the law of the Markov process is used as importance sampler to propagate particles forward. This implies the already mentioned simplification for the computation of the weights, and we have

w_t^i= \frac{1}{N} p(y_t|x_t^i;\theta) \qquad i=1,...,N.

Notice that in the equation above we have an equality, instead of a proportionality, since after resampling we set weights to be all equal to 1/N, hence after resampling w_{t-1}^i=1/N.

This approach is certainly very practical and appealing, but it comes at a cost. Generating particles from p(x_{t}^i|x_{t-1}^i) means that these are “blind” to data, since this importance density is unconditional to y_{1:t}. Hence the propagation step does not take advantage of any information carried by the data. In some scenario this produces an inefficient sampling that may result, again, in particle degeneracy. A popular alternative is the auxiliary particle filter. I do not go further into the possible improvements over the bootstrap filter however some literature is given at the end of this post.

So here is the bootstrap filter in detail.

  1. At t=0 (initialize) x^i_0\sim p(x_0) and assign \tilde{w}_0^i=1/N, for all i=1,...,N.
  2. At the current t assume to have the weighted particles \{x_t^i,\tilde{w}_t^i\}.
  3. From the current sample of particles, resample with replacement N times to obtain \{\tilde{x}_t^i,i=1,...,N\}.
  4. Propagate forward x_{t+1}^i\sim p(x_{t+1}|\tilde{x}_{t}^i), for all i=1,...,N.
  5. Compute w_{t+1}^i=p(y_{t+1}|x_{t+1}^i) and normalise weights \tilde{w}_{t+1}^i={w}_{t+1}^i/\sum_{i=1}^N {w}_{t+1}^i.
  6. Set t:=t+1 and if t<T go to step 2 otherwise go to 7.
  7. Return \hat{p}(y_{1:T}|\theta)= \prod_{t=1}^T\biggl (\frac{1}{N}\sum_{i=1}^N w_t^i\biggr ).

Recall that p(x_0) is the unconditional density of an arbitrary initial state x_0; this density is set by the modeller (alternatively, x_0 can be fixed deterministically and then all particles will propagate from a common x_0). Notice in step 5 I wrote w_{t+1}^i=p(y_{t+1}|x_{t+1}^i) instead of w_{t+1}^i=p(y_{t+1}|x_{t+1}^i)/N. The two formulations are completely equivalent as they only differ by a constant which is irrelevant for the purpose of assigning a weight to a particle. Also, since weights are going to be normalised, the 1/N is not really necessary. However if for some reason it is relevant to have a pointwise estimate of the likelihood (as opposed to e.g. optimizing it over \theta), then it is important to recover the constant, and write \hat{p}(y_t|y_{1:t-1};\theta)=\frac{1}{N}\sum_{i=1}^N w_t^i. In step 7 I have explicitly reported the likelihood approximation, even though for parameter inference the product of the normalizing constants 1/N^T can be disregarded.

The seven steps above are the simplest version of a bootstrap filter, where the resampling is performed at every t. However, the resampling adds unwanted variability to the estimate of the likelihood function. This extra variability is not really welcome as it makes the estimate more imprecise (and can affect the performance of the pseudo-marginal algorithms I will describe in future posts).

A standard way to proceed is to resample only when necessary, as given by a measure of potential degeneracy of the sequential Monte Carlo approximation, such as the effective sample size (Liu and Chen, 1998). The effective sample size takes values between 1 and N and at time t is approximated via ESS = 1/\sum_{i=1}^N (\tilde{w}_t^i)^2. If the degeneracy at t is too high, i.e. the ESS is below a specified threshold (say below N/2) then resampling is performed, otherwise no resampling is performed at time t.

Finally notice that SISR (and the bootstrap filter) returns an unbiased estimate of the likelihood function. This is completely unintuitive and not trivial to prove. I will go back to this point in a future post.

Implement resampling

Coding your own version of a resampling scheme should not be necessary: popular statistical software will probably have built-in functions implementing several resampling algorithms. To my knowledge, the four most popular resampling schemes are: residual resampling, stratified resampling, systematic resampling and multinomial resampling. I mentioned above that resampling adds additional unwanted variability to the likelihood approximation. Moreover, different schemes produce different variability. Multinomial resampling is the one that gives the worse performance in terms of added variance, while residual, stratified and systematic resampling are about equivalent, though systematic resampling is often preferred because easier to implement and fast. See Douc et al. 2005 and Hol et al. 2006.

Summary

I have addressed the problem known as “particles degeneracy” affecting sequential importance sampling algorithms. I have introduced the concept of resampling, and when to perform said resampling. This produces a sequential importance sampling resampling (SISR) algorithm. Then I have introduced the simplest example of SISR, the bootstrap filter. Finally, I have briefly mentioned some results pertaining resampling schemes.

We now have a partial (though useful starting point for further self study) introduction to particle filters / sequential Monte Carlo methods for approximating the likelihood function of a state space model. We are now ready to consider Bayesian parameter inference, including practical examples.

Further reading

Posted in Uncategorized | Tagged , , , , , , , , , | Leave a comment

Monte Carlo sampling for likelihood approximation in state space models

In a previous post I have set the problem of estimating parameters for a state-space model (SSM). That post is a required read, also because I set some notation. My final goal is to show how to construct exact Bayesian inference  for the vector of parameters \theta. To achieve this task, I need to introduce special Monte Carlo methods for multi-dimensional integration. This post rapidly screens some notions of Monte Carlo, importance and sequential importance sampling methods for SSM.

We have already learned that SSM have an intractable likelihood. This means that the analytic expression of the likelihood function for the vector of parameters \theta is not known. We can also say that a likelihood function is intractable when it is difficult to approximate, though this concept is kind of vague. What is considered to be “difficult” is relative: let’s say that the integrals involved in said likelihood cannot be solved using standard numerical methods such as quadrature.

As a reminder, for given data y_{1:T} we want to approximate the likelihood function

p(y_{1:T}|\theta)=p(y_1|\theta)\prod_{t=2}^T p(y_t|y_{1:t-1};\theta)

and I have shown that it is possible to write

\begin{aligned}  p(y_{1:T}|\theta)&=\int p(y_{1:T},x_{0:T}|\theta)dx_{0:T} = \int p(y_{1:T}|x_{0:T},\theta)\times p(x_{0:T}|\theta)dx_{0:T}\\  &=\int \prod_{t=1}^T p(y_{t}|x_{t},\theta)\times \biggl\{p(x_0|\theta)\prod_{t=1}^T p(x_{t}|x_{t-1},\theta)\biggr\} dx_{0:T}.  \end{aligned}

So the question is how to find an approximation for this T+1-dimensional integral. We can approach the problem from different angles, all interconnected.

Standard Monte Carlo integration

We can write p(y_{t}|y_{1:t-1},\theta) as

p(y_{t}|y_{1:t-1},\theta) = \int p(y_t|x_t,\theta)p(x_t|y_{1:t-1},\theta)dx_t = \mathbb{E}(p(y_t|x_t,\theta)).

That is the integration problem can be interpreted as taking the expectation of the conditional density p(y_t|x_t,\theta)  with respect to the law p(x_t|y_{1:t-1},\theta). This means writing \mathbb{E}(p(y_t|x_t,\theta))\equiv \mathbb{E}_{Y_t|X_t}(p(y_t|x_t,\theta)). The interpretation of an integral as a probabilistic expectation is at the core of Monte Carlo methods.

It is natural to approximate an expectation using an empirical mean, as follows: using a computer program, generate independently N draws from p(x_t|y_{1:t-1},\theta), then invoke the law of large numbers.

  1. Produce N independent draws x_t^i\sim p(x_t|y_{1:t-1},\theta), \quad i=1,...,N.
  2. For each x_t^i compute p(y_t|x_t^i,\theta). Notice the conditioning on the sampled x_t^i.
  3. By the law of large numbers, we have \frac{1}{N}\sum_{i=1}^N p(y_t|x_t^i,\theta)\approx \mathbb{E}(p(y_t|x_t,\theta)) and the approximation improves for increasing N. In fact we can write \frac{1}{N}\sum_{i=1}^N p(y_t|x_t^i,\theta)\rightarrow \mathbb{E}(p(y_t|x_t,\theta)), \qquad N\rightarrow\infty .

The error term for the approximation is O(N^{-1/2}) regardless the dimension of x_t, which is remarkable. Notice, the convergence property 3 implies that the Monte Carlo estimate is consistent. Another important property, that will be discussed in a  future post, is that the estimate is also unbiased.

The issue here is, how to generate “good” draws (hereafter particles) x_t^i\sim p(x_t|y_{1:t-1},\theta)? Here “good” means that we want particles such that the values of p(y_t|x_t^i,\theta) are not negligible. Since these particles are the points where the integrand function p(y_t|x_t;\theta) gets evaluated, we want those particles that are “important” for the evaluation of the integral, and we wish most of the \{x_t^i\}_{i=1:N} to end up in a region of high probability for p(y_t|x_t;\theta).

It turns out that for SSM sequential Monte Carlo (SMC, or “particle filters”) is the winning strategy.

I will not give a thorough introduction to SMC methods. I will only consider a few notions useful to solve our parameter inference problem. I first consider importance sampling as a useful introduction to SMC.

Importance Sampling

For ease of reading I remove the dependence of quantities on \theta. This is a static parameter (i.e. it does not vary with time), and we can consider as implicit the dependence of all probability densities on \theta.

Consider the following:

\begin{aligned}  p(y_t|y_{1:t-1})&=\int p(y_t|x_{0:t})p(x_{0:t}|y_{1:t-1})dx_{0:t}  =\int p(y_t|x_{0:t})p(x_{0:t-1},x_t|y_{1:t-1})dx_{0:t}\\  &=\int p(y_t|x_{0:t})p(x_t|x_{0:t-1},y_{1:t-1})p(x_{0:t-1}|y_{1:t-1})dx_{0:t} = \int p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})dx_{0:t}\\  &= \int \frac{p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})}{h(x_{0:t}|y_{1:t})}h(x_{0:t}|y_{1:t})dx_{0:t}  \end{aligned}

where h(x_{0:t}|y_{1:t}) is an arbitrary density function called “importance density” that is non-zero whenever p(x_{0:t}|y_{1:t}) is non-zero (the support of h(x_{0:t}|y_{1:t}) needs to be greater or equal to the support of p(x_{0:t}|y_{1:t})). The purpose of introducing the importance density is to use it as a sampler of random numbers, assuming we don’t know how to sample from the p(x_{0:t-1}|y_{1:t-1}) appearing in the penultimate equality. Therefore we should choose a h(\cdot) easy to simulate from.
Notice the two crucial simplifications occurring in the fourth equality: we have p(x_t|x_{0:t-1},y_{1:t-1})\equiv p(x_t|x_{t-1}) and p(y_t|x_{0:t})\equiv p(y_t|x_t). The first result is from the Markov property defined in the previous post, and the second one states the independence of y_t from the “history of \{X_s\} prior to time twhen conditioning on x_t (see the graph in the previous post). Conditional independence was also stated in the previous post.

In conclusion, the importance sampling approach approximates the integral using the following Monte Carlo procedure:

  • simulate N samples \quad x_{0:t}^i\sim h(x_{0:t}|y_{1:t}) independently, \qquad i=1,...,N.
  • Construct “importance weights” w_t^i=\frac{ p(y_t|x_t^i)p(x_t^i|x_{t-1}^i)p(x_{0:t-1}^i|y_{1:t-1})}{h(x_{0:t}^i|y_{1:t})}.
  • p(y_t|y_{1:t-1})=\mathbb{E}(p(y_t|x_t))\approx \frac{1}{N}\sum_{i=1}^N w_t^i.

Notice that here we wish to simulate N sequences x_{0:t}^i with x_{0:t}^i=\{x_0^i,...,x_t^i\}. Clearly, generating at each time t a “cloud of particles” \{x_{0:t}^i\}_{i=1:N} spanning the entire interval [0,t] is not really computationally appealing, and it is not even clear how to construct the importance density. We need some strategy enabling a sort of sequential mechanism. Moreover, even if we are able to simulate the particles, are we able to evaluate all the densities appearing in w_t^i? For some complex models we might not know the analytic expressions for p(x_{0:t-1}|y_{1:t-1}) and p(x_t|x_{t-1}). This is addressed in next post.

Sequential importance sampling

When the importance density h(\cdot) is chosen intelligently, an important property is the one that allows a sequential update of the importance weights. That is, for an appropriate h(\cdot) we can write

w_t^i \propto \frac{ p(y_t|x_t^i)p(x_t^i|x_{t-1}^i)}{h(x_{t}^i|x_{0:t-1}^i,y_{1:t})}w_{t-1}^i

and this sequential update is a key result to make the computation of the (T+1)-dimensional integral bearable. Here I am going to show how to obtain the weights update above.

First of all, consider that  it is up to the analyst to design a density h(\cdot) that is appropriate for the model at hand (easy to evaluate, easy to sample from, generating “important” particles). For example, let’s write an importance density as the product of two densities h_1(\cdot) and h_2(\cdot):

h(x_{0:t}|y_{1:t})=h_1(x_{t}|x_{0:t-1},y_{1:t})h_2(x_{0:t-1}|y_{1:t-1}).

Notice that while h_1 depends on measurements up to time t, instead h_2 depends on measurements up to t-1. This is because we have freedom in designing h(\cdot), as h(\cdot) is not subject to the constraints imposed on the state space model (Markovianity, conditional independence). For example h_1(x_{t}|x_{0:t-1},y_{1:t}) does not have to result in h_1(x_{t}|x_{t-1}), though this choice is allowed. For simplicity, in the following I remove subscripts for h_1 and h_2 and just write h(x_{0:t}|y_{1:t})=h(x_{t}|x_{0:t-1},y_{1:t})h(x_{0:t-1}|y_{1:t-1}).

I use the decomposition for the importance density to show the sequential update of the importance weights w_t^i, though for ease of writing I remove the superscript i. We have:

w_t=\frac{ p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})}{h(x_{0:t}|y_{1:t})}  =\frac{ p(y_t|x_t)p(x_t|x_{t-1})p(x_{0:t-1}|y_{1:t-1})}{h(x_{t}|x_{0:t-1},y_{1:t})h(x_{0:t-1}|y_{1:t-1})}.

Now, note that we can use the Bayes theorem to write the following

\begin{aligned} p(x_{0:t}|y_{1:t}) &= \frac{p(x_{0:t},y_{1:t})}{p(y_{1:t})}\propto p(y_{1:t-1},y_t|x_{0:t})p(x_{0:t})=p(y_t|x_{0:t},y_{1:t-1})p(x_{0:t}|y_{1:t-1})\\  &=p(y_t|x_{t})p(x_{0:t}|y_{1:t-1})=p(y_t|x_{t})p(x_{t}|x_{0:t-1},y_{1:t-1})p(x_{0:t-1}|y_{1:t-1})\\  &=p(y_t|x_{t})p(x_{t}|x_{t-1})p(x_{0:t-1}|y_{1:t-1})  \end{aligned}

and in the derivation I have used both the conditional independence of observations and the Markovianity of the latent state (see the previous post).

We can use this derivation to write p(x_{0:t-1}|y_{1:t-1}), so we can express the weights as

w_t\propto \frac{ p(y_t|x_t)p(x_t|x_{t-1})p(y_{t-1}|x_{t-1})p(x_{t-1}|x_{t-2})p(x_{0:t-2}|y_{1:t-2})}{h(x_{t}|x_{0:t-1},y_{1:t})h(x_{0:t-1}|y_{1:t-1})}=\frac{p(y_t|x_t)p(x_t|x_{t-1})}{h(x_{t}|x_{0:t-1},y_{1:t})}w_{t-1}

which is the sequential update of the importance weights we wanted to prove. The sequential update is convenient, as when the time index t advances in the simulation we do not need to produce particles always starting at time 0, instead we can perform computations online, by only retaining the weights computed at time t-1.

Notice I used the proportionality symbol \propto as we do not need to carry the constant term p(y_{1:t-1}) resulting from the denominator of the Bayes theorem. This term is independent of \theta, hence is not going to be relevant for parameters optimization (maximum likelihood approach) nor in Bayesian sampling from the posterior distribution of \theta.

Once particles are simulated, we have  the following approximations (notation-wise, I now recover the parameter \theta I previously removed):

\begin{aligned}  \hat{p}(y_t|y_{1:t-1};\theta)&=\frac{1}{N}\sum_{i=1}^N w_t^i\\  \hat{p}(y_{1:T}|\theta)&= \prod_{t=1}^T\biggl (\frac{1}{N}\sum_{i=1}^N w_t^i\biggr ).  \end{aligned}

Adding back \theta in the notation is important not to make confusion between the likelihood function {p}(y_{1:T}|\theta) and the “evidence” p(y_{1:T}) ,where the latter is really independent of \theta (it’s the denominator of the Bayes theorem).

In conclusion, I have outlined some approaches to approximate the likelihood function, by assuming the ability to sample from some importance density h(\cdot). However, I have skipped discussing how to construct such density (indeed, this is a major and problem dependent issue), and a simple possibility is covered in the next post.

It is fair to say that, even though we managed to derive the relations above with success, in practice the computation of the required quantities does not always end up with a good likelihood approximation. This is investigated in the next post, together with strategies to overcome numerical issues.

Summary

I have started a quick excursus into Monte Carlo methods for the approximation of the likelihood function for state space models (SSM). I covered a naïve Monte Carlo approach, then importance sampling and finally sequential importance sampling. Each of these topics has been investigated at great length in available literature (there would be so much more to consider, for example quasi-Monte Carlo). However, my focus is to give an idea of possibilities, while quickly moving forward towards introducing the simplest sequential Monte Carlo strategy, which is based on sequential importance sampling. Once sequential Monte Carlo is introduced, I will move to Bayesian parameter estimation for SSM.

Further reading

Possibilities for self study are endless. Here are a couple of excellent and accessible resources:

If you have access to the titles below it’s a plus:

 

 

Posted in Uncategorized | Tagged , , , , , , | 2 Comments

State space models and intractable likelihoods

In this first series of posts, I introduce important tools to construct inference methods for the estimation of parameters \theta in stochastic models. Stochastic models are characterized by randomness in their mathematical nature, and since at first I focus on models having dynamic features, these models are defined by stochastic processes.

I will start by introducing a class of dynamic models known as state space models.

For general (non-linear, non-Gaussian) state space models it is only relatively recently that a class of algorithms for exact parameter inference has been devised, in the Bayesian framework.  In a series of 4-5 posts I will construct the simplest example of this class of pseudo-marginal algorithms, now considered the state-of-art tool for parameter estimation in nonlinear state space models. Pseudo-marginal methods are not exclusively targeting state space models, but are able to produce exact Bayesian inference whenever a positive and unbiased approximation of the likelihood function is available, no matter the underlying model.

I will first define a state space model, then introduce its likelihood function, which turns out to be intractable. I postpone to the next post the construction of Monte Carlo methods for approximating the likelihood function.

State space models

A very important class of models for engineering applications, signal processing, biomathematics, systems biology, ecology etc., is the class of state-space models (SSM). [In some literature the terms SSM and hidden Markov model (HMM) have been used interchangeably, though some sources make the explicit distinction that in HMM states are defined over a discrete space while in SSM states vary over a continuous space.]

Suppose we are interested in modelling a system represented by a (possibly multidimensional) continuous-time stochastic process \{X_t\}_{t\geq 0}, where X_t denotes the state of the system at a time t\geq 0. The notation \{X_t\}_{t\geq 0} denotes the ensemble of possible values taken by the system for a continuous time t\geq 0.

However, in many experimental situations the experimenter does not have access to  measurements from  \{X_t\}_{t\geq 0} but rather to noisy versions corrupted with “measurement error”. In other words the true state of the system is unknown, because  \{X_t\}_{t\geq 0} is latent (unobservable), and we can only get to know something about the system via some noisy measurements. I denote the available measurements (data) with y_1,y_2,...,y_T and use  \{Y_t\}_{t\geq 1} to denote the process producing the actual observations at T discrete time points. For simplicity of notation I assume that measurements are obtained at integer observational times \{1,2...,T\}. Each y_t can be multidimensional (t=1,...,T) but it does not need to have the same dimension of the corresponding X_t, for example some coordinate of  \{X_t\}_{t\geq 0} might be unobserved. Therefore, the only available information we get from our system is rather partial: (i) the system of interest is continuous in time but measurements are obtained at discrete times and (ii) measurements do not reflect the true state of the system  \{X_t\}_{t\geq 0}, because the y_1,...,y_T are affected with some measurement noise. For example we could have y_t=X_t + \varepsilon_t, with \varepsilon_t some random noise.

In general \{X_t\}_{t\geq 0} and \{Y_t\}_{t\geq 1}  can be either continuous– or discrete–valued stochastic processes. However in the following I assume both processes to be defined on continuous spaces.

I use the notation z_{1:T} to denote a sequence \{z_1,...,z_T\}. Therefore, data can be written y_{1:T}. For the continuous time process \{X_t\}_{t\geq 0} I use X_{1:T}=\{X_1,...,X_T\}  to denote realizations of the process at times \{1,2...,T\}. Clearly, none of the X_t values is known.

Assume that the dynamics of  \{X_t\}_{t\geq 0}  are parametrized by a model having a (vector) parameter \theta. The value of \theta is unknown and our goal is to learn something about \theta using available data. That is to say, we wish to produce inference about \theta. I could write  \{X_t(\theta)\}_{t\geq 0}  in place of  \{X_t\}_{t\geq 0}, but this makes the notation a bit heavy.

State space models are characterized by two properties: Markovianity of the latent states and conditional independence of measurements.

Markovianity:  \{X_t\}_{t\geq 0} is assumed a Markov stochastic process, with transition density p(x_t|x_s), for s<t. That is, “given the present state, the past is independent of the future”, so if time s is the “present”, then p(x_t|\text{history of } \{X_u\}_{u\leq s}, y_{1:t-1})\equiv p(x_t|x_s). Also in this case, for simplicity we assume implicit the conditioning on \theta, instead of writing p(x_t|x_s;\theta). Specifically for our inference goals, we are interested in transitions between states corresponding to contiguous (integer) observational times, that is p(x_t|x_{t-1}). Also “the past is independent of the future, given the present”, meaning that p(x_{t-1}|x_{t:T},y_{t:T})=p(x_{t-1}|x_t).

Conditional independence: measurements are assumed conditionally independent given a realization of the corresponding latent state \{X_t\}. That is, p(y_t|x_{0:t},y_{1:t-1})=p(y_t|x_t).

Markovianity and conditional independence can be represented graphically:

fg_hmm

Notice each white node x_t is only able to directly influence the next white node x_{t+1} and y_t. Also, each grey node y_t is unable to influence other measurements directly. [This does not mean observations are independent: for example p(y_2|y_1)=\frac{1}{p(y_1)}\int p(y_1,y_2,x_{1:2})dx_{1:2 }=\frac{1}{p(y_1)}\int p(y_1,y_2|x_{1},x_2)p(x_{1},x_2)dx_{1:2 }=\frac{1}{p(y_1)}\int p(y_2|x_2)p(y_1|x_1)p(x_2|x_1)p(x_1)dx_{1:2} and evidently it results in p(y_2|y_1) \neq p(y_2). If equality did hold y_2 would be independent of y_1.]

Notice x_0 is the initial state of the system at some arbitrary time prior to the first observational time t=1. By convention we can set this arbitrary time to be t=0.

In summary, a compact, probabilistic representation of a state-space model is given by the conditional distribution of the observable variable at t given the latent state, and the distribution of the evolution of the (Markovian) state, that is the transition density. Optionally, the initial state x_0 can be a fixed deterministic value or have its own (unconditional) distribution p(x_0) which might depend or not on \theta.

\begin{aligned}  Y_t | X_t = x_t & \sim p(y_t | x_t;\theta) &\mbox{(Observation density)} \\  X_{t } | X_{t-1} = x_{t-1} & \sim p(x_{t } | x_{t-1};\theta) &\mbox{(Transition density)} \\  X_0 & \sim p(x_0) &\mbox{(Initial distribution)}.  \end{aligned}

Example: Gaussian random walk

A trivial example of a (linear, Gaussian) state space model is

\begin{aligned}  y_t &= x_{t}+e_{t}, \quad e_{t}\sim_{iid} N(0,r^2)\\  x_t &= x_{t-1}+\xi_{t}, \quad \xi_{t}{\sim}_{iid} N(0,q^2),\quad x_0=0.  \end{aligned}

with \theta=(q,r). Therefore

\begin{aligned}  p(y_t|x_{t})&=N(y_t;x_{t},r^2)\\  p(x_t|x_{t-1})&=N(x_t;x_{t-1},q^2).  \end{aligned}

Parameter inference and the likelihood function for SSMs

As anticipated, I intend to cover tools for statistical inference for the vector of parameters \theta, and in particular discuss Bayesian inference methods for SSM.

This requires introducing some quantities:

  • p(y_{1:T}|\theta) is the likelihood function of \theta based on measurements y_{1:T}.
  • In the Bayesian framework \theta is a random quantity and \pi(\theta) is its prior density (I always assume continuous-valued parameters). It encodes our knowledge about \theta before we “see” the current data y_{1:T}.
  • The Bayes theorem gives the posterior distribution of \theta, enclosing uncertainty about \theta for given data:
    \pi(\theta|y_{1:T})=\frac{p(y_{1:T}|\theta)\pi(\theta)}{p(y_{1:T})}.
  • p(y_{1:T}) is the marginal likelihood (evidence), given by p(y_{1:T})= \int p(y_{1:T}|\theta)\pi(\theta)d\theta.
  • inference based on \pi(\theta|y_{1:T}) is called Bayesian inference.

Goal: we wish to produce Bayesian inference on \theta. In principle this involves  writing down the analytic expression of \pi(\theta|y_{1:T}) and study its properties. However, for models of realistic complexity, what is typically performed is some kind of Monte Carlo sampling of pseudo-random draws from the posterior \pi(\theta|y_{1:T}). Then we can have a finite-samples approximation  of the marginal posteriors \pi_j(\theta|y_{1:T})\equiv \pi(\theta_j|y_{1:T}) (j=1,...,\dim(\theta)) compute the sample means of the marginals, quantiles etc. This way we perform uncertainty quantification for all components of \theta, for a given model and available data.

Now, the main problem preventing a straightforward sampling from the posterior, is that for nonlinear, non-Gaussian SSM the likelihood function is not available in closed form nor it is possible to derive a closed-form approximation. It is intractable. Let’s see why.

In a SSM data y_1,...,y_T are not independent, they are only conditionally independent. This means that we cannot write p(y_1,...,y_T|\theta) as a product of unconditional densities: instead we have

p(y_{1:T}|\theta)=p(y_1|\theta)\prod_{t=2}^Tp(y_{t}|y_{1:t-1},\theta)

with the convention p(y_1|y_{1:0},\theta)\equiv p(y_1|\theta).

The problem is that all densities in the expression above are unavailable in closed form, hence unknown. If these were available we could either use an algorithm to find a (local) maximizer \hat{\theta} to p(y_{1:T}|\theta) (the maximum likelihood estimate of \theta), or plug the likelihood into the posterior \pi(\theta|y_{1:T}) and perform Bayesian inference.

The reason for the unavailability of a closed-form expression for the likelihood is the latency of process  \{X_t\}_{t\geq 0}, on which data depend. In fact we have:

\begin{aligned}  p(y_{1:T}|\theta)&=\int p(y_{1:T},x_{0:T}|\theta)dx_{0:T} = \int p(y_{1:T}|x_{0:T},\theta)\times p(x_{0:T}|\theta)dx_{0:T}\\  &=\int \prod_{t=1}^T p(y_{t}|x_{t},\theta)\times \biggl\{p(x_0|\theta)\prod_{t=1}^T p(x_{t}|x_{t-1},\theta)\biggr\} dx_{0:T}.  \end{aligned}

The expression above is intractable for two reasons:

  • it is a (T+1)-dimensional integral, and
  • for most (nontrivial) models, the transition densities p(x_{t}|x_{t-1}) are unknown.

Basically the only way to solve the integral when T gets large is via Monte Carlo methods. A special case amenable for closed-form computation is the linear SSM with Gaussian noise (see the Gaussian random walk example): for this case the Kalman filter can be employed to return exact maximum likelihood inference. In the SSM literature important (Gaussian) approximations are given by the extended and unscented Kalman filters.

However, for nonlinear and non-Gaussian SSM, sequential Monte Carlo methods (or “particle filters”) are presently the state-of-art methodology. Sequential Monte Carlo (SMC) methods will be considered in a future post. SMC is a convenient tool to implement the pseudo-marginal methodology for exact Bayesian inference that I intend to outline in this first series of posts.

Summary

I have introduced minimal notions to set the inference problem for parameters of state space models (SSM). The final goal is to summarize a methodology for exact Bayesian inference, the pseudo-marginal method. This will be outlined in future posts, with focus on SSM. I have also stated some of the computational issues preventing a straightforward implementation of likelihood based methods for the parameters of SSM. In the next two posts I consider some Monte Carlo strategies for approximating the likelihood function.

Further reading

An excellent, accessible introduction to filtering and parameter estimation for state space models (and recommended for self study) is Simo Särkkä (2013) Bayesian Filtering and Smoothing, Cambridge University Press. The author kindly provides free access to a PDF version at his webpage. Companion software is available at the publisher’s page (see the Resources tab).

State space modelling is a fairly standard topic that can be found treated in most (all?) signal processing and filtering references, so it does not make sense to expand the list of references for this post. However, I will add more suggestions in future posts, in connection with the further topics I introduce.

Posted in Uncategorized | Tagged , , , , , | 2 Comments

Welcome!

Welcome to my first blog post! You can read about me in the About section.

I will write about statistical inference methods and algorithms, typically (though not exclusively) for models that have some dynamic component. Posts will reflect personal research interests, notably computer-intensive Monte Carlo methods for parameter inference, (approximate) Bayesian methods and likelihood-free methods such as ABC.

My goal is to offer a simplified (though not simplistic) entry to challenging inference methods for stochastic modelling. For some of my posts, the ideal target are postgraduate students in mathematical statistics, or more in general researchers with an adequate background in linear algebra and stochastic processes, with some ability to code.  I will keep the notation and detail of the exposition at a fairly accessible level, avoiding measure-theoretic constructs and emphasizing computational and applied aspects.

Posted in Uncategorized | Tagged , , , , , , , , , | Leave a comment