Mixed-effects models (MEM) are hierarchical models suited for “population inference”: instead of fitting data from a single experiment, we are interested in learning characteristics common to runs of the same experiment. As an example, we could have data from several subjects, and we wish to fit them all jointly rather than separately. This way we learn something at the “population” level, using a statistical model that explicitly accounts for variation across several streams of data. Mixed-effects models are particularly relevant for repeated-measurements data.

In MEM, individual experiments (e.g. “subjects”) are modeled by introducing a set of subject-specific parameters, which vary randomly between experiments according to a probability distribution depending on some unknown parameter. The latter parameter is common to all subjects. The subject-specific parameters are therefore “random effects”, and the shared parameter is a “population parameter”; both may be vectors. There could also be other unknown parameters which are not random effects.

So we have *mixed effects*: some parameters vary randomly between subjects, while others are common to all subjects.

Example:

where the response is the j-th measurement recorded on subject i, the input may be a set of corresponding fixed covariates or even an unobserved stochastic process, and some function links inputs and random effects to the response. Finally there is residual variation (measurement error), for example Gaussian error.
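To make the setup concrete, here is a minimal simulation sketch of such a hierarchical model. The exponential-decay response, the parameter names and all numeric values below are purely illustrative assumptions, not taken from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population parameters (names and values are assumptions):
theta, omega = 1.0, 0.3    # mean and std of the random-effects distribution
sigma = 0.1                # std of the Gaussian measurement error

n_subjects, n_obs = 5, 20
t = np.linspace(0.0, 4.0, n_obs)      # common observation times

# one random effect per subject, drawn from the population distribution
b = rng.normal(theta, omega, size=n_subjects)

# subject-specific mean response f(t, b) = exp(-b t), plus measurement error
y = np.exp(-b[:, None] * t) + rng.normal(0.0, sigma, size=(n_subjects, n_obs))

print(y.shape)  # (5, 20): one row of repeated measurements per subject
```

Fitting all five rows jointly, rather than subject by subject, is exactly the “population inference” described above.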

When the measurements result from discrete observations of a diffusion process, that is a continuous-time Markov process solving a subject-specific stochastic differential equation (SDE),

then we have obtained a *stochastic differential equation mixed-effects model* (SDEMEM). Some papers consider SDEMEMs written as (2) alone (no measurement error), while others consider the system (1)-(2).

This is a powerful class of models, since we can simultaneously consider three sources of variability: (i) variation between subjects: by estimating the parameter underlying the distribution of the individual random effects ; (ii) intrinsic individual stochastic variation, encoded within the “diffusion coefficient” ; (iii) measurement error variation (if (1) is considered). Therefore SDEMEMs enable us to learn characteristics common to all subjects (i.e. “population estimation”), while also taking into account individual systemic (intrinsic) variation and measurement error.

Now, inference for SDEMEMs is not trivial at all, essentially because inference for SDEs based on discrete observations can be quite tricky. However some literature is available, and the motivation for writing this post is to make researchers aware of a collection of resources for SDEMEMs I have set up.

Please contact me to help with the list, if you find some reference is missing.


I have decided to write about pseudo-marginal MCMC methods since the original literature introducing these methods can be fairly intimidating, though it need not be. I will show how it is very simple to prove the validity of the approach. Theoretical difficulties *may* arise depending on the type of pseudo-marginal method we choose to consider for a specific case study. However the basic idea is fairly simple to describe and I proceed at a slow, step-by-step pace. Other accessible introductions to the topic are in the Resources section at the end of this post.

Suppose we wish to use Markov chain Monte Carlo (MCMC) to sample from a distribution having a probability density (or probability mass) function that is not available in closed form, so that its pointwise evaluation is not possible. Suppose, however, that a non-negative and “unbiased” random estimator of this density is available (we define what we mean by an unbiased estimator further below). Then an intuitive thing to do is to use the estimator in place of the exact density inside a Metropolis-Hastings algorithm. This way the (approximate) Metropolis-Hastings ratio for a proposal is computed from estimated densities (see this introductory post to Metropolis-Hastings if necessary)

and the proposal is accepted with the corresponding probability. Although it is not evident yet, what we have just written is **an example of a pseudo-marginal method**, but I will clarify this point below. Therefore in practice applying the method is relatively simple.
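As a concrete sketch of the recipe just described (the toy target, the noisy log-density and all tuning values below are my own illustrative assumptions): we run a random-walk Metropolis sampler where the exact log-density is replaced by a noisy estimate, and the estimate is carried along with the accepted state.

```python
import numpy as np

def pseudo_marginal_mh(logdens_hat, x0, n_iter, step, rng):
    """Random-walk Metropolis where the target density is only available
    through a noisy estimator (unbiased on the natural, not log, scale)."""
    x, ld = x0, logdens_hat(x0, rng)
    draws = np.empty(n_iter)
    for r in range(n_iter):
        xp = x + step * rng.standard_normal()
        ldp = logdens_hat(xp, rng)            # fresh estimate at the proposal
        if np.log(rng.uniform()) < ldp - ld:  # ratio of *estimated* densities
            # keep the estimate together with the accepted state
            x, ld = xp, ldp
        draws[r] = x
    return draws

# toy example: N(0,1) target whose density is multiplied by mean-one
# log-normal noise, which keeps the estimator unbiased on the density scale
tau = 0.5
def logdens_hat(x, rng):
    return -0.5 * x**2 + tau * rng.standard_normal() - 0.5 * tau**2

draws = pseudo_marginal_mh(logdens_hat, 0.0, 20000, 1.0, np.random.default_rng(1))
```

Note the design choice: the estimate of the current state is stored and reused, never recomputed; re-estimating the current state at every iteration would give a different (and no longer exact) algorithm.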

The **astonishing result** provided by a pseudo-marginal method is that the Markov chain constructed as above (where proposals are accepted according to the approximate ratio instead of the true and unavailable acceptance ratio computed from the exact density) has the “exact” target as its stationary distribution. Therefore, we can use the approximate density inside an MCMC sampler, while still obtaining exact Monte Carlo sampling from the target.

In this post whenever we write “exact sampling” we mean “exact up to Monte Carlo error”, where the latter can be arbitrarily decreased by increasing the number of MCMC iterations.

We want to show that the Markov chain consisting of proposals accepted with the approximate acceptance probability has the desired stationary distribution, that is, asymptotically in the number of MCMC iterations the generated chain samples from the target.

First of all we need to obtain a non-negative and unbiased random estimator of the target density. We now take the opportunity to define what we mean by an “unbiased estimator” in this context.

An unbiased estimator should have expectation equal to the true density at every point. However, which probability measure do we consider for this expectation? Answering this question shows why the topic of this post is particularly relevant for sampling from *intractable distributions* or, in Bayesian inference, sampling from posterior distributions having *intractable likelihoods*.

Before trying to find unbiased estimators, let's first see what unbiasedness means when using Monte Carlo sampling.

It is often the case that while the target density is unknown in closed form, it is instead easy to write the joint distribution of the variable of interest and some other random quantity we are not really interested in (we call it a “disturbance parameter” or an “auxiliary variable”). Assume the domain of the marginalized variable to be continuous (without loss of generality); then marginalization implies integration and

where basically the “unwanted” source of variability is integrated out. But then again **we assume that analytic integration is out of reach**.

A possibility to approximate the integral above is given by Monte Carlo methods, resulting in a stochastic approximation that we now show to be unbiased. We simulate iid samples from the distribution of the auxiliary variable:

The second identity rewrites the joint distribution using conditional probabilities. Finally we carry out Monte Carlo integration. Notice that we do not need to know the expression of the auxiliary distribution, but we need to be able to simulate from it. Clearly, the above assumes that we are able to evaluate the conditional density at every point (this construction naturally applies to state-space models, when we consider the observations as “data” and the auxiliary draws as “particles”).

From the second identity it should be clear that the target density can be considered as an expectation taken with respect to the auxiliary distribution, as we have

and the notation emphasizes this fact.

Finally, it is easy to show that the Monte Carlo approximation gives an unbiased estimator of the target density, since we have just shown the expectation of each term is correct. Therefore, denoting the approximated density returned by Monte Carlo accordingly, we define

and

and this holds at any point and *for any number of samples*. Also, the estimate is clearly non-negative.
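The construction above is easy to check numerically. In this sketch (the toy joint distribution is my own choice) the auxiliary variable is standard Gaussian and the conditional density is Gaussian too, so the exact marginal is known and we can verify that the Monte Carlo estimator is unbiased even with very few samples:

```python
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(42)

# toy joint: u ~ N(0,1) and y|u ~ N(u,1); the exact marginal of y is N(0,2)
def p_hat(y, M, rng):
    u = rng.standard_normal(M)              # iid draws from p(u)
    return normal_pdf(y, u, 1.0).mean()     # (1/M) * sum_m p(y | u_m)

y = 0.7
true_p = normal_pdf(y, 0.0, np.sqrt(2.0))

# unbiasedness holds for ANY M: average many independent estimates with M = 5
estimates = np.array([p_hat(y, M=5, rng=rng) for _ in range(20000)])
print(abs(estimates.mean() - true_p))  # close to zero
```

Each individual estimate is noisy, but their average matches the exact marginal density: that is exactly the unbiasedness property the pseudo-marginal argument relies on.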

Having discussed unbiasedness for simple Monte Carlo it is now easy to show how to use this fact for statistical inference.

In the previous section we have considered a generic . Here we use Metropolis-Hastings in a Bayesian context to simulate from a specific distribution, the posterior distribution for a parameter , given data . From Bayes’ theorem we have , with the prior of , the likelihood function and the “evidence” (if you are interested in state-space models you might want to check an earlier post).

In most realistic modelling scenarios, the likelihood function is unavailable in closed form and we say that the statistical model we are considering has an “intractable likelihood”. However it might be possible to obtain an unbiased non-negative approximation of the likelihood. We show that in this case it is also possible to obtain samples from the exact posterior of the parameter.

The previous section used Monte Carlo integration, independently sampling an “auxiliary” variable we do not really care about. In this case too we consider sampling from an “augmented” space, the space over which the pair (parameter, auxiliary variable) is defined.

Therefore we now artificially sample from the augmented approximate posterior (it is approximate because we have an approximate likelihood) even if we are ultimately interested only in the parameter. In this context, in order to write the joint (augmented) posterior we need to introduce a prior for the auxiliary variable. Recall we are not interested in conducting inference for the auxiliary variable, therefore the definition of its prior is only guided by convenience, as we are merely required to be able to sample from it. Also, we assume the parameter and the auxiliary variable to be a priori independent, that is their joint prior factorizes.

As mentioned above, we assume the availability of a non-negative unbiased approximation to the likelihood which we denote . By introducing the usual auxiliary variable, this approximation can be written as

.

Notice we used the factorized prior because of the assumed a priori independence between the parameter and the auxiliary variable.

Then we require the key assumption of unbiasedness. Again we consider the likelihood approximation as an expectation over the auxiliary variable, and we want that

.

Now that we have set these requirements on the likelihood function, we proceed to write the approximate augmented posterior:

[Notice the denominator: by the unbiasedness assumption the expectation of the approximate likelihood equals the exact likelihood, and integrating the latter against the prior gives by definition the evidence.]

We know the exact (unavailable) posterior of is

therefore the marginal likelihood (evidence) is

and then we plug the evidence into :

Now, we know that applying an MCMC targeting the augmented posterior and then discarding the output pertaining to the auxiliary variable corresponds to integrating it out of the posterior, that is, to marginalizing it. Therefore, assuming we use MCMC to perform such marginalization, the returned draws have the following stationary distribution

.

Notice we have recovered *the exact posterior of* .

We have therefore performed a *pseudo-marginal* approach: “marginal” because we disregard the auxiliary variable; “pseudo” because we use an approximate likelihood rather than the exact one.

In conclusion we have shown that running MCMC on an (artificially) augmented posterior, and then discarding from the output all the auxiliary variates generated during the entire MCMC procedure, implies that the sampled parameter draws are the product of an *exact Bayesian inference procedure*.

Besides the theoretical construction we have described, in practice all that is needed is running a regular MCMC over the augmented space, without bothering to store the generated samples of the auxiliary variable.

Pseudo-marginal MCMC methods are due to Beaumont (2003) and Andrieu and Roberts (2009) and have revolutionised the application of Bayesian inference for models having intractable likelihoods. In my opinion they constitute one of the most important statistical advancements of the last thirty years.

In particular, the methodology has found enormous success in inference for state-space models, as sequential Monte Carlo produces unbiased approximations of the likelihood for any number of particles (though unbiasedness is not trivial to prove in this case; see Pitt et al. and Proposition 9.4.1, page 301 of Del Moral (2004)). It is even possible to obtain exact inference simultaneously for the states and the parameters of state-space models.

However, as shown above, the generality of the approach makes pseudo-marginal methods appealing for a wide range of scenarios, including approximate Bayesian computation and other methods for likelihood-free inference such as synthetic likelihoods.

Of extreme importance is the fact that the results presented above hold irrespective of how we have obtained the unbiased approximation of the likelihood. Whether the likelihood has been approximated via Monte Carlo using a “large” or a “small” number of auxiliary variates, the theoretical result still holds and we sample exactly from the posterior regardless. However, for practical applications appropriate tuning of the number of auxiliary variates is important, as its value influences the precision of the likelihood approximation, which in turn affects the mixing of the chain. See this post and references therein.

- Dahlin and Schön 2017: intro paper focussed on state-space models and computer implementation in R;
- D. Wilkinson’s blog post on particle marginal methods for parameter inference;
- D. Wilkinson’s blog post on particle marginal methods for parameter and state inference in state-space models.
- Fasiolo et al. 2016 : paper comparing several algorithmic strategies for intractable likelihoods.
- Technical papers on tuning particle marginal algorithms: Doucet et al. 2012; Pitt et al 2012; Sherlock et al. 2015; Sherlock 2016;
- Some demo software: my Matlab example; my R code (see the pomp_ricker-pmcmc file);
- Some serious R software: the pomp package (function `pmcmc`); smfsb (type `demo("PMCMC")` at the command line);
- More serious software for state-space models: LibBi; Biips (not sure if this one is actively maintained).


Notice, this post *does not discuss the design* of a Metropolis-Hastings sampler: that is, important issues affecting MCMC, such as bad mixing of the chain, multimodality, or the construction of “ad hoc” proposal distributions, are not considered.

I am assuming the reader already knows what a Metropolis-Hastings algorithm is, what it is useful for, and that she has already seen (and perhaps tried coding) some implementation of this algorithm. Even given this background knowledge, there are some subtleties that are good to know and that I will discuss below.

Briefly, what we wish to accomplish with Markov chain Monte Carlo (MCMC) is to sample (approximately) from a distribution having a given density or probability mass function. Here we focus on the Metropolis-Hastings algorithm (MH). Assume that our Markov chain is currently in some “state”, and we wish to propose a move, generated from a proposal kernel. The proposed value is then accepted or rejected with complementary probabilities, where

The quantity above is called the *acceptance probability*.

A *very simplified* pseudo-algorithm for a single step of MH looks like:

propose; accept with the given probability; otherwise reject

However, since we wish to sample say R times from the target, we had better store the result of our simulations into an array `draws` and write (assuming an arbitrary initial state)

for r = 1 to R
    propose a candidate from the proposal kernel
    compute the acceptance probability
    with that probability, accept: store draws[r] := candidate, and set the candidate as the current state
    otherwise: store draws[r] := current state
end

Here `:=` means “copy the value at the right-hand side into the left-hand side”. After a few (sometimes many) iterations the chain reaches stationarity, and its stationary distribution happens to have the target as probability density (or probability mass) function. In the end we will explore the content of `draws` and disregard some of the initial values, to wash away the influence of the arbitrary starting state.
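The pseudo-algorithm above can be turned into working code. Here is a minimal sketch for a toy target, a standard Gaussian coded through its unnormalised log-density; the tuning values are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def logtarget(x):
    return -0.5 * x**2            # unnormalised log-density of N(0,1)

R, step = 5000, 1.0
draws = np.empty(R)
x = 3.0                           # arbitrary initial state
for r in range(R):
    xp = x + step * rng.standard_normal()   # propose (symmetric kernel)
    # accept with probability min(1, ratio), comparing in the log domain
    if np.log(rng.uniform()) < logtarget(xp) - logtarget(x):
        x = xp                    # accepted: the proposal becomes the state
    draws[r] = x                  # store the current state at EVERY iteration
```

Notice that `draws[r]` is filled at every iteration, whether or not the proposal was accepted, exactly as in the pseudo-code above.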

The very first time I heard of MH (self study and no guidance available) was in 2008. So I thought “ok let’s google some article!” and I recall that *all* resources I happened to screen were simply stating (just as I wrote above) “accept the move with probability and reject otherwise”. Now, if you are new to stochastic simulation the quoted sentence looks quite mysterious. The detail that was missing in my literature search was the logical link between the concept of acceptance probability and that of sampling.

Here is the link: you have to decide between two possible outcomes of a random decision, *accept* or *reject* a move, and each of these two outcomes has its own probability (the acceptance probability for accepting, its complement for rejecting). So our decision is a random variable taking two possible values, hence a Bernoulli random variable whose “parameter” is the acceptance probability. And here is the catch: to sample from a discrete distribution (like a Bernoulli distribution) you can use the inverse transform method, which in the special case of a binary decision translates to: (i) sample a pseudo-random number u from a uniform distribution on the interval [0,1] (in R this is obtained via `runif(1)`), and (ii) if u is smaller than the acceptance probability accept, otherwise reject. That's it (*).

(*) Well, yes that's it. However the coding of the implementation can be simplified to just: if u is smaller than the (untruncated) Metropolis-Hastings ratio accept, otherwise reject. This is because u is at most 1, hence whenever the ratio is larger than 1 then certainly u falls below it.

In conclusion, for a single MH step we substitute (for simplicity we remove the dependency on the iteration index we introduced in the previous code)

compute the acceptance probability, then accept with that probability

with

simulate a uniform draw u; if u is smaller than the acceptance ratio then accept, and store both the proposal and its target-density value

A common mistake is not advancing the iteration counter until we have an accept. This is wrong (hence if you ever coded a MH algorithm by looping until acceptance, it is fairly likely you acted the wrong way). In the implementation given in the introduction to this post (the one using the for loop) you may notice that when a proposal is rejected I store the last accepted value as the current draw.

If you compare the pseudo-codes suggested above you may have noticed that when a proposal is accepted I also “save” its target-density value, in addition to storing the accepted proposal itself. While it is not necessary to do the former (and in fact most descriptions of MH in the literature do not mention this part), it is a waste of computer time not to keep the accepted value in memory, so that we do not need to recompute it at the next iteration (since it is reused at the denominator of the acceptance ratio).
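The saving can be made explicit with a counter on the (possibly expensive) target evaluations; this sketch uses a toy target of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

calls = 0
def logtarget(x):
    global calls
    calls += 1                    # count how often the expensive part runs
    return -0.5 * x**2            # stand-in for a costly log-posterior

R = 2000
x = 0.0
lt_x = logtarget(x)               # ONE evaluation at the initial state...
draws = np.empty(R)
for r in range(R):
    xp = x + rng.standard_normal()
    lt_xp = logtarget(xp)         # ...then ONE evaluation per iteration
    if np.log(rng.uniform()) < lt_xp - lt_x:
        x, lt_x = xp, lt_xp       # cache the value along with the state
    draws[r] = x

print(calls)  # 2001 evaluations instead of 2 * 2000
```

Without the cached `lt_x`, the denominator would be recomputed at every iteration, doubling the number of (possibly very slow) likelihood evaluations.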

This is especially relevant when MH is used for Bayesian inference, where the target is a posterior distribution proportional to the product of a “prior” and a “likelihood function”. In fact, it is often the case that the pointwise evaluation of the likelihood is far from instantaneous, e.g. because the data have thousands of values, or the evaluation of the likelihood itself requires some numerical approximation.

Depending on the case, the target density may sometimes take very large or very small values. Such extremely large/small values may cause a numeric overflow or underflow, that is, your computing environment will be unable to represent these numbers correctly. For example a standard Gaussian density is strictly positive for all real arguments, but if you evaluate it far in the tails your software will likely return zero (this is an underflow).

For example, in R type `dnorm(40)` and you'll see it returns zero.

Therefore **instead of using**

if the uniform draw is smaller than the acceptance ratio accept, otherwise reject

**use**

if the logarithm of the uniform draw is smaller than the logarithm of the acceptance ratio accept, otherwise reject

Anyhow, while this recommendation is valid, **it might not be enough** without further precautions, as described in the next point.

Strongly connected to the previous point is coding without doing your calculations analytically first. That is, I see beginners trying to use the log-domain suggestion from point 4 by taking the logarithm of an already-computed density. This will not help, because the log will be applied to a value that has already underflown, and the logarithm of zero is taken.

In R you can use `dnorm(x, log = TRUE)`, which is the safe way to evaluate the (normalised) standard Gaussian log-density. For other languages you may need to code the log-density yourself.
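For instance, in Python one might code the normalised log-density directly (a sketch; `scipy.stats.norm.logpdf` would do the same job):

```python
import numpy as np

def logpdf_std_normal(x):
    # normalised standard Gaussian log-density, safe far in the tails
    return -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)

x = 40.0
dens = np.exp(logpdf_std_normal(x))   # underflows to exactly 0.0
logdens = logpdf_std_normal(x)        # about -800.9, perfectly representable

# ratio of two tail densities: garbage in the natural domain, fine in logs
ratio_naive = np.exp(logpdf_std_normal(41.0)) / np.exp(logpdf_std_normal(40.0))
log_ratio = logpdf_std_normal(41.0) - logpdf_std_normal(40.0)

print(dens, ratio_naive, log_ratio)
```

The naive ratio is 0/0 and meaningless, while the log-domain difference is a perfectly ordinary finite number: this is exactly why acceptance should be computed in the log domain.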

However, since we are interested in using MH, we may even just code the *unnormalised* log-density

where I have omitted the (irrelevant for MH) normalization constant. This brings us to the next point.

An important feature of MH is that it can sample from distributions that are known only up to a proportionality constant (a term independent of the sampled variable). This means the target can be an unnormalised density (or an unnormalised probability mass function). As mentioned in the previous point, we do not strictly need the normalized version: the normalization constant simplifies out anyway when taking the acceptance ratio.

The simplification of proportionality constants is not just a matter of coding elegance, but can affect the performance of the algorithm. For example if the target is a Gamma *distribution*, then it contains Gamma *functions* as part of the proportionality constant. When the shape parameter is not an integer but a positive real number, the evaluation of the Gamma function involves the numerical approximation of an integral. Now, this approximation is not onerous if executed only a few times. However, MCMC algorithms are often run for *millions* of iterations, and in such cases we can save computational time by removing constants we don't need anyway. In conclusion, you can safely avoid coding the proportionality constant (or its logarithm), as it is independent of the sampled variable.

A symmetric proposal function is one for which the density of moving from one point to another equals the density of the reverse move, for every pair of points. Therefore symmetric proposals do not need to be coded when evaluating the acceptance ratio, as they simplify out.

For example, assume a multivariate Gaussian proposal function written as (with some abuse of notation) , that is a Gaussian density with mean and covariance matrix . Then we have

where the vertical bars denote the determinant of a matrix and the superscript denotes transposition. Clearly this is symmetric because of the quadratic form in the exponent.
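This symmetry is easy to verify numerically; in this sketch the covariance matrix and the two points are arbitrary choices of mine:

```python
import numpy as np

def log_q(to, frm, cov):
    # log-density of a Gaussian with mean `frm` and covariance `cov`, at `to`
    d = to - frm
    _, logdet = np.linalg.slogdet(cov)
    k = to.size
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])
x = np.array([0.5, -1.0])
y = np.array([2.0, 0.7])

print(log_q(y, x, cov), log_q(x, y, cov))  # the two directions coincide
```

Because the two values always coincide, the proposal terms can simply be dropped from the acceptance ratio.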

Most previous recommendations are related to numerical issues in the implementation of MH. However this one alerts you of an actual statistical mistake. I introduce the problem using a specific example.

A Gaussian proposal function is a simple choice that often works well in problems where the parameter is low dimensional (e.g. a vector of up to, say, five elements) and the geometry of the target is not too complicated. The fact that a Gaussian proposal has support on the entire real space makes it a useful default choice. However, when some elements of the parameter are constrained to “live” in a specific subset of the real space, we should be aware that a significant proportion of proposed values will be rejected (unless we “design” the proposal to intelligently explore only a promising region of the space). Wasting computational time is not something we want, right?

For the sake of illustration, let's assume the parameter is a positive scalar random variable (see point 8b below for a more general discussion). We could code a Metropolis random walk sampler with Gaussian *innovations* as

where the “step length” is chosen by the experimenter and the innovation is sampled from a standard Gaussian. Clearly the proposal will certainly be rejected whenever it is negative, since we are assuming that the support of the target is positive.

A simple workaround is to propose the logarithm of the parameter instead of the parameter itself, that is, run the random walk on the log scale,

for some other appropriate step length. Then we exponentiate the proposal before “submitting” it to the target. Now **if this is all we do, we have a problem**.

The thing is, we proposed from a Gaussian, and then we thought that exponentiating the proposal is just fine. It is *almost* fine, but we must realise that the exponential of a Gaussian draw is log-normally distributed, and we must account for this fact in the acceptance ratio! In other words we sampled using a proposal function which is Gaussian, and then by exponentiating we effectively sampled from another proposal, which is log-normal.

Therefore, from the definition of the log-normal density we have (recall that in this example the parameter is scalar) an explicit expression for the effective proposal density, or equivalently an expression in terms of the Gaussian density of the logarithm of the proposed value.

Then we notice that this proposal function is *not symmetric* (see point 7): the two directions of the move have different densities. Hence the correct sampler does the following: simulate a uniform draw and, using all the recommendations above, accept when the log of the draw is smaller than the log acceptance ratio (which now includes the ratio of proposal densities),

and reject otherwise.

But how would we proceed if we were not given the information that the exponential of a Gaussian variable is a log-normal variable? A simple result on the transformation of random variables will assist us, when the transformation is an invertible function. Say that for a generic random variable we seek the distribution of an invertible transformation of it; using the transformation rule linked above, the transformed variable has a density given by the original density evaluated at the inverse transformation, multiplied by the absolute value of the derivative of the inverse, the “Jacobian of the transformation”.

Therefore, we should not forget the “Jacobians” when transforming the proposed draw. In our example above the transformation is the exponential, and the corresponding correction enters the acceptance probability as a ratio of Jacobians.
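Putting 8a together, here is a sketch of a sampler for a positive parameter, using a random walk on the log scale with the Jacobian correction in the acceptance ratio. The Exponential(1) target and all tuning values are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)

def logtarget(x):
    return -x                     # unnormalised log-density of Exp(1), x > 0

R, step = 20000, 0.8
x = 1.0
draws = np.empty(R)
for r in range(R):
    xp = np.exp(np.log(x) + step * rng.standard_normal())  # log-normal proposal
    # log acceptance ratio: target ratio PLUS the Jacobian term log(xp) - log(x),
    # which is exactly log q(x | xp) - log q(xp | x) for this proposal
    log_alpha = logtarget(xp) - logtarget(x) + np.log(xp) - np.log(x)
    if np.log(rng.uniform()) < log_alpha:
        x = xp
    draws[r] = x

print(draws.mean())  # close to 1, the mean of an Exp(1) random variable
```

Dropping the `np.log(xp) - np.log(x)` term is exactly the mistake described above: the chain would then target a distorted version of the intended distribution.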

The discussion in 8a above treats the specific case of using a random walk proposal to target a distribution with positive support. How about different types of constraints? My colleague at Lund University Johan Lindström has written a compact and useful reference table, listing common transformations of constrained proposals generated via random walk.

See also an excellent and detailed post by Darren Wilkinson.

**9- Check the target at the starting value**

Generally it is not really necessary to place checks in your code for infelicities produced by the MCMC sampler. That is, it is not necessary to check each proposal for complex values, or to check that the target evaluation does not result in a NaN (not-a-number), or that you do not get, say, an infinity (yes, these things can happen if you code the target yourself without caution, or when it is the result of some approximation, e.g. when the expression of the target is not analytically known).

In fact, in these cases the uniform draw from point 1 above will not be considered smaller than the acceptance ratio: the comparison will evaluate to a false logical statement and the proposal will be rejected, as it should be.

However an exception to the above is when we evaluate the target at the starting value. In this case we do want to check that the result is not something unwanted. For example, say the target evaluates to zero (or NaN) at the starting value, and we just put this value at the denominator of the acceptance ratio (since this is the starting value it automatically goes into the denominator). Then for the first proposal the acceptance ratio is ill-defined (assume the numerator is finite).

Clearly no proposal will then ever be accepted, the unwanted value will stay at the denominator for all iterations, and the sampler will never move away from the starting value.

I have created an email list called Bayes Nordics. The goal is to disseminate news on events related to Bayesian analysis in the Nordic countries.

Workshops, conferences, job openings, courses, even local seminars and meetups. Any Bayes-event happening in the Nordics is welcome.

It is up to the users to provide the content.

Basically, it should become something analogous to what’s being done in the excellent Allstat list (though at a way smaller scale). The intention is to keep the “noise” low. So Bayes Nordics is not a “questions & answers” forum; there are already many good places out there to post questions.

I think we have all experienced the frustration of missing an excellent seminar happening in a neighbouring university department, which we could have attended if only we had known. Or missed registering for an introductory course in Bayesian statistics just because, well, it's not always easy to find this information. Many events are simply not widely advertised. Say you don't use meetup (meetup.com), so you don't know that there is an afterwork meetup where someone talks about probabilistic machine learning and variational Bayes.

You get the point, the list of examples could go on and on.

So, if you get to know of some interesting event/opportunity, just post it by email at Bayes Nordics. Registration is free and the platform is ads-free (as long as Google keeps it this way).

An important result derived in the previous post is the sequential update of the importance weights. That is, denoting the importance weight of each particle at a given time, we have

The weights update allows for an online approximation of the likelihood function of the parameters, which we can write as

At this point, an approximate maximum likelihood estimate can be obtained by searching for the following (typically using a numerical optimizer)

or could be plugged into a Bayesian procedure for sampling from the posterior distribution (more on this in future posts).

Actually, while the procedure above can be successful, some crucial problems can occur in general, depending on the type of data and the adequacy of the model we employ to “fit” the data.

**Particle degeneracy:** the convenient representation of the likelihood as a function of the weights is problematic when a particle produces a very small weight. By “very small” I mean that its numeric value as returned by a computer (floating-point representation) is zero, even though its mathematical value is actually positive. This is because computers have a limited ability to represent very small and very large values: if a value is “too small” it will be represented as exactly zero, and if it is “too large” it will be represented as infinity. The former case is said to produce an underflow, while a too-large weight might produce an overflow. We focus on the underflow, as it is more common. Consider for example the plot below, where at time 44 there is a clear outlier, that is an observation with an unusual value compared to the rest of the data.

This outlier could have been produced by an unusual statistical fluctuation: at that time the datum might be a realization from a tail of the observation density. While this is legitimate, numerically it can create issues for the stability of our calculations. Say the observation density is Gaussian, centered at the current state with standard deviation 0.5. Let's see what happens, anticipating a topic treated later on.

Say that I choose as importance density the transition density of the latent process; then the weight becomes the observation density evaluated at the datum,

with some abuse of notation. Now, for the given choice of proposal density (and if I have been employing a model appropriate for the data), at the previous time instant 43 most particles will have values somewhere around 30 (see the y-axis in the figure), and we could expect that most of the particles generated from the transition density should take values not too far from 30. And here is the problem: from the figure the datum at time 44 is far from those values, and the computer implementation of the observation density might underflow and return zero for many particles (I will describe some tricks mitigating this problem in a future post). Clearly, all those particles having zero weight doom the values of the *descendant* particles, because each weight depends on the previous one. A similar scenario could happen even without outliers, if we simulate from a model not appropriate for the data. Clearly, as the time horizon increases, it is not unlikely to run into such a problem.

The phenomenon whereby most particles have very small weights is called *particle degeneracy*, and for many years it prevented these Monte Carlo strategies from being effective. When particle degeneracy occurs the likelihood approximation is poor, with a large variance, because the numerical calculation of the underlying integrals is essentially performed by the few particles with non-zero weight.
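One standard numerical safeguard (a sketch of the kind of trick hinted at above) is to keep the weights in the log domain and normalize them with the “log-sum-exp” shift, so that at least the largest weight never underflows:

```python
import numpy as np

# extreme log-weights: exponentiating them directly underflows to zero
log_w = np.array([-1000.0, -1002.0, -1005.0, -1500.0])
print(np.exp(log_w).sum())        # 0.0: all information lost

# subtract the maximum before exponentiating: the shift cancels on normalization
shifted = np.exp(log_w - log_w.max())
w = shifted / shifted.sum()       # valid normalized weights
print(w)
```

The shift changes nothing mathematically (it cancels in the ratio), but it turns a vector of hard zeros into usable normalized weights. It does not cure degeneracy itself, which is what the resampling step below addresses.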

The striking idea that changed everything is the introduction of a *resampling* step, whose importance in sequential Monte Carlo was first studied in Gordon et al (1993), based on ideas from Rubin (1987). The idea is simple and ingenious and has revolutionised the practical application of sequential methods. The resulting algorithm is the “sequential importance sampling with resampling” (SISR) but we will just call it a sequential Monte Carlo algorithm.

The resampling idea is to get rid, in a principled way, of the particles with small weights and to multiply the particles with large weights. Recall that generating particles is about exploring regions of the space where the integral has most of its mass. Therefore, we want to focus the computational effort on the “promising” parts of the space. This is easily accomplished if we “propagate forward” from the promising particles, which are those with non-negligible weights. We proceed as follows:

- Normalize the weights to sum to 1, that is compute .
- Interpret as the probability associated to in the *weighted set*. If it helps, imagine the particles to be balls contained in an urn: some balls are large (large ), others are small.
- Resampling: sample times with *replacement* from the weighted set, to generate a new sample of particles. This means that we put a hand in the urn, extract a ball and record its index, then put the ball back in the urn and repeat the extraction and recording times (more on how to implement resampling at the end of this post). Clearly it is more likely to extract large balls than small ones.
- Replace the old particles with the new ones . Basically, we empty the urn, then fill it up again with copies of the balls having the recorded indices. Say we have extracted index five times: we put in the urn five copies of the ball with that index.
- Reset all the unnormalised weights to (the resampling has destroyed the information on “how” we reached time ).
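
As a concrete sketch, the urn procedure above amounts to multinomial sampling of particle indices with probabilities given by the normalised weights. A minimal NumPy version could look like this (the particle values and weights are made-up toy numbers, just for illustration):

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Draw N indices with replacement, with probabilities given by the
    normalised weights, and return the corresponding particles (the
    'balls extracted from the urn')."""
    weights = np.asarray(weights, dtype=float)
    probs = weights / weights.sum()   # normalise the weights to sum to 1
    idx = rng.choice(len(particles), size=len(particles), replace=True, p=probs)
    return particles[idx]             # weights are then reset to 1/N

rng = np.random.default_rng(0)
particles = np.array([0.0, 1.0, 2.0, 3.0])
weights = np.array([0.01, 0.01, 0.01, 0.97])   # one dominant particle
resampled = multinomial_resample(particles, weights, rng)
# the dominant particle is very likely to appear multiple times in `resampled`
```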

Since resampling is done with replacement, a particle with a large weight is likely to be drawn multiple times. Particles with very small weights are not likely to be drawn at all. Nice!

The reason why performing a resampling step is not only a numerically convenient trick to overcome (sometimes!) particle degeneracy, but is also *probabilistically* allowed (i.e. it preserves our goal to obtain an approximation targeting ) is illustrated in a great post by Darren Wilkinson.

We now move the particles to the next time point , that is we *propagate forward* the state of the system by simulating particles . And how do we perform this step? We take advantage of the important particles we have just resampled, using the importance density to compute the move

The important fact is that we often propagate from important particles, since these appear several times in the urn because of the resampling with replacement. Therefore several of the propagated particles might originate from a common “parent” . For an illustration see the picture below.

Read the picture from top to bottom: we start with particles all having the same size, which means they have equal weight . Particles are then evaluated on the density depicted with a curve. The particles’ weights are computed, and you can see that some particles now have a larger size (large weight), others are smaller, and some have “disappeared”, meaning that their weight is negligible. At this point the particles are resampled: notice at the resampling stage the largest particle in the figure happens to be resampled three times while others fewer times. Once resampling is performed, all the resampled particles get the same size because all weights are reset to as described above. Now it is time to propagate the resampled particles forward to time : we create a new generation of particles by moving forward **only** the ones we have resampled. This means that some of the particles having very small weight at time will not be propagated to (notice in the figure, some particles do not have arrows departing from them), and the one that has been resampled three times generates three new particles. This implies that at each generation we still have particles at our disposal, even though some from the previous generation have “died”. I illustrate the use of the resampling step in the bootstrap filter.

The bootstrap filter is the simplest example of a sequential importance sampling with resampling (SISR) algorithm. It is the “simplest” application of SISR because it assumes , that is the law of the Markov process is used as importance sampler to *propagate* particles forward. This implies the already mentioned simplification for the computation of the weights, and we have

Notice that in the equation above we have an equality, instead of a proportionality, since after resampling we set weights to be all equal to , hence after resampling .

This approach is certainly very practical and appealing, but it comes at a cost. Generating particles from means that these are “blind” to data, since this importance density is unconditional to . Hence the propagation step does not take advantage of any information carried by the data. In some scenarios this produces inefficient sampling that may result, again, in particle degeneracy. A popular alternative is the auxiliary particle filter. I do not go further into the possible improvements over the bootstrap filter; however, some literature is given at the end of this post.

So here is the bootstrap filter in detail.

1. At (initialize) and assign , for all .
2. At the current assume we have the weighted particles .
3. From the current sample of particles, resample with replacement times to obtain .
4. Propagate forward , for all .
5. Compute and normalise weights .
6. Set and if go to step 2, otherwise go to step 7.
7. Return .

Recall that is the unconditional density of an arbitrary initial state ; this density is set by the modeller (alternatively, can be fixed deterministically and then all particles will propagate from a common ). Notice in step 5 I wrote instead of . The two formulations are completely equivalent as they only differ by a constant which is irrelevant for the purpose of assigning a weight to a particle. Also, since weights are going to be normalised, the is not really necessary. However if for some reason it is relevant to have a pointwise estimate of the likelihood (as opposed to e.g. optimizing it over ), then it is important to recover the constant, and write . In step 7 I have explicitly reported the likelihood approximation, even though for parameter inference the product of the normalizing constants can be disregarded.
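
To make the steps concrete, here is a minimal sketch of the bootstrap filter for a hypothetical toy model (my own choice, not from the post): a scalar Gaussian random walk x_t = x_{t-1} + N(0, sigma_x^2) observed as y_t = x_t + N(0, sigma_y^2), with x_0 = 0 fixed. The log-weight trick discussed above is used when accumulating the log-likelihood estimate:

```python
import numpy as np

def bootstrap_filter_loglik(y, N, sigma_x, sigma_y, rng):
    """Bootstrap filter returning a log-likelihood estimate for the toy model
    x_t = x_{t-1} + N(0, sigma_x^2), y_t = x_t + N(0, sigma_y^2), x_0 = 0.
    Resampling is performed at every time point."""
    loglik = 0.0
    x = np.zeros(N)                  # all particles start from the common x_0 = 0
    w = np.full(N, 1.0 / N)
    for yt in y:
        idx = rng.choice(N, size=N, replace=True, p=w)   # resample indices
        x = x[idx] + rng.normal(0.0, sigma_x, size=N)    # propagate forward
        # unnormalised log-weights from the observation density p(y_t | x_t)
        logw = -0.5 * np.log(2.0 * np.pi * sigma_y**2) - 0.5 * ((yt - x) / sigma_y) ** 2
        m = logw.max()                                   # log-sum-exp trick
        sumw = np.exp(logw - m).sum()
        loglik += m + np.log(sumw / N)   # log of the factor (1/N) * sum of weights
        w = np.exp(logw - m) / sumw      # normalised weights
    return loglik

# simulate toy data and run the filter
rng = np.random.default_rng(42)
sigma_x, sigma_y, n = 0.5, 1.0, 20
x_true = np.cumsum(rng.normal(0.0, sigma_x, size=n))
y = x_true + rng.normal(0.0, sigma_y, size=n)
loglik_hat = bootstrap_filter_loglik(y, 5000, sigma_x, sigma_y, np.random.default_rng(1))
```

For this linear Gaussian toy model the exact likelihood is available from the Kalman filter, which gives a handy correctness check on the filter's output.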

The seven steps above are the simplest version of a bootstrap filter, where the resampling is performed at every . However, the resampling adds unwanted variability to the estimate of the likelihood function. This extra variability is not really welcome as it makes the estimate more imprecise (and can affect the performance of the pseudo-marginal algorithms I will describe in future posts).

A standard way to proceed is to resample only when necessary, as given by a measure of potential degeneracy of the sequential Monte Carlo approximation, such as the *effective sample size* (Liu and Chen, 1998). The effective sample size takes values between 1 and and at time is approximated via . If the degeneracy at is too high, i.e. the ESS is below a specified threshold (say below ) then resampling is performed, otherwise no resampling is performed at time .
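
A sketch of this criterion, assuming the usual approximation ESS = 1 / (sum of squared normalised weights), is only a few lines:

```python
import numpy as np

def ess(normalised_weights):
    """Effective sample size of a set of normalised importance weights:
    1 / sum_i w_i^2, ranging from 1 (total degeneracy) to N (uniform weights)."""
    w = np.asarray(normalised_weights, dtype=float)
    return 1.0 / np.sum(w ** 2)

N = 100
uniform = np.full(N, 1.0 / N)       # no degeneracy: ESS = N
degenerate = np.zeros(N)
degenerate[0] = 1.0                 # total degeneracy: ESS = 1
# resample only when ESS falls below a threshold, e.g. N / 2:
do_resample = ess(degenerate) < N / 2
```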

Finally notice that SISR (and the bootstrap filter) returns an *unbiased* estimate of the likelihood function. This is completely unintuitive and not trivial to prove. I will go back to this point in a future post.

Coding your own version of a resampling scheme should not be necessary: popular statistical software will probably have built-in functions implementing several resampling algorithms. To my knowledge, the four most popular resampling schemes are: residual resampling, stratified resampling, systematic resampling and multinomial resampling. I mentioned above that resampling adds unwanted variability to the likelihood approximation. Moreover, different schemes introduce different amounts of variability. Multinomial resampling is the one that gives the worst performance in terms of added variance, while residual, stratified and systematic resampling are about equivalent, though systematic resampling is often preferred because it is easy to implement and fast. See Douc et al. 2005 and Hol et al. 2006.
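
For illustration, here is a sketch of the standard systematic resampling scheme (one uniform draw, then N evenly spaced pointers swept through the cumulative weights):

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: a single uniform draw generates N evenly spaced
    positions in [0, 1); each position selects the particle index whose
    cumulative weight interval contains it. O(N) and low added variance."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    N = len(w)
    positions = (rng.uniform() + np.arange(N)) / N   # evenly spaced pointers
    cumsum = np.cumsum(w)
    cumsum[-1] = 1.0                                 # guard against round-off
    return np.searchsorted(cumsum, positions)        # indices to keep

rng = np.random.default_rng(0)
idx = systematic_resample([0.0, 1.0, 0.0, 0.0], rng)   # only index 1 survives
```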

I have addressed the problem known as “particle degeneracy” affecting sequential importance sampling algorithms. I have introduced the concept of resampling, and when to perform said resampling. This produces a sequential importance sampling resampling (SISR) algorithm. Then I have introduced the simplest example of SISR, the bootstrap filter. Finally, I have briefly mentioned some results pertaining to resampling schemes.

We now have a partial (though useful as a starting point for further self-study) introduction to particle filters / sequential Monte Carlo methods for approximating the likelihood function of a state space model. We are now ready to consider Bayesian parameter inference, including practical examples.

- Darren Wilkinson’s blog is a great resource. I recommend following his blog for his insightful and clearly written posts. In this case see his description of importance sampling, resampling and the bootstrap filter.
- Chapter 7 in Simo Särkkä (2013), *Bayesian Filtering and Smoothing*, Cambridge University Press. Notice a PDF version is **freely available** at the author’s webpage. Companion software is available at the publisher’s page (see the Resources tab).
- Review paper by Doucet and Johansen (2008), A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later.

We have already learned that SSM have an *intractable likelihood*. This means that the analytic expression of the likelihood function for the vector of parameters is not known. We can also say that a likelihood function is intractable when it is difficult to approximate, though this notion is admittedly vague. What counts as “difficult” is relative: let’s say that the integrals involved in said likelihood cannot be solved using standard numerical methods such as quadrature.

As a reminder, for given data we want to approximate the likelihood function

and I have shown that it is possible to write

So the question is how to find an approximation for this -dimensional integral. We can approach the problem from different angles, all interconnected.

We can write as

That is, the integration problem can be interpreted as taking the expectation of the conditional density with respect to the law . This means writing . The interpretation of an integral as a probabilistic expectation is at the core of Monte Carlo methods.

It is natural to approximate an expectation with an empirical mean, as follows: using a computer program, independently generate draws from , then invoke the law of large numbers.

1. Produce independent draws , .
2. For each , compute . Notice the conditioning on the sampled .
3. By the law of large numbers, we have , and the approximation improves for increasing . In fact we can write .
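
The three steps above can be sketched with a generic toy example (my own choice for illustration): approximating E[f(X)] for X ~ N(0, 1) with f(x) = x², whose true value is 1. In our setting, f would be the conditional density of the data given a sampled latent path:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 100_000
x = rng.normal(size=N)      # step 1: independent draws from the sampling density
fx = x ** 2                 # step 2: evaluate the integrand at each draw
estimate = fx.mean()        # step 3: empirical mean, justified by the LLN
# the Monte Carlo error decreases like 1/sqrt(N), regardless of dimension
```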

The error term of the approximation is *regardless of the dimension of* , which is remarkable. Notice that the convergence in step 3 implies that the Monte Carlo estimate is *consistent*. Another important property, to be discussed in a future post, is that the estimate is also *unbiased*.

The issue here is how to generate “good” draws (hereafter *particles*) . Here “good” means that we want particles such that the values of are not negligible. Since these particles are the points where the integrand gets evaluated, we want particles that are “important” for the evaluation of the integral, and we wish most of the to end up in a region of high probability for .

It turns out that for SSM, sequential Monte Carlo (SMC, or “particle filters”) is the winning strategy.

I will not give a thorough introduction to SMC methods. I will only consider a few notions useful to solve our parameter inference problem. I first consider importance sampling as a useful introduction to SMC.

For ease of reading I remove the dependence of quantities on . This is a static parameter (i.e. it does not vary with time), and we can consider as implicit the dependence of all probability densities on .

Consider the following:

where is an arbitrary density function called the “importance density”, which must be non-zero whenever is non-zero (the support of must contain the support of ). The purpose of introducing the importance density is to use it as a sampler of random numbers, assuming we don’t know how to sample from the appearing in the penultimate equality. Therefore we should choose an importance density that is easy to simulate from.

Notice the two crucial simplifications occurring in the fourth equality: we have and . The first result is from the Markov property defined in the previous post, and the second one states the independence of from the “history of prior to time ” **when** **conditioning** on (see the graph in the previous post). Conditional independence was also stated in the previous post.

In conclusion, the importance sampling approach approximates the integral using the following Monte Carlo procedure:

- Simulate samples independently, .
- Construct “importance weights” .
- .
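
The three-step recipe can be illustrated on a toy static problem (my own choice, not from the post): approximating E_p[f(X)] for a target p = N(0, 1) and f(x) = x² (true value 1), drawing instead from a wider importance density q = N(0, 2²) and using self-normalised weights:

```python
import numpy as np

rng = np.random.default_rng(2)

def norm_pdf(x, mu, sigma):
    """Gaussian density, written out to keep the example dependency-free."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

N = 200_000
x = rng.normal(0.0, 2.0, size=N)                    # 1. simulate from q
w = norm_pdf(x, 0.0, 1.0) / norm_pdf(x, 0.0, 2.0)   # 2. importance weights p/q
w = w / w.sum()                                     #    self-normalise
estimate = np.sum(w * x ** 2)                       # 3. weighted average, about 1
```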

Notice that here we wish to simulate sequences with . Clearly, generating at each time a “cloud of particles” spanning the entire interval is not really computationally appealing, and it is not even clear how to construct the importance density. We need some strategy enabling a sort of *sequential* mechanism. Moreover, even if we are able to simulate the particles, are we able to evaluate all the densities appearing in ? For some complex models we might not know the analytic expressions for and . This is addressed in the next post.

When the importance density is chosen intelligently, it enjoys an important property that allows a *sequential* update of the importance weights. That is, for an appropriate we can write

and this sequential update is a key result to make the computation of the -dimensional integral bearable. Here I am going to show how to obtain the weights update above.

First of all, consider that it is up to the analyst to design a density that is appropriate for the model at hand (easy to evaluate, easy to sample from, generating “important” particles). For example, let’s write an importance density as the product of two densities and :

Notice that while depends on measurements up to time , instead depends on measurements up to . This is because we have freedom in designing , as is not subject to the constraints imposed on the state space model (Markovianity, conditional independence). For example does not have to result in , though this choice is allowed. For simplicity, in the following I remove subscripts for and and just write .

I use the decomposition for the importance density to show the sequential update of the importance weights , though for ease of writing I remove the superscript . We have:

Now, note that we can use the Bayes theorem to write the following

and in the derivation I have used both the conditional independence of observations and the Markovianity of the latent state (see the previous post).

We can use this derivation to write , so we can express the weights as

which is the sequential update of the importance weights we wanted to prove. The sequential update is convenient: as the time index advances in the simulation we do not need to produce particles starting from time every time; instead we can perform computations online, by only retaining the weights computed at time .

Notice I used the proportionality symbol as we do not need to carry the constant term resulting from the denominator of the Bayes theorem. This term is independent of , hence it is not going to be relevant for parameter optimization (the maximum likelihood approach) nor for Bayesian sampling from the posterior distribution of .

Once particles are simulated, we have the following approximations (notation-wise, I now recover the parameter I previously removed):

Adding back into the notation is important to avoid confusion between the likelihood function and the “evidence” , where the latter is *really* independent of (it’s the denominator of the Bayes theorem).

In conclusion, I have outlined some approaches to approximate the likelihood function, by assuming the ability to sample from some importance density . However, I have skipped discussing how to construct such a density (indeed, this is a major, problem-dependent issue); a simple possibility is covered in the next post.

It is fair to say that, even though we managed to derive the relations above with success, in practice the computation of the required quantities does not always end up with a good likelihood approximation. This is investigated in the next post, together with strategies to overcome numerical issues.

I have started a quick excursus into Monte Carlo methods for the approximation of the likelihood function for state space models (SSM). I covered a naïve Monte Carlo approach, then importance sampling and finally sequential importance sampling. Each of these topics has been investigated at great length in available literature (there would be so much more to consider, for example quasi-Monte Carlo). However, my focus is to give an idea of possibilities, while quickly moving forward towards introducing the simplest sequential Monte Carlo strategy, which is based on sequential importance sampling. Once sequential Monte Carlo is introduced, I will move to Bayesian parameter estimation for SSM.

Possibilities for self study are endless. Here are a couple of excellent and accessible resources:

- Chapter 7 in Simo Särkkä (2013), *Bayesian Filtering and Smoothing*, Cambridge University Press. Notice a PDF version is **freely available** at the author’s webpage. Companion software is available at the publisher’s page (see the Resources tab).
- Cappé, O., Godsill, S.J. and Moulines, E., 2007. An overview of existing methods and recent advances in sequential Monte Carlo. *Proceedings of the IEEE*, *95*(5), pp. 899–924.

If you have access to the titles below it’s a plus:

- Liu (2004). Monte Carlo Strategies in Scientific Computing, Springer.
- Robert, Casella (2004). Monte Carlo Statistical Methods, Springer.


I will start by introducing a class of dynamic models known as *state space models*.

For general (non-linear, non-Gaussian) state space models it is only relatively recently that a class of algorithms for exact parameter inference has been devised, in the Bayesian framework. In a series of 4-5 posts I will construct the simplest example of this class of *pseudo-marginal* algorithms, now considered the state-of-the-art tool for parameter estimation in nonlinear state space models. Pseudo-marginal methods are not exclusively targeting state space models, but are able to produce exact Bayesian inference whenever a positive and unbiased approximation of the likelihood function is available, no matter the underlying model.

I will first define a state space model, then introduce its likelihood function, which turns out to be *intractable*. I postpone to the next post the construction of Monte Carlo methods for approximating the likelihood function.

A very important class of models for engineering applications, signal processing, biomathematics, systems biology, ecology etc., is the class of state-space models (SSM). *[In some literature the terms SSM and hidden Markov model (HMM) have been used interchangeably, though some sources make the explicit distinction that in HMM states are defined over a discrete space while in SSM states vary over a continuous space.]*

Suppose we are interested in modelling a system represented by a (possibly multidimensional) continuous-time stochastic process , where denotes the state of the system at a time . The notation denotes the ensemble of possible values taken by the system for a continuous time .

However, in many experimental situations the experimenter does not have access to measurements of but rather to noisy versions corrupted by “measurement error”. In other words the true state of the system is unknown, because is latent (unobservable), and we can only learn something about the system via some noisy measurements. I denote the available measurements (data) with and use to denote the process producing the actual observations at discrete time points. For simplicity of notation I assume that measurements are obtained at integer observational times . Each can be multidimensional () but it does not need to have the same dimension as the corresponding ; for example, some coordinate of might be unobserved. Therefore, the only available information we get from our system is rather partial: (i) the system of interest is continuous in time but measurements are obtained at discrete times, and (ii) measurements do not reflect the true state of the system , because the are affected by some measurement noise. For example we could have , with some random noise.

In general and can be either continuous– or discrete–valued stochastic processes. However in the following I assume both processes to be defined on continuous spaces.

I use the notation to denote a sequence . Therefore, data can be written . For the continuous time process I use to denote realizations of the process at times . Clearly, none of the values is known.

Assume that the dynamics of are *parametrized* by a model having a (vector) parameter . The value of is unknown and our goal is to learn something about using available data. That is to say, we wish to produce inference about . I could write in place of , but this makes the notation a bit heavy.

State space models are characterized by two properties: *Markovianity* of the latent states and *conditional independence* of measurements.

**Markovianity:** is assumed to be a Markov stochastic process, with transition density , for . That is, “given the present state, the future is independent of the past”, so if time is the “present”, then . Also in this case, for simplicity, we leave the conditioning on implicit, instead of writing . Specifically for our inference goals, we are interested in transitions between states corresponding to contiguous (integer) observational times, that is . Likewise, “the past is independent of the future, given the present”, meaning that .

**Conditional independence:** measurements are assumed conditionally independent given a realization of the corresponding latent state . That is, .

Markovianity and conditional independence can be represented graphically:

Notice each white node is only able to directly influence the next white node and . Also, each grey node is unable to influence other measurements *directly*. [This does not mean observations are independent: for example and evidently it results in . If equality did hold would be independent of .]

Notice is the initial state of the system at some arbitrary time prior to the first observational time . By convention we can set this arbitrary time to be .

In summary, a compact, probabilistic representation of a state-space model is given by the conditional distribution of the observable variable at given the latent state, and the distribution of the evolution of the (Markovian) state, that is the transition density. Optionally, the initial state can be a fixed deterministic value or have its own (unconditional) distribution which might depend or not on .

**Example: Gaussian random walk**

A trivial example of a (linear, Gaussian) state space model is

with . Therefore

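
A trajectory from this example can be simulated in a couple of lines. Here I assume the standard form of the Gaussian random walk SSM, x_t = x_{t-1} + xi_t and y_t = x_t + eps_t with independent Gaussian noises; the noise variances below are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 50
q, r = 0.5 ** 2, 1.0 ** 2                            # state and observation noise variances
x = np.cumsum(rng.normal(0.0, np.sqrt(q), size=n))   # latent Markov state (random walk)
y = x + rng.normal(0.0, np.sqrt(r), size=n)          # noisy observations of the state
```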
As anticipated, I intend to cover tools for statistical inference for the vector of parameters , and in particular discuss Bayesian inference methods for SSM.

This requires introducing some quantities:

- is the likelihood function of based on measurements .
- In the Bayesian framework is a random quantity and is its prior density (I always assume continuous-valued parameters). It encodes our knowledge about before we “see” the current data .
- The Bayes theorem gives the *posterior distribution* of , enclosing uncertainty about for given data: .
- is the marginal likelihood (evidence), given by .
- Inference based on is called *Bayesian inference*.

**Goal**: we wish to produce Bayesian inference on . In principle this involves writing down the analytic expression of and studying its properties. However, for models of realistic complexity, what is typically performed is some kind of Monte Carlo sampling of pseudo-random draws from the posterior . Then we can obtain a finite-sample approximation of the marginal posteriors (), compute the sample means of the marginals, quantiles, etc. This way we perform uncertainty quantification for all components of , for a given model and available data.

Now, the main problem preventing a straightforward sampling from the posterior is that for nonlinear, non-Gaussian SSM the likelihood function is not available in closed form, nor is it possible to derive a closed-form approximation. It is *intractable*. Let’s see why.

In an SSM, data are not independent; they are only *conditionally* independent. This means that we cannot write as a product of unconditional densities: instead we have

with the convention .

The problem is that all densities in the expression above are unavailable in closed form, hence unknown. If these were available we could either use an algorithm to find a (local) maximizer of (the maximum likelihood estimate of ), or plug the likelihood into the posterior and perform Bayesian inference.

The reason for the unavailability of a closed-form expression for the likelihood is the latency of process , on which data depend. In fact we have:

The expression above is *intractable* for two reasons:

- it is a -dimensional integral, and
- for most (nontrivial) models, the transition densities are **unknown**.

Basically the only way to solve the integral when gets large is via Monte Carlo methods. A special case amenable to closed-form computation is the linear SSM with Gaussian noise (see the Gaussian random walk example): for this case the Kalman filter can be employed to return exact maximum likelihood inference. In the SSM literature, important (Gaussian) approximations are given by the extended and unscented Kalman filters.
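
To illustrate the linear Gaussian special case, here is a sketch of the Kalman filter prediction-error decomposition for the scalar random walk example (assuming the standard form x_t = x_{t-1} + N(0, q), y_t = x_t + N(0, r), with initial state m0 and variance P0):

```python
import numpy as np

def kalman_loglik(y, q, r, m0=0.0, P0=0.0):
    """Exact log-likelihood of the scalar Gaussian random walk SSM via the
    Kalman filter: each observation contributes a Gaussian prediction-error
    term with innovation variance S."""
    m, P, ll = m0, P0, 0.0
    for yt in np.asarray(y, dtype=float):
        m_pred, P_pred = m, P + q                    # predict step
        S = P_pred + r                               # innovation variance
        ll += -0.5 * (np.log(2.0 * np.pi * S) + (yt - m_pred) ** 2 / S)
        K = P_pred / S                               # Kalman gain
        m = m_pred + K * (yt - m_pred)               # update step
        P = (1.0 - K) * P_pred
    return ll
```

For nonlinear or non-Gaussian models no such closed-form recursion exists, which is exactly why the Monte Carlo machinery of the following posts is needed.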

However, for nonlinear and non-Gaussian SSM, sequential Monte Carlo methods (or “particle filters”) are presently the state-of-the-art methodology. Sequential Monte Carlo (SMC) methods will be considered in a future post. SMC is a convenient tool to implement the pseudo-marginal methodology for exact Bayesian inference that I intend to outline in this first series of posts.

I have introduced minimal notions to set the inference problem for parameters of state space models (SSM). The final goal is to summarize a methodology for exact Bayesian inference, the pseudo-marginal method. This will be outlined in future posts, with focus on SSM. I have also stated some of the computational issues preventing a straightforward implementation of likelihood based methods for the parameters of SSM. In the next two posts I consider some Monte Carlo strategies for approximating the likelihood function.

An excellent, accessible introduction to filtering and parameter estimation for state space models (and recommended for self study) is Simo Särkkä (2013) Bayesian Filtering and Smoothing, Cambridge University Press. The author kindly provides free access to a PDF version at his webpage. Companion software is available at the publisher’s page (see the Resources tab).

State space modelling is a fairly standard topic that can be found treated in most (all?) signal processing and filtering references, so it does not make sense to expand the list of references for this post. However, I will add more suggestions in future posts, in connection with the further topics I introduce.

I will write about statistical inference methods and algorithms, typically (though not exclusively) for models that have some dynamic component. Posts will reflect personal research interests, notably computer-intensive Monte Carlo methods for parameter inference, (approximate) Bayesian methods and likelihood-free methods such as ABC.

My goal is to offer a simplified (though not simplistic) entry to challenging inference methods for stochastic modelling. For some of my posts, the ideal target readers are postgraduate students in mathematical statistics or, more generally, researchers with an adequate background in linear algebra and stochastic processes, with some ability to code. I will keep the notation and detail of the exposition at a fairly accessible level, avoiding measure-theoretic constructs and emphasizing computational and applied aspects.
