I will describe a breakthrough strategy for “exact-approximate” Bayesian inference. The apparent contradiction in the terminology is due to the surprising result in Beaumont (2003) and Andrieu and Roberts (2009) where it is shown that plugging a non-negative and unbiased stochastic approximation of the likelihood function into an MCMC sampler is enough to obtain exact Bayesian inference for model parameters.
The pseudo-marginal approach
I have decided to write about pseudo-marginal MCMC methods since the original literature introducing these methods can be fairly intimidating. Though this need not be. I will show how it is very simple to prove the validity of the approach. Theoretical difficulties may arise depending on the type of pseudo-marginal method we choose to consider for a specific case study. However the basic idea is fairly simple to describe and I proceed at a slow step-by-step pace. Other accessible intro to the topic are in the Resources section at the end of this post.
Suppose we wish to use Markov chain Monte Carlo (MCMC) to sample from a distribution having probability density (probability mass) function . Suppose that is not available in closed form and that its pointwise evaluation is not possible at any given . Finally, suppose that a non-negative and “unbiased” random estimator of is available (we define what we mean with unbiased estimator further below), and we denote this estimator with . Then an intuitive thing to do is to use in place of into a Metropolis-Hastings algorithm. This way the (approximate) Metropolis-Hastings ratio for a proposal becomes (see this introductory post to Metropolis-Hastings if necessary)
and is accepted with probability . Although it is not evident yet, what we have just written is an example of a pseudo-marginal method, but I will clarify this point below. Therefore in practice applying the method is relatively simple.
The astonishing result provided by a pseudo-marginal method is that the Markov chain constructed as above — where proposals are accepted according to instead of the true and unavailable acceptance ratio (where uses the exact ) — has as stationary distribution, the “exact” target. Therefore, we can use the approximate into an MCMC sampler, while still obtaining exact Monte Carlo sampling from .
In this post whenever we write “exact sampling” we mean “exact up to Monte Carlo error”, where the latter can be arbitrarily decreased by increasing the number of MCMC iterations.
A simple construction
We want to show that the Markov chain consisting in proposals accepted with probability has stationary distribution , that is as the generated chain is sampled from , with the number of MCMC iterations.
First of all we need to obtain a non-negative and unbiased random estimator of . We now take the opportunity to define what we mean with “unbiased estimator” in this context.
An unbiased estimator of should be such that for every , where denotes expectation. However which probability measure do we consider for this expectation? Answering this question shows why the topic of this post is particularly relevant for sampling from intractable distributions or, in Bayesian inference, sampling from posterior distributions having intractable likelihoods.
Before trying to find unbiased estimators let’s first see what unbiasedness mean when using Monte Carlo sampling.
It is often the case that while is unknown in closed form, it is instead easy to write the joint distribution of and some other random quantity we are not really interested in (we call it a “disturbance parameter” or an “auxiliary variable”). Let’s be their joint distribution. Assume the domain of the marginalized to be continuous (without loss of generality), then marginalization implies integration and
where basically the “unwanted” source of variability is integrated out. But then again we assume that analytic integration is out of reach.
A possibility to approximate the integral above is given by Monte Carlo methods, resulting into a stochastic approximation that we show to be unbiased. We simulate iid samples from the distribution of which we denote with
The second identity rewrites the joint distribution using both conditional probabilities and . Finally we carry out Monte Carlo integration. Notice, we do not need to know the expression of , but we need to be able to simulate from it. Clearly, the above assumes that we are able to evaluate at every (this construction naturally applies to state-space models, when we consider as “data” and the as “particles”).
From the second identity it should be clear that can be considered as the expectation of taken with respect to , as we have
and the notation emphasizes this fact.
Finally, it is easy to show that the Monte Carlo approximation gives an unbiased estimator of , as for any simulated we have shown that and therefore by denoting with the approximated density returned by Monte Carlo we define
this holding for any and for any number of samples . Also, the estimate is clearly non-negative.
Having discussed unbiasedness for simple Monte Carlo it is now easy to show how to use this fact for statistical inference.
Pseudo-marginal MCMC for exact Bayesian inference with intractable likelihoods
In the previous section we have considered a generic . Here we use Metropolis-Hastings in a Bayesian context to simulate from a specific distribution, the posterior distribution for a parameter , given data . From Bayes’ theorem we have , with the prior of , the likelihood function and the “evidence” (if you are interested in state-space models you might want to check an earlier post).
In most realistic modelling scenario, the likelihood function is unavailable in closed form and we say that the statistical model we are considering has an “intractable likelihood”. However it might be possible to obtain an unbiased non-negative approximation of the likelihood . We show that in this case it is also possible to obtain samples from , that is from the exact posterior of .
The previous section used Monte Carlo integration, by independently sampling from a distribution employing , an “auxiliary” variable we do not care for. Also in this case we consider sampling from an “augmented” space, which is the space over which the pair is defined.
Therefore we now artifically sample from the augmented approximate posterior (it is approximate because we have an approximate likelihood) even if we are ultimately interested in . In this context, in order to write the joint (augmented) posterior we need to introduce the prior for , which we call . Recall we are not interested in conducting inference for , therefore the definition of its prior is only guided by convenience as we are required to be able to sample from . Also, we assume and a-priori independent, that is their joint prior can be written as .
As mentioned above, we assume the availability of a non-negative unbiased approximation to the likelihood which we denote . By introducing the usual auxiliary variable, this approximation can be written as
Notice we took bacause of the assumed a-priori independence between and .
Then we require the key assumption of unbiasedness. Again we consider as an -expectation and we want that
Now that we have set these requirements on the likelihood function, we proceed to write the approximate augmented posterior:
[Notice we put not at the denominator: this follows from the unbiasedness assumption as we have and the latter is by definition the evidence .]
We know the exact (unavailable) posterior of is
therefore the marginal likelihood (evidence) is
and then we plug the evidence into :
Now, we know that applying an MCMC targeting then discarding the output pertaining to corresponds to integrating-out from the posterior, that is marginalize out. Therefore, assuming we use MCMC to perform such marginalization, the draws we are returned with have the following stationary distribution
Notice we have recovered the exact posterior of .
We have therefore performed a pseudo-marginal approach: “marginal” because we disregard ; pseudo because we use not .
In conclusion we have shown that using MCMC on an (artificially) augmented posterior, then discard from the output all the auxiliary variates we generated during the entire MCMC procedure, implies that the sampled parameter draws are the product of an exact Bayesian inference procedure for .
Besides the theoretical construction we have described, in practice all that is needed is running a regular MCMC for , without caring of storing the generated samples for .
Pseudo-marginal MCMC methods are due to Beaumont (2003) and Andrieu and Roberts (2009) and have revolutionised the application of Bayesian inference for models having intractable likelihoods. In my opinion they constitute one of the most important statistical advancements of the last thirty years.
In particular, the methodology has found an enormous success in inference for state-space models, as sequential Monte Carlo produces unbiased approximations of the likelihood for any number of particles (though unbiasedness is not trivial to prove in this case, but see Pitt et al or section 4A in Naesseth et al. which are way more accessible than the original Proposition 7.4.2 of Del Moral (2004)). It is even possible to obtain exact inference simultaneously for the states and the parameters of state-space models.
However, as shown above, the generality of the approach makes pseudo-marginal methods appealing for a wide range of scenarios, including approximate Bayesian computation and other methods for likelihood-free inference such as synthetic likelihoods.
Of extreme importance is the fact that the results presented above hold irrespectively of how we have obtained the unbiased approximation of the likelihood. If has been approximated via Monte Carlo using a “large” number of auxiliary variates or a “small” , the theoretical result still hold and we sample exactly from regardless of . However, for practical applications tuning appropriately is important, as its value influence the precision of the likelihood approximation which in turn has an effect on the mixing of the chain. See this post and references therein.
- Dahlin and Schön 2017: intro paper focussed on state-space models and computer implementation in R;
- D. Wilkinson’s blog post on particle marginal methods for parameter inference;
- D. Wilkinson’s blog post on particle marginal methods for parameter and state inference in state-space models.
- Fasiolo et al. 2016 : paper comparing several algorithmic strategies for intractable likelihoods.
- Technical papers on tuning particle marginal algorithms: Doucet et al. 2012; Pitt et al 2012; Sherlock et al. 2015; Sherlock 2016;
- Some demo software: my Matlab example; my R code (see the pomp_ricker-pmcmc file);
- Some serious R software: the pomp package (function pmcmc); smfsb (type demo(“PMCMC”) at command line); Nimble.
- More serious software for state-space models: LibBi; Biips (not sure if this one is actively maintained).