## Why and how pseudo-marginal MCMC work for exact Bayesian inference

I will describe a breakthrough strategy for “exact-approximate” Bayesian inference. The apparent contradiction in the terminology is due to the surprising result in Beaumont (2003) and Andrieu and Roberts (2009) where it is shown that plugging a non-negative and unbiased stochastic approximation of the likelihood function into an MCMC sampler is enough to obtain exact Bayesian inference for model parameters.

## The pseudo-marginal approach

I have decided to write about pseudo-marginal MCMC methods since the original literature introducing these methods can be fairly intimidating. Though this need not be. I will show how it is very simple to prove the validity of the approach. Theoretical difficulties may arise depending on the type of pseudo-marginal method we choose to consider for a specific case study. However the basic idea is fairly simple to describe and I proceed at a slow step-by-step pace. Other accessible intro to the topic are in the Resources section at the end of this post.

Suppose we wish to use Markov chain Monte Carlo (MCMC) to sample from a distribution having probability density (probability mass) function $\pi(x)$. Suppose that $\pi(x)$ is not available in closed form and that its pointwise evaluation is not possible at any given $x$. Finally, suppose that a non-negative and “unbiased” random estimator of $\pi(x)$ is available (we define what we mean with unbiased estimator further below), and we denote this estimator with $\hat{\pi}(x)$. Then an intuitive thing to do is to use $\hat{\pi}(x)$ in place of ${\pi}(x)$ into a Metropolis-Hastings algorithm. This way the (approximate) Metropolis-Hastings ratio for a proposal $x'\sim q(x'|x)$ becomes (see this introductory post to Metropolis-Hastings if necessary)

$\hat{r}:=\frac{\hat{\pi}(x')}{\hat{\pi}(x)}\frac{q(x|x')}{q(x'|x)}$

and $x'$ is accepted with probability $\min(1,\hat{r})$. Although it is not evident yet, what we have just written is an example of a pseudo-marginal method, but I will clarify this point below. Therefore in practice applying the method is relatively simple.

The astonishing result provided by a pseudo-marginal method is that the Markov chain constructed as above — where proposals $x'$ are accepted according to $\hat{r}$ instead of the true and unavailable acceptance ratio $r$ (where $r$ uses the exact $\pi(\cdot)$) — has $\pi(x)$ as stationary distribution, the “exact” target. Therefore, we can use the approximate $\hat{\pi}(x)$ into an MCMC sampler, while still obtaining exact Monte Carlo sampling from $\pi(x)$.

In this post whenever we write “exact sampling” we mean “exact up to Monte Carlo error”, where the latter can be arbitrarily decreased by increasing the number of MCMC iterations.

## A simple construction

We want to show that the Markov chain consisting in proposals accepted with probability $\min(1,\hat{r})$ has stationary distribution $\pi(x)$, that is as $R\rightarrow \infty$ the generated chain is sampled from $\pi(x)$, with $R$ the number of MCMC iterations.

First of all we need to obtain a non-negative and unbiased random estimator of $\pi(x)$. We now take the opportunity to define what we mean with “unbiased estimator” in this context.

### Unbiasedness

An unbiased estimator $\hat{\pi}(x)$ of $\pi(x)$ should be such that $\mathbb{E}(\hat{\pi}(x))=\pi(x)$ for every $x$, where $\mathbb{E}$ denotes  expectation. However which probability measure do we consider for this expectation? Answering this question shows why the topic of this post is particularly relevant for sampling from intractable distributions or, in Bayesian inference, sampling from posterior distributions having intractable likelihoods.

Before trying to find unbiased estimators $\hat{\pi}(x)$ let’s first see what unbiasedness mean when using Monte Carlo sampling.

It is often the case that while $\pi(x)$ is unknown in closed form, it is instead easy to write the joint distribution of $x$ and some other random quantity $\xi$ we are not really interested in (we call it a “disturbance parameter” or an “auxiliary variable”). Let’s $\pi(x,\xi)$ be their joint distribution. Assume the domain of the marginalized $\xi$ to be continuous (without loss of generality), then marginalization implies integration and

$\pi(x)=\int \pi(x,\xi)d\xi$

where basically the “unwanted” source of variability $\xi$ is integrated out. But then again we assume that analytic integration is out of reach.

A possibility to approximate the integral above is given by Monte Carlo methods, resulting into a stochastic approximation that we show to be unbiased. We simulate $N$ iid samples $\xi^1,...,\xi^N$ from the distribution of $\xi$ which we denote with $p(\xi)$

${\pi}(x)=\int {\pi}(x,\xi)d\xi =\int {\pi}(x|\xi)p(\xi)d\xi \approx \frac{1}{N}\sum_{n=1}^N {\pi}(x|\xi^n), \quad \xi^n\sim_{\mathrm{iid}} p(\xi),\quad n=1,...,N$

The second identity rewrites the joint distribution using both conditional probabilities and $p(\xi)$. Finally we  carry out Monte Carlo integration. Notice, we do not need to know the expression of $p(\xi)$, but we need to be able to simulate from it. Clearly, the above assumes that we are able to evaluate $\pi(x|\xi)$ at every $x$ (this construction naturally applies to state-space models, when we consider $x$ as “data” and the $\xi^1,...,\xi^N$ as “particles”).

From the second identity it should be clear that $\pi(x)$ can be considered as the expectation of $\pi(x|\xi)$  taken with respect to $p(\xi)$, as we have

${\pi}(x)=\int {\pi}(x|\xi)p(\xi)d\xi = \mathbb{E}_\xi [{\pi}(x|\xi)]$ and the notation $\mathbb{E}_\xi$ emphasizes this fact.

Finally, it is easy to show that the Monte Carlo approximation gives an unbiased estimator of $\pi(x)$, as for any simulated $\xi^n$ we have shown that $\mathbb{E}_\xi [{\pi}(x|\xi^n)] = \pi(x)$ and therefore by denoting with $\hat{\pi}(x)$ the approximated density returned by Monte Carlo we define

$\hat{\pi}(x):=\frac{1}{N}\sum_{n=1}^N {\pi}(x|\xi^n)$ and

$\mathbb{E}_\xi [\hat{\pi}(x)]=\mathbb{E}_\xi [\frac{1}{N}\sum_{n=1}^N {\pi}(x|\xi^n)]=\frac{1}{N}\sum_{n=1}^N \mathbb{E}_\xi [{\pi}(x|\xi^n)]=\pi(x)$

this holding for any $x$ and for any number of samples $N\geq 1$. Also, the estimate is clearly non-negative.

Having discussed unbiasedness for simple Monte Carlo it is now easy to show how to use this fact for statistical inference.

### Pseudo-marginal MCMC for exact Bayesian inference with intractable likelihoods

In the previous section we have considered a generic $\pi(x)$. Here we use Metropolis-Hastings in a Bayesian context to simulate from a specific distribution, the posterior distribution $\pi(\theta|y)$ for a parameter $\theta\in\Theta$, given data $y$. From Bayes’ theorem we have  $\pi(\theta|y)=\pi(\theta)p(y|\theta)/p(y)$, with $\pi(\theta)$ the prior of $\theta$, $p(y|\theta)$ the likelihood function and $p(y)$ the “evidence” (if you are interested in state-space models you might want to check an earlier post).

In most realistic modelling scenario, the likelihood function is unavailable in closed form and we say that the statistical model we are considering has an “intractable likelihood”. However it might be possible to obtain an unbiased non-negative approximation of the likelihood $\hat{p}(y|\theta)$. We show that in this case it is also possible to obtain $R$ samples $\theta_1,...,\theta_R$ from $\pi(\theta|y)$, that is from the exact posterior of $\theta$.

The previous section used Monte Carlo integration, by independently sampling from a distribution employing $\xi\in \mathcal{A}$, an “auxiliary” variable we do not care for. Also in this case we consider sampling from an “augmented” space, which is the space $\Theta\times \mathcal{A}$ over which the pair $(\theta,\xi)$ is defined.

Therefore we now artifically sample from the augmented approximate posterior (it is approximate because we have an approximate likelihood) $\hat{\pi}(\theta,\xi|y)$ even if we are ultimately interested in $\pi(\theta|y)$. In this context, in order to write the joint (augmented) posterior we need to introduce the prior for $\xi$, which we call $p(\xi)$. Recall we are not interested in conducting inference for $\xi$, therefore the definition of its prior is only guided by convenience as we are required to be able to sample from $p(\xi)$. Also, we assume $\theta$ and $\xi$ a-priori independent, that is their joint prior can be written as $\pi(\theta)p(\xi)$.

As mentioned above, we assume the availability of a non-negative unbiased approximation to the likelihood which we denote $\hat{p}(y|\theta)$. By introducing the usual auxiliary variable, this approximation can be written as

$\hat{p}(y|\theta)=\int \hat{p}(y,\xi|\theta)d\xi=\int \hat{p}(y|\xi,\theta)p(\xi|\theta)d\xi=\int \hat{p}(y|\xi,\theta)p(\xi)d\xi$.

Notice we took $p(\xi|\theta)\equiv p(\xi)$ bacause of the assumed a-priori independence between $\theta$ and $\xi$.

Then we require the key assumption of unbiasedness. Again we consider $\hat{p}(y|\theta)$ as an $\mathbb{E}_\xi$-expectation and we want that

$\hat{p}(y|\theta)= \mathbb{E}_\xi(\hat{p}(y|\theta,\xi))=\int \hat{p}(y|\theta,\xi)p(\xi)d\xi=p(y|\theta)$.

Now that we have set these requirements on the likelihood function, we proceed to write the approximate augmented posterior:

$\hat{\pi}(\theta,\xi|y)=\frac{\hat{p}(y|\xi,\theta)p(\xi)\pi(\theta)}{p(y)}$

[Notice we put $p(y)$ not $\hat{p}(y)$ at the denominator: this follows from the unbiasedness assumption as we have $\int\int \hat{p}(y|\theta,\xi)p(\xi)\pi(\theta)d\xi d\theta = \int \pi(\theta)\{\int \hat{p}(y|\theta,\xi)p(\xi)d\xi\}d\theta=\int \pi(\theta)p(y|\theta)d\theta$ and the latter is by definition the evidence $p(y)$.]

We know the exact (unavailable) posterior of $\theta$ is

${\pi}(\theta|y)=\frac{{p}(y|\theta)\pi(\theta)}{p(y)}$

therefore the marginal likelihood (evidence) is

$p(y)=\frac{p(y|\theta)\pi(\theta)}{\pi(\theta|y)}$ and then we plug the evidence into $\hat{\pi}(\theta,\xi|y)$:

$\hat{\pi}(\theta,\xi|y)=\frac{\hat{p}(y|\theta,\xi)p(\xi)\pi(\theta)}{p(y)}=\frac{\pi(\theta|y)\hat{p}(y|\theta,\xi)p(\xi)\pi(\theta)}{p(y|\theta)\pi(\theta)}=\frac{\pi(\theta|y)\hat{p}(y|\theta,\xi)p(\xi)}{p(y|\theta)}.$

Now, we know that applying an MCMC targeting $\hat{\pi}(\theta,\xi|y)$ then discarding the output pertaining to $\xi$ corresponds to integrating-out $\xi$ from the posterior, that is marginalize $\xi$ out. Therefore, assuming we use MCMC to perform such marginalization, the draws $\theta_1,...,\theta_R$ we are returned with have the following stationary distribution

$\int\hat{\pi}(\theta,\xi|y)d\xi = \frac{\pi(\theta|y)}{p(y|\theta)}\int \hat{p}(y|\theta,\xi)p(\xi)d\xi=\frac{\pi(\theta|y)}{p(y|\theta)}p(y|\theta)=\pi(\theta|y)$.

Notice we have recovered the exact posterior of $\theta$.

We have therefore performed a pseudo-marginal approach: “marginal” because we disregard $\xi$; pseudo because we use $\hat{p}(\cdot)$ not $p(\cdot)$.

In conclusion we have shown that using MCMC on an (artificially) augmented posterior, then discard from the output all the auxiliary variates we generated during the entire MCMC procedure, implies that the sampled parameter draws $\theta_1,...,\theta_R$ are the product of an exact Bayesian inference procedure for $\theta$.

Besides the theoretical construction we have described,  in practice all that is needed is running a regular MCMC for $\theta$, without caring of storing the generated samples for $\xi$.

Pseudo-marginal MCMC methods are due to Beaumont (2003) and Andrieu and Roberts (2009) and have revolutionised the application of Bayesian inference for models having intractable likelihoods. In my opinion they constitute one of the most important statistical advancements of the last thirty years.

In particular, the methodology has found an enormous success in inference for state-space models, as sequential Monte Carlo produces unbiased approximations of the likelihood for any number of particles $N$ (though unbiasedness is not trivial to prove in this case, but see Pitt et al or section 4A in Naesseth et al. which are way more accessible than the original Proposition 7.4.2 of Del Moral (2004)). It is even possible to obtain exact inference simultaneously for the states and the parameters of state-space models.

However, as shown above, the generality of the approach makes pseudo-marginal methods appealing for a wide range of scenarios, including approximate Bayesian computation and other methods for likelihood-free inference such as synthetic likelihoods.

Of extreme importance is the fact that the results presented above hold irrespectively of how we have obtained the unbiased approximation of the likelihood. If $\hat{p}(y|\theta)$ has been approximated via Monte Carlo using a “large” number $N$ of auxiliary variates or a “small” $N$, the theoretical result still hold and we sample exactly from $\pi(\theta|y)$ regardless of $N$. However, for practical applications tuning $N$ appropriately is important, as its value influence the precision of the likelihood approximation which in turn has an effect on the mixing of the chain. See this post and references therein.