I will suggest several tips and discuss common beginners’ mistakes occurring when coding a Metropolis-Hastings algorithm from scratch. I am making this list off the top of my head, so feel free to propose additions by commenting on this post. Most items are related to coding practice rather than actual statistical methodology, and are often not mentioned in widely available literature. If you are an instructor, you might find it useful to mention these issues in your class, to avoid students getting puzzled when a mathematically flawless algorithm misbehaves because of its implementation.
Notice, this post does not discuss the design of a Metropolis-Hastings sampler: that is, important issues affecting MCMC, such as bad mixing of the chain, multimodality, or the construction of “ad hoc” proposal distributions, are not considered.
I am assuming the reader already knows what a Metropolis-Hastings algorithm is, what it is useful for, and that she has already seen (and perhaps tried coding) some implementation of this algorithm. Even given this background knowledge, there are some subtleties that are good to know and that I will discuss below.
Briefly, what we wish to accomplish with Markov chain Monte Carlo (MCMC) is to sample (approximately) from a distribution having density or probability mass function $\pi(x)$. Here we focus on the Metropolis-Hastings algorithm (MH). In order to do so, assume that our Markov chain is currently in “state” $x_t$, and we wish to propose a move. This move is generated from a proposal kernel $q(\cdot|x_t)$. We write the proposed value as $x^* \sim q(x^*|x_t)$. Then $x^*$ is accepted with probability $\alpha$ or rejected with probability $1-\alpha$, where

$\alpha = \min\left(1, \frac{\pi(x^*)\,q(x_t|x^*)}{\pi(x_t)\,q(x^*|x_t)}\right).$

Then $\alpha$ is called the acceptance probability.
A very simplified pseudo-algorithm for a single step of MH looks like:
propose $x^* \sim q(x^*|x_t)$
accept $x^*$ with probability $\alpha$, setting $x_{t+1} := x^*$
otherwise reject, setting $x_{t+1} := x_t$
However, since we wish to sample say $R$ times from $\pi$, we better store the result of our simulations into a matrix draws and write (assuming an arbitrary initial state $x_0$)
for r = 1 to R
    propose $x^* \sim q(x^*|x)$
    compute $\alpha = \min\left(1, \frac{\pi(x^*)\,q(x|x^*)}{\pi(x)\,q(x^*|x)}\right)$
    accept with probability $\alpha$: then store it into draws[r] := $x^*$, set $x := x^*$, set $\pi(x) := \pi(x^*)$
    otherwise store draws[r] := $x$
end
Here $:=$ means “copy the value at the right-hand side into the left-hand side”. After a few (sometimes many) iterations the chain reaches stationarity, and its stationary distribution happens to have $\pi$ as probability density (or probability mass function). In the end we will explore the content of draws and disregard some of the initial values, to wash away the influence of the arbitrary $x_0$.
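As a concrete illustration, the loop above can be sketched in Python as follows; the Gaussian target, the step length, and names such as `pi_unnorm` and `mh_sampler` are illustrative choices of mine, not part of the pseudo-code:

```python
import random
import math

def pi_unnorm(x):
    # unnormalised target density: a standard Gaussian (illustrative choice)
    return math.exp(-0.5 * x * x)

def mh_sampler(R, x0, sigma=1.0, seed=42):
    """Metropolis-Hastings with a Gaussian random-walk proposal
    q(x*|x) = N(x, sigma^2), which is symmetric and cancels in alpha."""
    rng = random.Random(seed)
    x = x0
    pi_x = pi_unnorm(x)          # saved so it is not recomputed (see point 3)
    draws = []
    for _ in range(R):
        x_star = x + sigma * rng.gauss(0.0, 1.0)   # propose x* ~ q(.|x)
        pi_star = pi_unnorm(x_star)
        alpha = min(1.0, pi_star / pi_x)
        if rng.random() < alpha:                   # accept with probability alpha
            x, pi_x = x_star, pi_star
        draws.append(x)                            # store the current state either way
    return draws

draws = mh_sampler(R=20000, x0=0.0)
```

In practice one would discard an initial portion of draws as burn-in before computing summaries.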
1- You are given the pseudo-code above but it’s still unclear how to use it for sampling from $\pi$
The very first time I heard of MH (self study and no guidance available) was in 2008. So I thought “ok let’s google some article!” and I recall that all resources I happened to screen were simply stating (just as I wrote above) “accept the move with probability $\alpha$ and reject it otherwise”. Now, if you are new to stochastic simulation the quoted sentence looks quite mysterious. The detail that was missing in my literature search was the logical link between the concept of acceptance probability and that of sampling.
Here is the link: you have to decide between two possible outcomes of a random decision: accept or reject a move, and each of these two outcomes has its own probability ($\alpha$ for accepting and $1-\alpha$ for rejecting). So our decision is a random variable taking two possible values, hence it is a Bernoulli random variable with “parameter” $\alpha$. And here is the catch: to sample from a discrete distribution (like a Bernoulli distribution) you can use the inverse transform method, which in the special case of a binary decision translates to (i) sample a pseudo-random number $u$ from a uniform distribution on the interval $[0,1]$, that is $u \sim U(0,1)$, which can be obtained in R via runif(1), and (ii) if $u \leq \alpha$ accept the move, otherwise reject it. That’s it (*).
(*) Well, yes that’s it. However the coding of the implementation can be simplified to just: if $u < \frac{\pi(x^*)\,q(x_t|x^*)}{\pi(x_t)\,q(x^*|x_t)}$ then accept the move, otherwise reject it. This is because $u \leq 1$, hence whenever the ratio is larger than 1, then certainly $u$ is smaller than $\alpha = 1$, and there is no need to compute the $\min$ explicitly.
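The inverse transform decision can be checked empirically with a few lines of Python; the helper name `accept_move` is mine, for illustration:

```python
import random

def accept_move(ratio, rng):
    """Binary decision via the inverse transform method: draw u ~ U(0,1)
    and accept iff u < min(1, ratio); the min can be dropped since u <= 1."""
    u = rng.random()            # the equivalent of runif(1) in R
    return u < ratio

# sanity check: with ratio = 0.3 the long-run acceptance frequency is ~0.3
rng = random.Random(1)
freq = sum(accept_move(0.3, rng) for _ in range(100000)) / 100000
```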
In conclusion, for a single MH step we substitute (and for simplicity here we remove the dependency on the iteration index $r$ we introduced in the previous code)

compute $\alpha = \min\left(1, \frac{\pi(x^*)\,q(x|x^*)}{\pi(x)\,q(x^*|x)}\right)$, accept with probability $\alpha$

with

simulate $u \sim U(0,1)$; if $u < \frac{\pi(x^*)\,q(x|x^*)}{\pi(x)\,q(x^*|x)}$ then set $x := x^*$ and $\pi(x) := \pi(x^*)$
2- Do not keep only the accepted draws!
A common mistake is not advancing the iteration counter until we obtain an acceptance, i.e. keeping only the accepted draws. This is wrong (hence if you ever coded a MH algorithm using a while loop it is fairly likely you may have acted the wrong way). In the implementation given in the introduction to this post (the one using the for loop) you may notice that when a proposal is rejected I store the last accepted value again, as draws[r] := $x$: rejections must produce repeated values in the chain.
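A quick way to check your own implementation: a correct chain stores exactly one value per iteration, and rejections show up as consecutive repeats. A Python sketch with an illustrative Gaussian target:

```python
import random
import math

def rw_mh(R, x0, sigma, seed=0):
    """Random-walk MH on a standard Gaussian target; on rejection the
    CURRENT state is stored again -- we never 'wait' for an acceptance."""
    rng = random.Random(seed)
    x = x0
    draws = []
    for _ in range(R):
        x_star = x + sigma * rng.gauss(0.0, 1.0)
        ratio = math.exp(0.5 * (x * x - x_star * x_star))  # pi(x*)/pi(x)
        if rng.random() < ratio:
            x = x_star
        draws.append(x)        # stored at EVERY iteration, accepted or not
    return draws

draws = rw_mh(R=5000, x0=0.0, sigma=2.5)
repeats = sum(a == b for a, b in zip(draws, draws[1:]))
```

With a large step length many proposals are rejected, so `repeats` is substantial; keeping only accepted draws would silently discard that information and bias the sample.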
3- Save and reuse the values at the denominator
If you compare the pseudo-codes suggested above you may have noticed that when a proposal is accepted I “save” $\pi(x^*)$ via $\pi(x) := \pi(x^*)$, in addition to storing the accepted proposal $x^*$. While it is not necessary to do the former (and in fact most descriptions of MH in the literature do not mention this part) it is a waste of computer time not to keep in memory the accepted $\pi(x^*)$, so that we do not need to recompute it at the next iteration (since it is reused at the denominator of $\alpha$).
This is especially relevant when MH is used for Bayesian inference, where $\pi$ is a posterior distribution such that $\pi(x|\mathrm{data}) \propto \pi(x)\,p(\mathrm{data}|x)$ for some “prior” $\pi(x)$ and “likelihood function” $p(\mathrm{data}|x)$. In fact, it is often the case that the pointwise evaluation of the likelihood is far from instantaneous, e.g. because the data has thousands of values, or the evaluation of the likelihood itself requires some numerical approximation.
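The saving can be sketched as follows in Python; the function `log_post` stands in for an expensive posterior evaluation, and all names are mine, for illustration only:

```python
import random
import math

EVALS = {"count": 0}

def log_post(x):
    """Stand-in for an expensive (unnormalised) log-posterior."""
    EVALS["count"] += 1
    return -0.5 * x * x

def mh_cached(R, x0, sigma=1.0, seed=3):
    rng = random.Random(seed)
    x = x0
    lp_x = log_post(x)                 # evaluated ONCE here...
    draws = []
    for _ in range(R):
        x_star = x + sigma * rng.gauss(0.0, 1.0)
        lp_star = log_post(x_star)     # ...and once per iteration
        if math.log(rng.random()) < lp_star - lp_x:
            x, lp_x = x_star, lp_star  # reuse lp_star as the next denominator
        draws.append(x)
    return draws

mh_cached(R=1000, x0=0.0)
# exactly R + 1 posterior evaluations instead of 2R
```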
4- Code your calculations in the log-domain.
Depending on cases, $\pi(x)$ may sometimes take very large or very small values. Such extremely large/small values may cause a numeric overflow or an underflow, that is your computing environment will be unable to represent these numbers correctly. For example a standard Gaussian density is strictly positive for all real $x$, but if you evaluate it at $x = 40$ your software will likely return zero (this is an underflow).
For example, in R type dnorm(40) and you’ll see it returns zero.
Therefore instead of using

if $u < \frac{\pi(x^*)\,q(x|x^*)}{\pi(x)\,q(x^*|x)}$ accept, otherwise reject

use

if $\log u < \log\pi(x^*) + \log q(x|x^*) - \log\pi(x) - \log q(x^*|x)$ accept, otherwise reject.
Anyhow, while this recommendation is valid, it might not be enough without further precautions, as described in the next point.
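To see why the log-domain matters, here is a Python sketch with an illustrative Gaussian target evaluated deep in its tail; the two states and the name `logpi` are assumptions of mine:

```python
import math

def logpi(x):
    # unnormalised standard Gaussian log-density (illustrative target)
    return -0.5 * x * x

x, x_star = 40.0, 39.0                 # deep in the tail of the target

# density-domain: both values underflow to exactly 0.0,
# so the ratio pi(x*)/pi(x) is the meaningless 0/0
assert math.exp(logpi(x)) == 0.0
assert math.exp(logpi(x_star)) == 0.0

# log-domain: the comparison is perfectly well behaved
log_ratio = logpi(x_star) - logpi(x)   # 0.5*(40^2 - 39^2) = 39.5
```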
5- Do as many analytic calculations as feasible in order to prevent overflow/underflow.
Strongly connected to the previous point is coding without doing your calculations analytically first. That is, I see beginners trying to use the log-domain suggestion from point 4 by evaluating something like log(dnorm(x)). This will not help, because the log is applied to a function that has already underflowed, and the logarithm of zero is $-\infty$.
In R you can use dnorm(x, log=TRUE) to compute $\log\pi(x) = -\frac{1}{2}\log(2\pi) - \frac{x^2}{2}$, which is the safe way to evaluate the (normalised) standard Gaussian log-density. For other languages you may need to code the log-density yourself.
However, since we are interested in using MH, we may even just code the unnormalised log-density

$\log\pi(x) = -\frac{x^2}{2} + \mathrm{const},$

where I have omitted the irrelevant (for MH) normalization constant $-\frac{1}{2}\log(2\pi)$. This brings us to the next point.
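The same point in Python (where math.log(0.0) raises an error, just as log(0) is -Inf in R); the evaluation point $x = 40$ is an illustrative choice:

```python
import math

x = 40.0

# naive: evaluate the density first, then take the log.
# The density has already underflowed to exactly 0.0,
# so log(density) would be -inf (or an error, depending on the language).
density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# analytic: the equivalent of R's dnorm(x, log=TRUE)
log_density = -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x
```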
6- Remove proportionality constants.
An important feature of MH is that it can sample from distributions that are known up to proportionality constants (these constants are terms independent of $x$). This means that $\pi$ can be an unnormalised density (or an unnormalised probability mass function). As mentioned in the previous point, we do not strictly need to write $\log\pi(x) = -\frac{1}{2}\log(2\pi) - \frac{x^2}{2}$ but can get rid of the normalization constant, which anyhow simplifies out when taking the ratio $\pi(x^*)/\pi(x)$.
The simplification of proportionality constants is not just a matter of coding elegance, but can affect the performance of the algorithm. For example if $\pi$ is a Gamma distribution, such as $\pi(x) = \frac{b^a}{\Gamma(a)}x^{a-1}e^{-bx}$, then it contains the Gamma function $\Gamma(a)$ as part of the proportionality constant. When $a$ is not an integer but a positive real number, the evaluation of $\Gamma(a)$ involves the numerical approximation of an integral. Now, this approximation is not onerous if executed only a few times. However, often MCMC algorithms are run for millions of iterations, and in such cases we can save computational time by removing constants we don’t need anyway. In conclusion, you can safely avoid coding $\frac{b^a}{\Gamma(a)}$ (or its logarithm) as it is independent of $x$.
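The claim can be verified directly in Python; the shape and rate parameters and the two evaluation points are illustrative:

```python
import math

A, B = 2.7, 1.3   # illustrative shape and rate parameters

def log_gamma_unnorm(x):
    """Unnormalised Gamma(a, b) log-density: constants in a, b are dropped."""
    return (A - 1.0) * math.log(x) - B * x

def log_gamma_full(x):
    """Fully normalised version, kept only to check the claim."""
    return A * math.log(B) - math.lgamma(A) + (A - 1.0) * math.log(x) - B * x

# the acceptance ratio only sees DIFFERENCES of log-densities,
# which are identical with or without the normalizing constant
x, x_star = 1.5, 2.0
diff_unnorm = log_gamma_unnorm(x_star) - log_gamma_unnorm(x)
diff_full = log_gamma_full(x_star) - log_gamma_full(x)
```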
7- Simplify out symmetric proposal functions
A symmetric proposal function is such that $q(x^*|x) = q(x|x^*)$ for every $x$ and $x^*$. Therefore symmetric proposals do not need to be coded when evaluating the ratio in $\alpha$, as they simplify out.
For example, assume a multivariate Gaussian proposal function written as (with some abuse of notation) $q(x^*|x) = N(x, \Sigma)$, that is a Gaussian density with mean $x$ and covariance matrix $\Sigma$. Then, with $d$ the dimension of $x$, we have

$q(x^*|x) = (2\pi)^{-d/2}|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}(x^*-x)^{\top}\Sigma^{-1}(x^*-x)\right),$

where $|\cdot|$ denotes the determinant of a matrix and $\top$ denotes transposition. Clearly $q(x^*|x) = q(x|x^*)$, because the quadratic form at the exponent is unchanged when $x$ and $x^*$ are swapped.
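A quick numerical check of the symmetry, in Python, with an illustrative one-dimensional Gaussian proposal (states and step length are my choices):

```python
import math

def q(y, mean, sigma):
    """Gaussian proposal density N(mean, sigma^2), written out explicitly."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

x, x_star, sigma = 0.8, 2.3, 1.5    # illustrative values
forward = q(x_star, x, sigma)       # q(x*|x)
backward = q(x, x_star, sigma)      # q(x|x*)
# the exponent only involves (x* - x)^2, so the two evaluations coincide,
# and neither needs to be coded inside the acceptance ratio
```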
8a- Remember the transformation Jacobians!
Most previous recommendations are related to numerical issues in the implementation of MH. However this one alerts you to an actual statistical mistake. I introduce the problem using a specific example.
A Gaussian proposal function is a simple choice that often works well in problems where $x$ is low dimensional (e.g. $x$ is a vector of length up to, say, five elements) and the geometry of $\pi$ is not too complicated. The fact that a Gaussian proposal has support on the entire real space makes it a useful default choice. However, when some elements of $x$ are constrained to “live” in a specific subset of the real space, we should be aware that a significant proportion of proposed values will be rejected (unless we “design” $q$ to intelligently explore only a promising region of the space). Wasting computational time is not something we want, right?
For the sake of illustration, let’s assume that $x$ is a positive scalar random variable (see point 8b below for a more general discussion). We could code a Metropolis random walk sampler with Gaussian innovations as

$x^* := x_t + \sigma\xi,$

where the “step length” $\sigma$ is chosen by the experimenter and $\xi$ is sampled from a standard Gaussian. Therefore we have that $x^* \sim N(x_t, \sigma^2)$. Clearly $x^*$ will certainly be rejected whenever $x^* \leq 0$, since we are assuming that the support of the targeted $\pi$ is positive.
A simple workaround is to propose $\log x^*$ instead of $x^*$, that is code

$\log x^* := \log x_t + \sigma'\xi$

for some other appropriate $\sigma'$. Then we evaluate $\pi$ at $x^* = \exp(\log x^*)$, that is we exponentiate the proposal before “submitting” it to $\pi$. Now if this is all we do, we have a problem.
Thing is, we proposed $\log x^*$ from a Gaussian, and then we thought that exponentiating the proposal is just fine. It is almost fine, but we must realise that the exponential of a Gaussian draw is log-normal distributed, and we must account for this fact in the acceptance ratio! In other words, we sampled $\log x^*$ using a proposal function that here I denote $q(\log x^*|\log x_t)$ (which is Gaussian), and then by exponentiating we effectively sampled $x^*$ from another proposal $\tilde{q}(x^*|x_t)$ (which is log-normal).
Therefore, from the definition of the log-normal density we have (recall in this example $x$ is scalar)

$\tilde{q}(x^*|x_t) = \frac{1}{x^*\sigma'\sqrt{2\pi}}\exp\left(-\frac{(\log x^* - \log x_t)^2}{2\sigma'^2}\right),$

or equivalently $\tilde{q}(x^*|x_t) = q(\log x^*|\log x_t)/x^*$.
Then we notice that this proposal function is not symmetric (see point 7): $\tilde{q}(x^*|x_t) \neq \tilde{q}(x_t|x^*)$. Hence the correct sampler does the following: simulate a $u \sim U(0,1)$ and, using all the recommendations above, we accept $x^*$ if

$\log u < \log\pi(x^*) - \log\pi(x_t) + \log x^* - \log x_t$

and reject otherwise.
But how would we proceed if we were not given the information that the exponential of a Gaussian variable is a log-normal variable? A simple result on the transformation of random variables will assist us, when the transformation is an invertible function $g$. Say that for a generic $y$ with density $q_y$ we seek the distribution of $z = g(y)$; using the transformation rule linked above we have that $z$ has density function $q_z(z) = q_y(g^{-1}(z))\left|\frac{dg^{-1}(z)}{dz}\right|$. Notice $\left|\frac{dg^{-1}(z)}{dz}\right|$ here is the “Jacobian of the transformation”.
Therefore, we should not forget the “Jacobians” when transforming the proposed draw. In our example above $g = \exp$, the Jacobian is $|d\log x^*/dx^*| = 1/x^*$, and the extra term $x^*/x_t$ in the acceptance probability is exactly the ratio of Jacobians.
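Putting point 8a together, here is a Python sketch of a log-scale random walk with the Jacobian correction; the Exponential(1) target is an illustrative choice of mine. Without the `+ log(x*) - log(x)` term the chain would target the wrong distribution:

```python
import random
import math

def logpi(x):
    # illustrative positive-support target: Exponential(1), log-density -x
    return -x

def log_scale_mh(R, x0, sigma, seed=11):
    """Random walk on log(x); the acceptance test includes the Jacobian
    correction log(x*) - log(x)."""
    rng = random.Random(seed)
    x = x0
    draws = []
    for _ in range(R):
        x_star = math.exp(math.log(x) + sigma * rng.gauss(0.0, 1.0))
        log_alpha = (logpi(x_star) - logpi(x)
                     + math.log(x_star) - math.log(x))   # Jacobian term
        if math.log(rng.random()) < log_alpha:
            x = x_star
        draws.append(x)
    return draws

draws = log_scale_mh(R=30000, x0=1.0, sigma=1.0)
```

Note that every proposal is automatically positive, so no computational effort is wasted on moves that would be rejected for leaving the support.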
8b- Recall the transformation Jacobian! Part 2
The discussion in 8a above treats the specific case of using a random walk proposal to target a distribution with positive support. How about different types of constraints? My colleague at Lund University Johan Lindström has written a compact and useful reference table, listing common transformations of constrained proposals generated via random walk.
9- Check the starting value of $\pi$
Generally it is not really necessary to place checks in your code for infelicities produced by the MCMC sampler. That is, it is not necessary to check, for each proposed $x^*$, whether $\pi(x^*)$ takes complex values, or results in a NaN (not-a-number), or that you do not get, say, $\pi(x^*) = \infty$ (yes, these things can happen if you code $\pi$ yourself without caution or when $\pi$ is the result of some approximation, e.g. the expression for $\pi$ is not analytically known).
In fact, in these cases the uniform draw $u$ in point 1 above will not be considered smaller than $\alpha$: the comparison will evaluate to a FALSE logical statement, and the proposed $x^*$ will be rejected, as it should be.
However, an exception to the above is when we evaluate $\pi$ at the starting value $x_0$. In this case we do want to check that $\pi(x_0)$ is not something unwanted. For example, say that $\pi(x_0) = \infty$ (or that $\pi(x_0)$ is a NaN) and we just put this value at the denominator of the ratio in the definition of $\alpha$ (since this is the starting value it automatically goes into the denominator). Then at the first proposed $x^*$ we have (assume $\pi(x^*)$ is finite)

$\frac{\pi(x^*)}{\pi(x_0)} = \frac{\pi(x^*)}{\infty} = 0.$

Clearly no $x^*$ will ever be accepted, therefore the value of $\pi(x_0)$ at the denominator will stay there for all iterations and the sampler will never move away from $x_0$.
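In Python, such a startup check can be as simple as the following sketch (the Exponential target and the names are illustrative):

```python
import math

def logpi(x):
    # illustrative log-target: Exponential(1), which is -infinity
    # outside the support x > 0
    return -x if x > 0 else -math.inf

def check_start(x0):
    """Refuse to start the chain from a state where the log-target
    is not a finite number (NaN, -inf or +inf)."""
    lp0 = logpi(x0)
    if not math.isfinite(lp0):
        raise ValueError(f"bad starting value x0={x0}: log pi(x0)={lp0}")
    return lp0

lp0 = check_start(1.0)       # fine: the chain can start here
```

A starting value outside the support, such as `check_start(-1.0)`, is caught immediately instead of producing a chain that never moves.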