I will suggest several tips and discuss common beginners' mistakes occurring when coding a Metropolis-Hastings algorithm from scratch. I am writing this list off the top of my head, so feel free to propose additions by commenting on this post. Most items are related to coding practice rather than actual statistical methodology, and are often not mentioned in widely available literature. If you are an instructor, you might find it useful to mention these issues in your class, to avoid students getting puzzled when a mathematically flawless algorithm misbehaves because of its implementation.
Notice that this post does not discuss the design of a Metropolis-Hastings sampler: that is, important issues affecting MCMC, such as poor mixing of the chain, multimodality, or the construction of "ad hoc" proposal distributions, are not considered.
I am assuming the reader already knows what a Metropolis-Hastings algorithm is, what it is useful for, and that she has already seen (and perhaps tried coding) some implementation of this algorithm. Even given this background knowledge, there are some subtleties that are good to know and that I will discuss below.
Briefly, what we wish to accomplish with Markov chain Monte Carlo (MCMC) is to sample (approximately) from a distribution having density or probability mass function $\pi(\theta)$. Here we focus on the Metropolis-Hastings algorithm (MH). In order to do so, assume that our Markov chain is currently in "state" $\theta$, and we wish to propose a move $\theta \rightarrow \theta^*$. This move is generated from a proposal kernel $q(\theta^*|\theta)$. We write the proposed value $\theta^*$ as $\theta^* \sim q(\theta^*|\theta)$. Then $\theta^*$ is accepted with probability $\alpha$ or rejected with probability $1-\alpha$, where

$$\alpha = \min\left(1, \frac{\pi(\theta^*)\,q(\theta|\theta^*)}{\pi(\theta)\,q(\theta^*|\theta)}\right).$$

Then $\alpha$ is called the acceptance probability.
A very simplified pseudo-algorithm for a single step of MH looks like:

propose $\theta^* \sim q(\theta^*|\theta)$
accept $\theta^*$ with probability $\alpha$, otherwise reject it
However, since we wish to sample say $R$ times from $\pi(\theta)$, we had better store the results of our simulations into a matrix draws and write (assuming an arbitrary initial state $\theta_0$)

for r = 1 to R
   propose $\theta^* \sim q(\theta^*|\theta)$
   compute $\alpha$
   accept $\theta^*$ with probability $\alpha$, then store it into draws[r] := $\theta^*$, set $\theta := \theta^*$, set $\pi(\theta) := \pi(\theta^*)$
   otherwise store draws[r] := $\theta$
end

Here $:=$ means "copy the value at the right-hand side into the left-hand side". After a few (sometimes many) iterations the chain reaches stationarity, and its stationary distribution happens to have $\pi(\theta)$ as probability density (or probability mass function). In the end we will explore the content of draws and disregard some of the initial values, to wash away the influence of the arbitrary $\theta_0$.
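To make the pseudo-code concrete, here is a minimal sketch in R of one possible implementation, under the assumption of a scalar $\theta$ and a symmetric Gaussian random-walk proposal (so that the $q$ terms cancel, see point 7 below); the function target() and the arguments theta0, sigma and R are placeholders you would supply yourself:

mh_sketch <- function(target, theta0, sigma, R) {
  draws <- numeric(R)               # storage for the chain
  theta <- theta0                   # current state
  target_theta <- target(theta)     # target evaluated at the current state
  for (r in 1:R) {
    theta_star <- theta + sigma * rnorm(1)        # propose a move
    target_star <- target(theta_star)
    alpha <- min(1, target_star / target_theta)   # acceptance probability
    if (runif(1) < alpha) {         # accept with probability alpha (see point 1)
      theta <- theta_star
      target_theta <- target_star   # save the accepted target value (see point 3)
    }
    draws[r] <- theta               # store the current state whether we accepted or not (see point 2)
  }
  draws
}

For instance, draws <- mh_sketch(function(x) exp(-x^2 / 2), theta0 = 0, sigma = 1, R = 10000) targets a standard Gaussian known up to a constant (see point 6), although for realistic problems you will want the log-domain version discussed in points 4-5.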
1- You are given the pseudo-code above but it's still unclear how to use it for sampling from $\pi(\theta)$
The very first time I heard of MH (self-study, with no guidance available) was in 2008. So I thought "ok, let's google some article!", and I recall that all the resources I happened to screen simply stated (just as I wrote above) "accept the move with probability $\alpha$ and reject it otherwise". Now, if you are new to stochastic simulation the quoted sentence looks quite mysterious. The detail that was missing in my literature search was the logical link between the concept of acceptance probability and that of sampling.
Here is the link: you have to decide randomly between the two possible outcomes, accept or reject the move, and each of these outcomes has its own probability ($\alpha$ for accepting and $1-\alpha$ for rejecting). So our decision is a random variable taking two possible values, hence a Bernoulli random variable with "parameter" $\alpha$. And here is the catch: to sample from a discrete distribution (like a Bernoulli distribution) you can use the inverse transform method, which in the special case of a binary decision translates to (i) sample a pseudo-random number $u$ from a uniform distribution on the interval $[0,1]$, that is $u \sim U(0,1)$, which can be obtained in R via runif(1), and (ii) if $u < \alpha$ accept $\theta^*$, otherwise reject it. That's it (*).
(*) Well, yes, that's it. However, the implementation can be simplified to just: if $u < \frac{\pi(\theta^*)\,q(\theta|\theta^*)}{\pi(\theta)\,q(\theta^*|\theta)}$ then accept $\theta^*$, otherwise reject it. This is because $u \le 1$, hence whenever the ratio is larger than 1 we certainly have $u$ smaller than the ratio, so the $\min(1,\cdot)$ in the definition of $\alpha$ plays no role in the comparison.
In conclusion, for a single MH step we substitute (and for simplicity here we remove the dependency on the iteration index $r$ we introduced in the previous code)

compute $\alpha$
accept $\theta^*$ with probability $\alpha$

with

simulate $u \sim U(0,1)$
if $u < \frac{\pi(\theta^*)\,q(\theta|\theta^*)}{\pi(\theta)\,q(\theta^*|\theta)}$ then accept $\theta^*$, set $\theta := \theta^*$ and set $\pi(\theta) := \pi(\theta^*)$
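If the link between the acceptance probability and the uniform draw still feels abstract, a quick empirical check in R shows that comparing $u$ with $\alpha$ is nothing but simulating a Bernoulli($\alpha$) variable (the value 0.3 below is arbitrary):

alpha <- 0.3                        # some acceptance probability
accepts <- runif(10000) < alpha     # 10000 accept/reject decisions via the inverse transform
mean(accepts)                       # proportion of acceptances, close to 0.3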
2- Do not keep only the accepted draws!
A common mistake is not advancing the iteration counter until a proposal gets accepted. This is wrong (hence if you ever coded a MH algorithm using a while loop it is fairly likely you acted the wrong way): every iteration must produce a draw, and repeating the current value after a rejection is exactly what gives the chain its correct stationary distribution. In the implementation given in the introduction to this post (the one using the for loop) you may notice that when a proposal is rejected I store the last accepted value again, as draws[r] := $\theta$.
3- Save and reuse the values at the denominator
If you compare the pseudo-codes suggested above you may have noticed that when a proposal is accepted I "save" $\pi(\theta^*)$ (the instruction "set $\pi(\theta) := \pi(\theta^*)$"), in addition to storing the accepted proposal $\theta^*$. While it is not necessary to do the former (and in fact most descriptions of MH in the literature do not mention this part), it is a waste of computer time not to keep the accepted $\pi(\theta^*)$ in memory, so that we do not need to recompute it at the next iteration (it is reused at the denominator of $\alpha$).
This is especially relevant when MH is used for Bayesian inference, where $\pi(\theta)$ is a posterior distribution $\pi(\theta|y)$ for data $y$, such that $\pi(\theta|y) \propto p(y|\theta)\,p(\theta)$ for some "prior" $p(\theta)$ and "likelihood function" $p(y|\theta)$. In fact, it is often the case that the pointwise evaluation of the likelihood is far from instantaneous, e.g. because the data consist of thousands of values, or because the evaluation of the likelihood itself requires some numerical approximation.
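As a sketch of what this caching looks like for a Bayesian target (working in the log-domain, see point 4; logprior() and loglik() are hypothetical user-supplied functions, and data, theta, sigma, R and draws are assumed to be already defined):

logpost <- logprior(theta) + loglik(data, theta)   # computed once, before the loop
for (r in 1:R) {
  theta_star <- theta + sigma * rnorm(1)
  logpost_star <- logprior(theta_star) + loglik(data, theta_star)  # the only expensive evaluation per iteration
  if (log(runif(1)) < logpost_star - logpost) {    # symmetric proposal assumed, see point 7
    theta <- theta_star
    logpost <- logpost_star          # reused at the denominator in the next iteration
  }
  draws[r] <- theta
}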
4- Code your calculations in the log-domain.
Depending on the case, $\pi(\theta)$ may take very large or very small values. Such extreme values may cause a numerical overflow or underflow, that is, your computing environment will be unable to represent these numbers correctly. For example, a standard Gaussian density is strictly positive for every real $x$, but if you evaluate it at $x = 40$ your software will likely return zero (this is an underflow). For example, in R type dnorm(40) and you will see it returns zero.
Therefore, instead of using

if $u < \frac{\pi(\theta^*)\,q(\theta|\theta^*)}{\pi(\theta)\,q(\theta^*|\theta)}$ accept $\theta^*$, otherwise reject it

use

if $\log u < \log\pi(\theta^*) + \log q(\theta|\theta^*) - \log\pi(\theta) - \log q(\theta^*|\theta)$ accept $\theta^*$, otherwise reject it.

Anyhow, while this recommendation is valid, it might not be enough without further precautions, as described in the next point.
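A toy illustration in R of why the log-domain comparison is safer; the two log-density values below are made up, but they are of a magnitude that routinely occurs with large datasets:

logpi_star <- -900.2     # log-density at the proposal
logpi_curr <- -901.7     # log-density at the current state
exp(logpi_star) / exp(logpi_curr)          # NaN: both densities underflow to 0 and we compute 0/0
log(runif(1)) < logpi_star - logpi_curr    # the same comparison on the log scale works fine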
5- Do as many analytic calculations as feasible in order to prevent overflow/underflow.
Strongly connected to the previous point is coding $\log\pi(\theta)$ without doing your calculations analytically first. That is, I see beginners trying to use the log-domain suggestion from point 4 by evaluating something like log(dnorm(x)). This will not help, because the log is applied to a value that has already underflown, and the logarithm of zero is taken.
In R you can use dnorm(x, log = TRUE) to compute $\log\bigl(\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\bigr) = -\frac{1}{2}\log(2\pi) - \frac{x^2}{2}$, which is the safe way to evaluate the (normalised) standard Gaussian log-density. For other languages you may need to code the log-density yourself.
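In R the difference between the naive and the analytic approach is immediate:

log(dnorm(40))                   # -Inf: dnorm(40) has already underflown to 0
dnorm(40, log = TRUE)            # -800.9189: the log-density computed analytically
-0.5 * log(2 * pi) - 40^2 / 2    # same value, coded by hand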
However, since we are interested in using MH, we may even just code the unnormalised log-density

$$\log\pi(x) = -\frac{x^2}{2} + \mathrm{const},$$

where I have omitted the (irrelevant for MH) normalization constant $-\frac{1}{2}\log(2\pi)$. This brings us to the next point.
6- Remove proportionality constants.
An important feature of MH is that it can sample from distributions that are known only up to proportionality constants (terms independent of $\theta$). This means that $\pi(\theta)$ can be an unnormalised density (or an unnormalised probability mass function). As mentioned in the previous point, we do not strictly need to code $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ but can get rid of the normalization constant $\frac{1}{\sqrt{2\pi}}$, which anyhow simplifies out when taking the ratio $\pi(\theta^*)/\pi(\theta)$.
The simplification of proportionality constants is not just a matter of coding elegance, it can affect the performance of the algorithm. For example, if $\pi(\theta)$ is a Gamma distribution, say $\pi(\theta) = \frac{b^{a}}{\Gamma(a)}\,\theta^{a-1}e^{-b\theta}$ with shape $a>0$ and rate $b>0$, then $\pi(\theta)$ contains the Gamma function $\Gamma(a)$ as part of the proportionality constant. When $a$ is not an integer but a positive real number, the evaluation of $\Gamma(a)$ involves the numerical approximation of an integral. Now, this approximation is not onerous if executed only a few times. However, MCMC algorithms are often run for millions of iterations, and in such cases we can save computational time by removing constants we do not need anyway. In conclusion, you can safely avoid coding $\frac{b^{a}}{\Gamma(a)}$ (or its logarithm), as it is independent of $\theta$.
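As a sketch, the unnormalised log-density for this Gamma example could be coded in R as follows (shape a and rate b are placeholders):

logpi_unnorm <- function(theta, a, b) {
  (a - 1) * log(theta) - b * theta    # a*log(b) - lgamma(a) is dropped: it does not depend on theta
}

The normalised version, dgamma(theta, shape = a, rate = b, log = TRUE), differs from this only by the constant a*log(b) - lgamma(a), which cancels in the acceptance ratio.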
7- Simplify out symmetric proposal functions
A symmetric proposal function is such that $q(\theta^*|\theta) = q(\theta|\theta^*)$ for every $\theta$ and $\theta^*$. Therefore symmetric proposals do not need to be coded at all when evaluating the ratio in $\alpha$, as they simplify out.
For example, assume a multivariate Gaussian proposal function written as (with some abuse of notation) $\theta^* \sim N(\theta, \Sigma)$, that is, a Gaussian density with mean $\theta$ (the current state) and covariance matrix $\Sigma$. Then we have

$$q(\theta^*|\theta) = \frac{1}{\sqrt{(2\pi)^{d}\,|\Sigma|}}\exp\Bigl(-\tfrac{1}{2}(\theta^*-\theta)^{T}\Sigma^{-1}(\theta^*-\theta)\Bigr),$$

where $|\cdot|$ denotes the determinant of a matrix, $d$ is the dimension of $\theta$ and $^{T}$ denotes transposition. Clearly this $q$ is symmetric in $\theta$ and $\theta^*$, because of the quadratic form in the exponent.
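A quick univariate sanity check of this symmetry in R (the numbers are arbitrary):

theta <- 1.3; theta_star <- 0.4; sigma <- 0.5
dnorm(theta_star, mean = theta, sd = sigma)   # q(theta_star | theta)
dnorm(theta, mean = theta_star, sd = sigma)   # q(theta | theta_star): the same number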
8a- Remember the transformation Jacobians!
Most previous recommendations are related to numerical issues in the implementation of MH. However, this one alerts you to an actual statistical mistake. I introduce the problem using a specific example.
A Gaussian proposal function $q(\theta^*|\theta)$ is a simple choice that often works well in problems where $\theta$ is low dimensional (e.g. $\theta$ is a vector of length up to, say, five elements) and the geometry of $\pi(\theta)$ is not too complicated. The fact that a Gaussian proposal has support on the entire real space makes it a useful default choice. However, when some elements of $\theta$ are constrained to "live" in a specific subset of the real space, we should be aware that a significant proportion of proposed values will be rejected (unless we "design" $q$ to intelligently explore only a promising region of the space). Wasting computational time is not something we want, right?
For the sake of illustration, let's assume that $\theta$ is a positive scalar random variable (see point 8b below for a more general discussion). We could code a Metropolis random walk sampler with Gaussian innovations as

$$\theta^* = \theta + \sigma\xi, \qquad \xi \sim N(0,1),$$

where the "step length" $\sigma > 0$ is chosen by the experimenter and $\xi$ is sampled from a standard Gaussian. Therefore we have that $\theta^* \sim N(\theta, \sigma^2)$. Clearly $\theta^*$ will certainly be rejected whenever $\theta^* \le 0$, since we are assuming that the support of the targeted $\pi(\theta)$ is the positive axis.
A simple workaround is to propose $\log\theta^*$ instead of $\theta^*$, that is, code

$$\log\theta^* = \log\theta + \sigma\xi, \qquad \xi \sim N(0,1),$$

for some other appropriate $\sigma$. Then we evaluate $\pi$ at $\theta^* = \exp(\log\theta^*)$, that is, we exponentiate the proposal before "submitting" it to $\pi$. Now, if this is all we do, we have a problem.
The thing is, we proposed from a Gaussian, and then we thought that exponentiating the proposal is just fine. It is almost fine, but we must realise that the exponential of a Gaussian draw is log-normal distributed, and we must account for this fact in the acceptance ratio! In other words, we sampled $\log\theta^*$ using a proposal function that here I denote $q(\log\theta^*\,|\,\log\theta)$ (which is Gaussian), and then by exponentiating we effectively sampled $\theta^*$ from another proposal $\tilde{q}(\theta^*|\theta)$ (which is log-normal).
Therefore, from the definition of the log-normal density we have (recall that in this example $\theta$ is scalar)

$$\tilde{q}(\theta^*|\theta) = \frac{1}{\theta^*\,\sigma\sqrt{2\pi}}\exp\Bigl(-\frac{(\log\theta^*-\log\theta)^2}{2\sigma^2}\Bigr),$$

or equivalently $\tilde{q}(\theta^*|\theta) = q(\log\theta^*\,|\,\log\theta)/\theta^*$.
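This identity can be checked numerically in R with the built-in log-normal density (the values below are arbitrary):

theta <- 2.0; theta_star <- 3.5; sigma <- 0.3
dlnorm(theta_star, meanlog = log(theta), sdlog = sigma)              # the effective proposal density q~(theta_star | theta)
dnorm(log(theta_star), mean = log(theta), sd = sigma) / theta_star   # Gaussian density of the log, divided by theta_star: same number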
Then we notice that this proposal function is not symmetric (see point 7): $\tilde{q}(\theta^*|\theta) \neq \tilde{q}(\theta|\theta^*)$. Hence the correct sampler does the following: simulate $u \sim U(0,1)$ and, using all the recommendations above, accept $\theta^*$ if

$$\log u < \log\pi(\theta^*) - \log\pi(\theta) + \log\theta^* - \log\theta,$$

and reject $\theta^*$ otherwise. (The Gaussian parts of $\tilde{q}$ are symmetric and cancel in the ratio; what survives is the ratio of the factors $1/\theta$ and $1/\theta^*$, that is $\theta^*/\theta$.)
But how would we proceed if we were not given the information that the exponential of a Gaussian variable is a log-normal variable? A simple result on the transformation of random variables assists us when the transformation $g$ is an invertible function. Say that for a generic random variable $X$ with density $p_X$ we seek the distribution of $Y = g(X)$; by the standard transformation rule, $Y$ has density function

$$p_Y(y) = p_X\bigl(g^{-1}(y)\bigr)\,\Bigl|\frac{d\,g^{-1}(y)}{dy}\Bigr|,$$

where $\bigl|\frac{d\,g^{-1}(y)}{dy}\bigr|$ is the "Jacobian of the transformation".
Therefore, we should not forget the Jacobians when transforming the proposed draw. In our example above $g(\cdot) = \exp(\cdot)$, hence $g^{-1}(\theta^*) = \log\theta^*$ and the Jacobian is $1/\theta^*$; the ratio $\theta^*/\theta$ appearing in the acceptance probability is precisely the ratio of Jacobians.
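Putting the pieces together, a single corrected MH step for a positive scalar $\theta$ could look like the following sketch, where logtarget() is a hypothetical user-supplied log-density with positive support, sigma is the step length on the log scale, and logtarget_theta is the stored value from point 3:

theta_star <- exp(log(theta) + sigma * rnorm(1))   # Gaussian random walk on the log scale, then exponentiate
logtarget_star <- logtarget(theta_star)
# acceptance check: the symmetric Gaussian parts cancel, but the log-Jacobian term log(theta_star) - log(theta) stays
if (log(runif(1)) < logtarget_star - logtarget_theta + log(theta_star) - log(theta)) {
  theta <- theta_star
  logtarget_theta <- logtarget_star
}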
8b- Recall the transformation Jacobian! Part 2
The discussion in 8a above treats the specific case of using a random walk proposal to target a distribution with positive support. How about different types of constraints? My colleague at Lund University, Johan Lindström, has written a compact and useful reference table listing common transformations for constrained proposals generated via random walk.
See also an excellent and detailed post by Darren Wilkinson.
9- Check the starting value of $\pi(\theta)$
Generally it is not really necessary to place checks in your code for infelicities produced by the MCMC sampler. That is, it is not necessary, for each proposed $\theta^*$, to check for complex values, or to check that $\log\pi(\theta^*)$ does not result in a NaN (not-a-number), or that you do not get, say, $\log\pi(\theta^*) = -\infty$ (yes, these things can happen if you code $\pi$ yourself without caution, or when $\pi$ is the result of some approximation, e.g. its expression is not analytically known).
In fact, in these cases the uniform draw $u$ from point 1 above will not be considered smaller than the acceptance ratio. So the comparison $\log u < \log\pi(\theta^*) + \log q(\theta|\theta^*) - \log\pi(\theta) - \log q(\theta^*|\theta)$ will evaluate to a logical FALSE and the proposed $\theta^*$ will be rejected, as it should be.
However, an exception to the above is when we evaluate $\pi$ at the starting value $\theta_0$. In this case we do want to check that $\pi(\theta_0)$ is not something unwanted. For example, say that $\log\pi(\theta_0)$ evaluates to NaN (or that $\pi(\theta_0)$ does), and we just put this value at the denominator of the ratio in the definition of $\alpha$ (since this is the starting value, it automatically goes into the denominator). Then for the first proposed $\theta^*$ we have (assume $\log\pi(\theta^*)$ is finite)

$$\log\pi(\theta^*) + \log q(\theta_0|\theta^*) - \log\pi(\theta_0) - \log q(\theta^*|\theta_0) = \mathrm{NaN},$$

and the comparison of $\log u$ with a NaN never evaluates to a logical TRUE. Clearly no $\theta^*$ will ever be accepted, therefore the value of $\log\pi$ at the denominator will stay NaN for all iterations, and the sampler will never move away from $\theta_0$.
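A cheap safeguard, as a sketch in R, is to evaluate the log-target once at $\theta_0$ before entering the loop and stop immediately if the value is unusable (logtarget() and theta0 are placeholders):

logtarget_theta <- logtarget(theta0)
if (!is.finite(logtarget_theta)) {    # catches NaN, NA, Inf and -Inf
  stop("the log-target at the starting value is ", logtarget_theta, ": pick a different theta0")
}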