Tips for coding a Metropolis-Hastings sampler

I will suggest several tips and discuss common beginners' mistakes that occur when coding a Metropolis-Hastings algorithm from scratch. I am making this list off the top of my head, so feel free to propose additions by commenting on this post. Most items are related to coding practice rather than statistical methodology, and are often not mentioned in the widely available literature. If you are an instructor, you might find it useful to mention these issues in class, to avoid students getting puzzled when a mathematically flawless algorithm misbehaves because of its implementation.

Notice that this post does not discuss the design of a Metropolis-Hastings sampler: important issues affecting MCMC, such as bad mixing of the chain, multimodality, or the construction of "ad hoc" proposal distributions, are not considered.

I am assuming the reader already knows what a Metropolis-Hastings algorithm is, what it is useful for, and that she has already seen (and perhaps tried coding) some implementation of this algorithm. Even given this background knowledge, there are some subtleties that are good to know and that I will discuss below.

Briefly, what we wish to accomplish with Markov chain Monte Carlo (MCMC) is to sample (approximately) from a distribution having density or probability mass function \pi(x). Here we focus on the Metropolis-Hastings algorithm (MH). In order to do so, assume that our Markov chain is currently in “state” x, and we wish to propose a move x\rightarrow x'. This move is generated from a proposal kernel q(\cdot|x). We write the proposed value x' as x'\sim q(x'|x). Then x' is accepted with probability \alpha or rejected with probability 1- \alpha where

\displaystyle  \alpha=\min\biggl\{1,\frac{\pi(x')}{\pi(x)}\frac{q(x|x')}{q(x'|x)}\biggr\} .

The quantity \alpha is called the acceptance probability.

A very simplified pseudo-algorithm for a single step of MH looks like:

propose x'\sim q(x'|x)
accept x' with probability \alpha
otherwise reject x'

However, since we wish to sample, say, R times from \pi(x), we had better store the results of our simulations into a matrix \texttt{draws} and write (assuming an arbitrary initial state x_0)

for r = 1 to R
   propose x'\sim q(x'|x_{r-1}) 
   compute \alpha=\min\biggl\{1,\frac{\pi(x')}{\pi(x_{r-1})}\frac{q(x_{r-1}|x')}{q(x'|x_{r-1})}\biggr\} 
   accept x' with probability \alpha 
          then store it into draws[r]:= x'
          set \pi(x_r):=\pi(x'), set x_r:=x' 
   otherwise store draws[r]:= x_{r-1}

Here \texttt{:=} means "copy the value on the right-hand side into the left-hand side". After a few (sometimes many) iterations the chain reaches stationarity, and its stationary distribution has \pi(x) as probability density (or probability mass function). In the end we will explore the content of \texttt{draws} and disregard some of the initial values, to wash away the influence of the arbitrary x_0.
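The pseudo-algorithm above can be turned into a short, self-contained program. The post's inline examples are in R, but here is a Python sketch (the names mh_sample, log_pi and delta are mine, not from the post); it also anticipates the log-domain and caching advice given in points 3 and 4 below, and uses a symmetric Gaussian random walk so the q-ratio cancels (point 7).

```python
import math
import random

def mh_sample(log_pi, x0, R, delta=1.0, seed=0):
    """Metropolis random-walk sampler mirroring the pseudo-code above.

    log_pi : function returning the (possibly unnormalised) log target density
    x0     : arbitrary starting state
    R      : number of iterations
    delta  : proposal standard deviation (a tuning parameter)
    """
    rng = random.Random(seed)
    draws = [0.0] * R
    x, log_pi_x = x0, log_pi(x0)                  # cache the denominator (point 3)
    for r in range(R):
        x_prop = x + delta * rng.gauss(0.0, 1.0)  # symmetric proposal: q cancels
        log_pi_prop = log_pi(x_prop)
        u = rng.random()
        if math.log(u) < log_pi_prop - log_pi_x:  # accept with probability alpha
            x, log_pi_x = x_prop, log_pi_prop
        draws[r] = x                              # on rejection, store x again (point 2)
    return draws

# Example: sample from a standard Gaussian via its unnormalised log-density
draws = mh_sample(lambda x: -0.5 * x * x, x0=3.0, R=20000)
```

Note how every iteration stores a value in draws, whether or not the proposal was accepted; this is exactly the point made in section 2 below.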

1- You are given the pseudo-code above but it’s still unclear how to use it for sampling from \pi(x)

The very first time I heard of MH (self-study, with no guidance available) was in 2008. I thought "ok, let's google some article!", and I recall that all the resources I happened to screen simply stated (just as I wrote above) "accept the move with probability \alpha and reject otherwise". Now, if you are new to stochastic simulation, the quoted sentence looks quite mysterious. The detail missing from my literature search was the logical link between the concept of acceptance probability and that of sampling.

Here is the link: you have to decide randomly between two possible outcomes: accept or reject the move, and each of these two outcomes has its own probability (\alpha for accepting, 1-\alpha for rejecting). So our decision is a random variable taking two possible values, hence a Bernoulli random variable with "parameter" \alpha. And here is the catch: to sample from a discrete distribution (like a Bernoulli distribution) you can use the inverse transform method, which in the special case of a binary decision translates to: (i) sample a pseudo-random number u from a uniform distribution on the interval (0,1), that is u\sim U(0,1), which can be obtained in R via \texttt{runif(1)}; and (ii) if u<\alpha accept x', otherwise reject it. That's it (*).

(*) Well, yes, that's it. However the implementation can be simplified to just: if u<\frac{\pi(x')}{\pi(x)}\frac{q(x|x')}{q(x'|x)} then accept x', otherwise reject it. This is because u\in (0,1), hence whenever \pi(x')q(x|x')/\pi(x)q(x'|x) is larger than 1, then certainly u< \pi(x')q(x|x')/\pi(x)q(x'|x).
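The inverse-transform recipe above can be checked empirically: draw many uniforms, count how often u<\alpha holds, and the empirical acceptance frequency should recover \alpha. A minimal Python sketch (the value of alpha and the seed are arbitrary choices of mine):

```python
import random

alpha = 0.3                      # an arbitrary acceptance probability
rng = random.Random(1)           # seeded for reproducibility
n = 100000

# "u < alpha" realises a Bernoulli(alpha) decision via inverse transform
accepted = sum(1 for _ in range(n) if rng.random() < alpha)
freq = accepted / n              # should be close to alpha
```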

In conclusion, for a single MH step we substitute (for simplicity we here drop the dependency on the iteration index r introduced in the previous code)

compute \alpha=\min\biggl\{1,\frac{\pi(x')}{\pi(x)}\frac{q(x|x')}{q(x'|x)}\biggr\} 
accept x' with probability \alpha

with


simulate u\sim U(0,1)
if u<\frac{\pi(x')}{\pi(x)}\frac{q(x|x')}{q(x'|x)}
   then \pi(x):=\pi(x') and x:=x'

2- Do not keep only the accepted draws!

A common mistake is not advancing the iteration counter until a proposal is accepted. This is wrong (hence, if you ever coded an MH algorithm using a \texttt{while} loop, it is fairly likely you got it wrong). Rejections are part of the chain: whenever a proposal is rejected, the current state must be stored again, otherwise the resulting draws do not target \pi(x). In the implementation given in the introduction to this post (the one using the \texttt{for} loop) you may notice that when a proposal is rejected I store the last accepted value as \texttt{draws[r]}:= x_{r-1}.

3- Save and reuse the values in the denominator

If you compare the pseudo-codes suggested above you may have noticed that when a proposal is accepted I "save" \pi(x):=\pi(x'), in addition to storing the accepted proposal x'. While it is not necessary to do the former (and in fact most descriptions of MH in the literature do not mention this part), it is a waste of computer time not to keep the accepted \pi(x') in memory, so that we need not recompute it at the next iteration (it is reused in the denominator of \alpha).

This is especially relevant when MH is used for Bayesian inference, where \pi(x) is a posterior distribution \pi(x|\mathrm{data}) such that \pi(x|\mathrm{data})\propto p(\mathrm{data}|x)p(x) for some “prior” p(x) and “likelihood function” p(\mathrm{data}|x). In fact, it is often the case that the pointwise evaluation of the likelihood is far from being instantaneous, e.g. because the data has thousands of values, or the evaluation of the likelihood itself requires some numerical approximation.

4- Code your calculations on the log-domain.

Depending on the case, \pi(x) may sometimes take very large or very small values. Such extreme values may cause a numeric overflow or underflow, meaning that your computing environment will be unable to represent these numbers correctly. For example, a standard Gaussian density is strictly positive for every real x, but if you evaluate it at x=40 your software will likely return zero (an underflow).

For example, in R, type \texttt{dnorm(40)} and you'll see it returns zero.

Therefore instead of using

if u<\frac{\pi(x')}{\pi(x)}\frac{q(x|x')}{q(x'|x)} accept x' 
   otherwise reject

use


if \log(u)<\log\pi(x')-\log \pi(x)+\log q(x|x')-\log q(x'|x) accept x' 
   otherwise reject

Anyhow, while this recommendation is valid, it might not be enough without further precautions, as described in the next point.
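To see concretely why the log-domain comparison matters, here is a Python sketch (the test points x=40 and x'=39 are my arbitrary choices): both densities underflow to zero, so the naive ratio would be 0/0, while the difference of log-densities is perfectly well behaved.

```python
import math

# Unnormalised standard-Gaussian log-density (see points 5 and 6)
log_pi = lambda x: -0.5 * x * x

x, x_prop = 40.0, 39.0

num = math.exp(log_pi(x_prop))           # underflows to 0.0
den = math.exp(log_pi(x))                # underflows to 0.0: naive ratio is 0/0
log_ratio = log_pi(x_prop) - log_pi(x)   # = 39.5, finite and usable in log(u) < log_ratio
```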

5- Do as many analytic calculations as feasible in order to prevent overflow/underflow.

Strongly connected to the previous point is the mistake of coding \log(\pi(x)) without first doing the calculations analytically. That is, I see beginners trying to apply the log-domain suggestion from point 4 by evaluating \texttt{log(dnorm(40))}. This will not help, because the log is applied to \texttt{dnorm(40)}, which has already underflown, so the logarithm of zero is taken.

In R you can use \texttt{dnorm(40,log=TRUE)} to compute \mathrm{logpdf}=-0.5\log(2\pi)-0.5x^2, which is the safe way to evaluate the (normalised) standard Gaussian log-density. For other languages you may need to code the log-density yourself.
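The same safe evaluation can be coded directly in other languages. Here is a rough Python analogue of R's \texttt{dnorm(x, log=TRUE)} (the function name norm_logpdf is mine):

```python
import math

def norm_logpdf(x):
    # Direct, analytic evaluation of the standard Gaussian log-density,
    # analogous to R's dnorm(x, log=TRUE): never forms the raw density.
    return -0.5 * math.log(2 * math.pi) - 0.5 * x * x

safe = norm_logpdf(40.0)   # about -800.92: finite and exact to double precision
pdf = math.exp(safe)       # the raw density underflows to 0.0
# math.log(pdf) would raise ValueError here: log of an underflowed zero
```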

However, since we are interested in using MH, we may even just code the unnormalised log-density

\texttt{logpdf}=-0.5x^2

where I have omitted the (irrelevant for MH) normalization constant -0.5\log(2\pi). This brings us to the next point.

6- Remove proportionality constants.

An important feature of MH is that it can sample from distributions that are known up to proportionality constants (these constants are terms independent of x). This means that \pi(x) can be an unnormalised density (or an unnormalised probability mass function). As mentioned in the previous point, we do not strictly need to write   \texttt{logpdf}=-0.5\log(2\pi)-0.5x^2 but can get rid of the normalization constant -0.5\log(2\pi) that anyhow simplifies out when taking the ratio \pi(x')/\pi(x).

The simplification of proportionality constants is not just a matter of coding elegance, but can affect the performance of the algorithm. For example, if \pi(x) is a Gamma distribution, say x\sim \mathrm{Gamma}(a,b), then \pi(x) contains the Gamma function \Gamma(a) as part of the proportionality constant. When a is not an integer but a positive real number, the evaluation of \Gamma(a) involves the numerical approximation of an integral. Now, this approximation is not onerous if executed only a few times. However, MCMC algorithms are often run for millions of iterations, and in such cases we can save computational time by removing constants we do not need anyway. In conclusion, you can safely avoid coding \Gamma(a) (or its logarithm), as it is independent of x.
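As a quick Python check (the shape a and rate b are arbitrary values I picked, with a non-integer), the log-ratio computed from the full Gamma(a,b) log-density agrees exactly with the one computed after dropping every term independent of x, including \log\Gamma(a):

```python
import math

a, b = 2.5, 1.3   # assumed shape and rate; a is deliberately non-integer

def log_gamma_full(x):
    # Full Gamma(a, b) log-density, normalising constant included
    return a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(x) - b * x

def log_gamma_unnorm(x):
    # Constants independent of x dropped: this is all the MH ratio needs
    return (a - 1) * math.log(x) - b * x

x, x_prop = 1.0, 2.0
full_ratio = log_gamma_full(x_prop) - log_gamma_full(x)
unnorm_ratio = log_gamma_unnorm(x_prop) - log_gamma_unnorm(x)
# The two log-ratios coincide: the constants cancel in pi(x')/pi(x)
```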

7- Simplify out symmetric proposal functions

A symmetric proposal function is such that q(x'|x)\equiv q(x|x') for every x and x'. Therefore symmetric proposals do not need to be coded when evaluating the ratio in \alpha as they simplify out.

For example, assume a multivariate Gaussian proposal function written as (with some abuse of notation) q(x'|x)=\mathcal{N}(x,\Sigma), that is a Gaussian density with mean x and covariance matrix \Sigma. Then we have

q(x'|x)=(2\pi)^{-d/2}|\Sigma|^{-1/2}\exp\Bigl(-\frac{1}{2}(x'-x)^T\Sigma^{-1}(x'-x)\Bigr)

where d is the dimension of x, |\cdot| denotes the determinant of a matrix and T denotes transposition. Clearly this q(\cdot) is symmetric, because the quadratic form in the exponent is unchanged when x and x' are swapped.
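A quick numerical sanity check of this symmetry, sketched in Python for the scalar case (the value of delta and the two evaluation points are arbitrary choices of mine):

```python
import math

def q(xp, x, delta=0.7):
    # Gaussian random-walk proposal density N(x, delta^2), evaluated at xp
    return math.exp(-0.5 * ((xp - x) / delta) ** 2) / (delta * math.sqrt(2 * math.pi))

# Symmetry: q(x'|x) == q(x|x'), so the proposal ratio in alpha is exactly 1
forward = q(1.7, 0.4)
backward = q(0.4, 1.7)
```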

8a- Remember the transformation Jacobians!

Most previous recommendations are related to numerical issues in the implementation of MH. However this one alerts you of an actual statistical mistake. I introduce the problem using a specific example.

A Gaussian proposal function q(\cdot|x) is a simple choice that often works well in problems where x is low dimensional (e.g. x is a vector of length up to, say, five elements) and the geometry of \pi(x) is not too complicated. The fact that a Gaussian proposal has support on the entire real space makes it a useful default choice. However, when some elements of x are constrained to “live” in a specific subset of the real space, we should be aware that a significant proportion of proposed values will be rejected (unless we “design” q(\cdot) to intelligently explore only a promising region of the space). Wasting computational time is not something we want right?

For the sake of illustration, let’s assume that x is a positive scalar random variable (see point 8b below for a more general discussion). We could code a Metropolis random walk sampler with Gaussian innovations as

x'=x+\delta \cdot \xi

where the "step length" \delta>0 is chosen by the experimenter and \xi\sim \mathcal{N}(0,1) is a standard Gaussian draw. Therefore q(x'|x)= \mathcal{N}(x,\delta^2). Clearly, x' will be rejected whenever x'\leq 0, since we are assuming that the targeted \pi(x) has positive support.

A simple workaround is to propose \log x' instead of x', that is code

\log x'=\log x+\tilde{\delta}\cdot \xi

for some other appropriate \tilde{\delta}>0. Then we evaluate \pi(x') as \pi(\exp(\log x')), that is we exponentiate the proposal \log x' before “submitting” it to \pi(\cdot). Now if this is all we do, we have a problem.

The thing is, we proposed from a Gaussian, and then we thought that exponentiating the proposal is just fine. It is almost fine, but we must realise that the exponential of a Gaussian draw is log-normally distributed, and we must account for this fact in the acceptance ratio! In other words, we sampled \log x' using a proposal function that here I denote q_1(\cdot) (which is Gaussian), and then by exponentiating we effectively sampled x' from another proposal q_2(\cdot) (which is log-normal).

Therefore, from the definition of the log-normal density \mathcal{LN}, we have (recall that in this example x is scalar)

q_2(x'|x)=\frac{1}{x'\tilde{\delta}(2\pi)^{1/2}}\exp\Bigl(-\frac{1}{2\tilde{\delta}^2}(\log x'-\log x)^2\Bigr)

or equivalently x'\sim \mathcal{LN}(\log x,\tilde{\delta}^2).

Then we notice that this proposal function is not symmetric (see point 7) and \frac{q_2(x|x')}{q_2(x'|x)} = \frac{x'}{x}. Hence the correct sampler does the following: simulate a u\sim U(0,1) and using all the recommendations above we

\text{accept } x' \text{ if } \log u< (\log\pi(x')-\log \pi(x))+(\log x'-\log x)

and reject x' otherwise.

But how would we proceed if we were not given the information that the exponential of a Gaussian variable is a log-normal variable? A simple result on the transformation of random variables will assist us, whenever the transformation g(\cdot) is an invertible function. Say that for a generic x\sim \mathcal{N}(\mu_x,\sigma_x^2) we seek the distribution of y=g(x)=\exp(x); using the transformation rule linked above, y has density function p(y)=\frac{1}{\sigma_x(2\pi)^{1/2}}\exp(-\frac{1}{2\sigma_x^2}(\log y-\mu_x)^2)\times \frac{1}{y}. Here 1/y is the "Jacobian of the transformation".

Therefore, we should not forget the “Jacobians” when transforming the proposed draw. In our example above y is x' and the ratio \frac{x'}{x} in the acceptance probability is the ratio of Jacobians.
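The identity \frac{q_2(x|x')}{q_2(x'|x)}=\frac{x'}{x} can be verified numerically. Below is a Python sketch (the function name lognormal_logpdf and the chosen values of delta, x and x' are mine):

```python
import math

def lognormal_logpdf(y, mu, sd):
    # Log-density of LN(mu, sd^2), i.e. the exponential of a N(mu, sd^2) draw;
    # the -log(y) term is the log-Jacobian of the exp transformation
    return (-math.log(y) - math.log(sd) - 0.5 * math.log(2 * math.pi)
            - 0.5 * ((math.log(y) - mu) / sd) ** 2)

delta = 0.4
x, x_prop = 1.3, 2.1

# log q2(x | x') - log q2(x' | x): the quadratic terms cancel by symmetry,
# leaving exactly log(x'/x), the log of the ratio of Jacobians
log_ratio = (lognormal_logpdf(x, math.log(x_prop), delta)
             - lognormal_logpdf(x_prop, math.log(x), delta))
```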

8b- Remember the transformation Jacobians! Part 2

The discussion in 8a above treats the specific case of using a random walk proposal to target a distribution with positive support. How about different type of constraints? My colleague at Lund University Johan Lindström has written a compact and useful reference table, listing common transformations of constrained proposals generated via random walk.

See also an excellent and detailed post by Darren Wilkinson.

9- Check the starting value of \pi(x)

Generally, it is not really necessary to place checks in your code for infelicities produced by the MCMC sampler. That is, it is not necessary to check each proposed x' for complex values, or to check that \pi(x') does not result in a \texttt{NaN} (not-a-number), or that you do not get, say, \pi(x')=-\infty (yes, these things can happen if you code \pi(x) yourself without caution, or when \pi(x) is the result of some approximation, e.g. when its expression is not analytically known).

In fact, in these cases the comparison u < \alpha involving the uniform draw u from point 1 above will evaluate to a logical \texttt{false} (a \texttt{NaN} compares as neither smaller nor larger than any number), and the proposed x' will be rejected, as it should be.

However, an exception to the above is the evaluation of \pi(x_0) at the starting value x_0. In this case we do want to check that \pi(x_0) is not something unwanted. For example, say that \pi(x_0)=+\infty (or -\infty) and we just put this value in the denominator of the ratio defining \alpha (since this is the starting value, it automatically goes into the denominator). Then, at the first proposed x', we have (assuming \pi(x') is finite)

\displaystyle   \alpha=\min\biggl\{1,\frac{\pi(x')}{+\infty}\frac{q(x|x')}{q(x'|x)}\biggr\} =0 \qquad (\text{and the same for } -\infty \text{ at the denominator}).

Clearly no x' will ever be accepted; therefore the value of \pi(\cdot) in the denominator will stay \infty for all iterations, and the sampler will never move away from x_0.
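One possible way to implement this check, sketched in Python (the function name check_start is my own): evaluate \log\pi(x_0) once before the loop and fail loudly if it is not finite.

```python
import math

def check_start(log_pi, x0):
    # Guard against a bad starting value: an infinite or NaN log pi(x0)
    # placed in the denominator would silently freeze the chain at x0 forever.
    v = log_pi(x0)
    if not math.isfinite(v):
        raise ValueError("log pi(x0) is not finite; pick another x0")
    return v

ok = check_start(lambda x: -0.5 * x * x, 3.0)   # fine: returns -4.5
# check_start(lambda x: float("-inf"), 0.0) would raise ValueError
```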

About Umberto P

15 Responses to Tips for coding a Metropolis-Hastings sampler

  1. Pingback: Why and how pseudo-marginal MCMC work for exact Bayesian inference | Umberto Picchini's research blog

  2. viiv says:

    Thank you for the very useful post, especially 8a and 8b; it was while looking for info on this that I came here.
    There is one point though that is not developed and that causes me trouble in the implementation:

    Let's assume, as in the example, that my proposal is a Normal with sd 'sigma' (represented by 'delta' in your example), and that I need the support to be on [a,b], so I apply the corresponding change of variables listed in the PDF sheet. Now I will still draw my new variable from a Normal, but what should the sd of that Normal be? Because in your post you say "for some other appropriate \tilde{\delta}", but I have not found exactly how to define it yet.


    • Thanks!
      \delta and \tilde{\delta} are tuning parameters that have really nothing to do with the problem treated in 8a-8b. These are simply standard deviations for the proposal mechanism. If you have to hand-pick \tilde{\delta}, just set it small enough that the chain starts moving. There are no other theoretical issues you should be worried about. In fact, it will be the targeted distribution (equipped with the appropriate Jacobian) that takes care of rejecting proposals outside [a,b]. For example, a post by Darren Wilkinson shows how to target a positive distribution using a Gaussian proposal.

      More specifically: standard deviations for proposal densities should either (i) be set appropriately by the statistician, if he/she knows something about the "scale" at which the targeted variable x varies and if the size (dimension) of x is not too large; or (ii) be updated according to some adaptive method. A simple and often effective scheme is the adaptive Metropolis of Haario et al. (the main algorithm is at the beginning of the paper, while the rest is about its theoretical validation).

      So, all in all this is part of the design of an efficient MCMC algorithm that, as I explicitly stated, I wish to avoid discussing as it is problem dependent (“…, or the construction of “ad hoc” proposal distributions are not considered.”)


      • viiv says:

        Thanks a lot for your reply!

        I understand exactly what you are talking about because I have precisely implemented Haario’s AM, and I had read Darren Wilkinson’s post 🙂

        The background is that I am doing parameter estimation with a model that is a black-box function that has no analytical form, I can just run it fast. I implemented MCMC with AM with a multivariate Gaussian proposal, and it works decently well.
        But there are a couple of variables that have support on some [a,b], so I want to try the transformed-variables 'trick' you wrote about to improve the acceptance rate (instead of having many steps rejected because the likelihood is bad, which costs runs of the function and so wastes computational time, as you state).

        My point is that during AM, at the covariance update step (I say covariance matrix for generality, since I assume the multivariate case here, as that’s what I work with, but it shouldn’t change from the 1D case), you have this simple recurrence formula that gives the new covariance of the ‘real’ parameters, not the transformed ones. But now with your scheme, if I understood correctly, I have to draw my new parameters in the ‘transformed space’ as a function of the current transformed parameters, and a certain covariance. And if I want to preserve the efficiency of AM, this covariance cannot be whichever one I want right? Doesn’t it have to be some function of the new one I just calculated? I just would like to know what is this ‘some function’ that gives the optimal covariance for the transformed parameters.

        It is possible I misunderstood something, so to be extra clear (and I am sure it could serve someone else stumbling on this post), here is what I do now with standard MCMC with AM:

        0) I have my current parameters X (here they can be in ]-Inf,+Inf[, but they actually should be in [a,b] for efficiency.)
        1) I update Sigma, the covariance matrix, with the convenient recurrence formula of Haario
        2) I draw X’ ~ Normal(X, Sigma)
        3) I compute the new likelihood \pi(X’)
        4) I check the acceptance by looking at \pi(X’) / \pi(X) (actually I do it with the log as you suggested 🙂 )
        5) I update my chain with either X’ if there is acceptance, or X if not

        Now with your scheme, if I understood correctly, the steps are as follows:

        0) I have my current parameters X
        1) I update Sigma, the covariance matrix, with the convenient recurrence formula of Haario
        1′) I transform them: Y = g(X) = (b + a*exp(-x)) / (1 + exp(-x)) (note: working with exp(-x) is better than exp(x) for numerical purposes)
        2) I draw Y’ ~ Normal(Y, Sigma’) <—– What is Sigma' ???
        2') I compute X' = g^-1(Y') for point 4) and 5)
        3) I compute the new likelihood \pi(Y')
        4) I check the acceptance by looking at \pi(Y') / \pi(Y) * (X'-a)(b-X') / (X-a)(b-X)
        5) I update my chain with either X' if there is acceptance, or X if not

        So as you can see at 2) in the new scheme, I am stuck because I am not sure which sigma to use. I hope I made it clear enough, please don't hesitate to ask if it isn't 🙂


      • honestly, I would set the entire inference by proposing from the transformed parameter immediately, so that you do not compute S and then wonder what S’ should be like. For example, say that you have two parameters x1 and x2. Say that you want parameter x1 to be in [a,b] with a>0 and b>0, whereas x2 is defined on the entire real line. So X =(x1,x2). However set all your code to work with Y = (logx1,x2). So you propose Y’ ~ N(Y,Sigma) where Sigma is automatically updated via Haario’s. Then of course your model simulator/likelihood does not take logx1 as input, and instead wants x1. So take x1:=exp(logx1) before feeding it to the model. Then place a prior on x1 (not logx1), one that has support on [a,b] (truncated normal, uniform, whatever…), and finally multiply by the Jacobian J= x1*/x1_old. This way your prior takes care of the fact that x1 has to be in [a,b], while the Jacobian corrects for the fact that you propose logx1. In the end you will have a chain for logx1 and one for x2. Do you want to report posterior marginals for x1? Take the exponential of the produced chain.

        Now, if this is still causing you trouble, meaning that way too many proposals end up outside the bounds, and this is an issue because your simulator is too expensive, then I wonder whether there is actually a problem with your model representing your data well. Perhaps there is a modelling issue, and the interaction between model and data suggests that areas outside of [a,b] are not implausible.


  3. viiv says:

    Thanks again for the reply!

    Your suggestion makes a lot of sense and sounds easier to implement (though I would really like to know what Sigma’ to use, in a theoretical sense 🙂 ).
    But there is one point that I do not understand in your scheme:
    it seems to me that you propose twice x1, since you propose first Y’ = (log(x1′), x2′) ~ N(Y =(log(x1), x2), Sigma), and then you say to put a prior on x1′ (let’s say a Uniform(a,b)), so I don’t really understand how that would work in practice? Wouldn’t that cancel completely the proposal on log(x1′) if I put a Uniform(a,b) prior on x1′ independently of what I just got as a proposal from the Normal?

    If I understand, you suggest:

    0) Outside of the Metropolis loop, go from X = (x1, x2) to Y = (log(x1), x2)
    In the Metropolis loop:
    1) I have my current parameters Y
    2) I update Sigma, the covariance matrix, with the convenient recurrence formula of Haario
    3) I draw Y’ ~ Normal(Y, Sigma)
    4) I take x1′ = exp(log(x1′))
    5) x1′ ~ Uniform(a,b) <—- This destroys my proposed x1' that I just had?!
    6) I compute the new likelihood \pi(x1’, x2')
    7) I check the acceptance by looking at \pi(X’) / \pi(X) * X'/X
    8) I update my chain with either Y’ if there is acceptance, or Y if not

    As you can see, step 5 does not make a lot of sense to me, but I must have misunderstood something you said.
    It seems to me that it would be even easier to use instead of log, the good function such that Y=(g^-1(x1) = log(x1-a)-log(b-x1), x2), outside the Metropolis loop (then x1=g(y1) is in [a,b]), and put the good Jacobian for the ratio…no?


    • yes, the above is about correct (fixes are below), but first a comment regarding step 5: if by "x'~U(a,b)" you mean "I evaluate the prior U(x';a,b)" (rather than "I sample x' from U(a,b)") then it is ok, and it is not that you are "destroying the proposed x1′". We are free to propose parameters the way we prefer, in this case using a Gaussian, and then have whatever prior is suitable for our data-modelling problem. The proposal just (well) proposes values in a, hopefully, efficient way (the ideal would be a proposal very similar to the posterior target). But you can use whatever prior makes sense for your problem, even if it is not the same as your proposal.

      So, I think it should be:
      0) Outside of the Metropolis loop, go from X = (x1, x2) to Y = (log(x1), x2)
      In the Metropolis loop:
      1) I have my current parameters Y
      2) I update Sigma, the covariance matrix, with the convenient recurrence formula of Haario
      3) I draw Y’ ~ Normal(Y, Sigma)
      4) I take x1′ = exp(y1′), where y1′ = log(x1′) is the first component of Y′
      5) compute the priors \pi(x1′) and \pi(x2′), where for x1’we have say \pi(x1′)=U(x1′;a,b)
      6) I compute the new likelihood \pi(x1’, x2′)
      7) I check the acceptance by looking at (pi(x1’, x2′)/pi(x1, x2)) * (pi(X’) / pi(X)) * X’/X
      8) I update my chain with either Y’ if there is acceptance, or Y if not


    • also see the reply I sent to your Github email address a few days ago


  4. Jim says:

    Dear Professor Picchini:

    Thanks for the post. Very useful.

    However, there is one part of your reply to viiv of 2 October 2018 that is unclear to me.

    Step 7 states “check the acceptance by looking at (pi(x1’, x2′)/pi(x1, x2)) * (pi(X’) / pi(X)) * X’/X.”

    Is the last ratio, X’/X, element by element division or something else?

    Best regards,



  5. Oden says:

    Thanks for this guide. I have a fair knowledge of R and substantial theory in MCMC. Despite this, however, I have for the past month tried without success to implement the model in the following paper ("A Bayesian Semi-Parametric Model for Small Area Estimation", by Donald Malec and Peter Muller). I've created similar (fake) data for the purpose of understanding modelling using Dirichlet Process Mixture (DPM) priors.
    Any help on how to go about this would be highly appreciated. I badly need to understand how to code this in R.

