## State space models and intractable likelihoods

In this first series of posts, I introduce important tools to construct inference methods for the estimation of parameters $\theta$ in stochastic models. Stochastic models are characterized by randomness in their mathematical nature, and since at first I focus on models having dynamic features, these models are defined by stochastic processes.

I will start by introducing a class of dynamic models known as state space models.

For general (non-linear, non-Gaussian) state space models it is only relatively recently that a class of algorithms for exact parameter inference has been devised, in the Bayesian framework.  In a series of 4-5 posts I will construct the simplest example of this class of pseudo-marginal algorithms, now considered the state-of-art tool for parameter estimation in nonlinear state space models. Pseudo-marginal methods are not exclusively targeting state space models, but are able to produce exact Bayesian inference whenever a positive and unbiased approximation of the likelihood function is available, no matter the underlying model.

I will first define a state space model, then introduce its likelihood function, which turns out to be intractable. I postpone to the next post the construction of Monte Carlo methods for approximating the likelihood function.

### State space models

A very important class of models for engineering applications, signal processing, biomathematics, systems biology, ecology etc., is the class of state-space models (SSM). [In some literature the terms SSM and hidden Markov model (HMM) have been used interchangeably, though some sources make the explicit distinction that in HMM states are defined over a discrete space while in SSM states vary over a continuous space.]

Suppose we are interested in modelling a system represented by a (possibly multidimensional) continuous-time stochastic process $\{X_t\}_{t\geq 0}$, where $X_t$ denotes the state of the system at a time $t\geq 0$. The notation $\{X_t\}_{t\geq 0}$ denotes the ensemble of possible values taken by the system for a continuous time $t\geq 0$.

However, in many experimental situations the experimenter does not have access to  measurements from  $\{X_t\}_{t\geq 0}$ but rather to noisy versions corrupted with “measurement error”. In other words the true state of the system is unknown, because $\{X_t\}_{t\geq 0}$ is latent (unobservable), and we can only get to know something about the system via some noisy measurements. I denote the available measurements (data) with $y_1,y_2,...,y_T$ and use $\{Y_t\}_{t\geq 1}$ to denote the process producing the actual observations at $T$ discrete time points. For simplicity of notation I assume that measurements are obtained at integer observational times $\{1,2...,T\}$. Each $y_t$ can be multidimensional ($t=1,...,T$) but it does not need to have the same dimension of the corresponding $X_t$, for example some coordinate of $\{X_t\}_{t\geq 0}$ might be unobserved. Therefore, the only available information we get from our system is rather partial: (i) the system of interest is continuous in time but measurements are obtained at discrete times and (ii) measurements do not reflect the true state of the system $\{X_t\}_{t\geq 0}$, because the $y_1,...,y_T$ are affected with some measurement noise. For example we could have $y_t=X_t + \varepsilon_t$, with $\varepsilon_t$ some random noise.

In general $\{X_t\}_{t\geq 0}$ and $\{Y_t\}_{t\geq 1}$  can be either continuous– or discrete–valued stochastic processes. However in the following I assume both processes to be defined on continuous spaces.

I use the notation $z_{1:T}$ to denote a sequence $\{z_1,...,z_T\}$. Therefore, data can be written $y_{1:T}$. For the continuous time process $\{X_t\}_{t\geq 0}$ I use $X_{1:T}=\{X_1,...,X_T\}$  to denote realizations of the process at times $\{1,2...,T\}$. Clearly, none of the $X_t$ values is known.

Assume that the dynamics of $\{X_t\}_{t\geq 0}$  are parametrized by a model having a (vector) parameter $\theta$. The value of $\theta$ is unknown and our goal is to learn something about $\theta$ using available data. That is to say, we wish to produce inference about $\theta$. I could write $\{X_t(\theta)\}_{t\geq 0}$  in place of $\{X_t\}_{t\geq 0}$, but this makes the notation a bit heavy.

State space models are characterized by two properties: Markovianity of the latent states and conditional independence of measurements.

Markovianity: $\{X_t\}_{t\geq 0}$ is assumed a Markov stochastic process, with transition density $p(x_t|x_s)$, for $s. That is, “given the present state, the past is independent of the future”, so if time $s$ is the “present”, then $p(x_t|\text{history of } \{X_u\}_{u\leq s}, y_{1:t-1})\equiv p(x_t|x_s)$. Also in this case, for simplicity we assume implicit the conditioning on $\theta$, instead of writing $p(x_t|x_s;\theta)$. Specifically for our inference goals, we are interested in transitions between states corresponding to contiguous (integer) observational times, that is $p(x_t|x_{t-1})$. Also “the past is independent of the future, given the present”, meaning that $p(x_{t-1}|x_{t:T},y_{t:T})=p(x_{t-1}|x_t)$.

Conditional independence: measurements are assumed conditionally independent given a realization of the corresponding latent state $\{X_t\}$. That is, $p(y_t|x_{0:t},y_{1:t-1})=p(y_t|x_t)$.

Markovianity and conditional independence can be represented graphically:

Notice each white node $x_t$ is only able to directly influence the next white node $x_{t+1}$ and $y_t$. Also, each grey node $y_t$ is unable to influence other measurements directly. [This does not mean observations are independent: for example $p(y_2|y_1)=\frac{1}{p(y_1)}\int p(y_1,y_2,x_{1:2})dx_{1:2 }=\frac{1}{p(y_1)}\int p(y_1,y_2|x_{1},x_2)p(x_{1},x_2)dx_{1:2 }=\frac{1}{p(y_1)}\int p(y_2|x_2)p(y_1|x_1)p(x_2|x_1)p(x_1)dx_{1:2}$ and evidently it results in $p(y_2|y_1) \neq p(y_2)$. If equality did hold $y_2$ would be independent of $y_1$.]

Notice $x_0$ is the initial state of the system at some arbitrary time prior to the first observational time $t=1$. By convention we can set this arbitrary time to be $t=0$.

In summary, a compact, probabilistic representation of a state-space model is given by the conditional distribution of the observable variable at $t$ given the latent state, and the distribution of the evolution of the (Markovian) state, that is the transition density. Optionally, the initial state $x_0$ can be a fixed deterministic value or have its own (unconditional) distribution $p(x_0)$ which might depend or not on $\theta$.

\begin{aligned} Y_t | X_t = x_t & \sim p(y_t | x_t;\theta) &\mbox{(Observation density)} \\ X_{t } | X_{t-1} = x_{t-1} & \sim p(x_{t } | x_{t-1};\theta) &\mbox{(Transition density)} \\ X_0 & \sim p(x_0) &\mbox{(Initial distribution)}. \end{aligned}

Example: Gaussian random walk

A trivial example of a (linear, Gaussian) state space model is

\begin{aligned} y_t &= x_{t}+e_{t}, \quad e_{t}\sim_{iid} N(0,r^2)\\ x_t &= x_{t-1}+\xi_{t}, \quad \xi_{t}{\sim}_{iid} N(0,q^2),\quad x_0=0. \end{aligned}

with $\theta=(q,r)$. Therefore

\begin{aligned} p(y_t|x_{t})&=N(y_t;x_{t},r^2)\\ p(x_t|x_{t-1})&=N(x_t;x_{t-1},q^2). \end{aligned}

### Parameter inference and the likelihood function for SSMs

As anticipated, I intend to cover tools for statistical inference for the vector of parameters $\theta$, and in particular discuss Bayesian inference methods for SSM.

This requires introducing some quantities:

• $p(y_{1:T}|\theta)$ is the likelihood function of $\theta$ based on measurements $y_{1:T}$.
• In the Bayesian framework $\theta$ is a random quantity and $\pi(\theta)$ is its prior density (I always assume continuous-valued parameters). It encodes our knowledge about $\theta$ before we “see” the current data $y_{1:T}$.
• The Bayes theorem gives the posterior distribution of $\theta$, enclosing uncertainty about $\theta$ for given data:
$\pi(\theta|y_{1:T})=\frac{p(y_{1:T}|\theta)\pi(\theta)}{p(y_{1:T})}$.
• $p(y_{1:T})$ is the marginal likelihood (evidence), given by $p(y_{1:T})= \int p(y_{1:T}|\theta)\pi(\theta)d\theta$.
• inference based on $\pi(\theta|y_{1:T})$ is called Bayesian inference.

Goal: we wish to produce Bayesian inference on $\theta$. In principle this involves  writing down the analytic expression of $\pi(\theta|y_{1:T})$ and study its properties. However, for models of realistic complexity, what is typically performed is some kind of Monte Carlo sampling of pseudo-random draws from the posterior $\pi(\theta|y_{1:T})$. Then we can have a finite-samples approximation  of the marginal posteriors $\pi_j(\theta|y_{1:T})\equiv \pi(\theta_j|y_{1:T})$ ($j=1,...,\dim(\theta)$) compute the sample means of the marginals, quantiles etc. This way we perform uncertainty quantification for all components of $\theta$, for a given model and available data.

Now, the main problem preventing a straightforward sampling from the posterior, is that for nonlinear, non-Gaussian SSM the likelihood function is not available in closed form nor it is possible to derive a closed-form approximation. It is intractable. Let’s see why.

In a SSM data $y_1,...,y_T$ are not independent, they are only conditionally independent. This means that we cannot write $p(y_1,...,y_T|\theta)$ as a product of unconditional densities: instead we have

$p(y_{1:T}|\theta)=p(y_1|\theta)\prod_{t=2}^Tp(y_{t}|y_{1:t-1},\theta)$

with the convention $p(y_1|y_{1:0},\theta)\equiv p(y_1|\theta)$.

The problem is that all densities in the expression above are unavailable in closed form, hence unknown. If these were available we could either use an algorithm to find a (local) maximizer $\hat{\theta}$ to $p(y_{1:T}|\theta)$ (the maximum likelihood estimate of $\theta$), or plug the likelihood into the posterior $\pi(\theta|y_{1:T})$ and perform Bayesian inference.

The reason for the unavailability of a closed-form expression for the likelihood is the latency of process $\{X_t\}_{t\geq 0}$, on which data depend. In fact we have:

\begin{aligned} p(y_{1:T}|\theta)&=\int p(y_{1:T},x_{0:T}|\theta)dx_{0:T} = \int p(y_{1:T}|x_{0:T},\theta)\times p(x_{0:T}|\theta)dx_{0:T}\\ &=\int \prod_{t=1}^T p(y_{t}|x_{t},\theta)\times \biggl\{p(x_0|\theta)\prod_{t=1}^T p(x_{t}|x_{t-1},\theta)\biggr\} dx_{0:T}. \end{aligned}

The expression above is intractable for two reasons:

• it is a $(T+1)$-dimensional integral, and
• for most (nontrivial) models, the transition densities $p(x_{t}|x_{t-1})$ are unknown.

Basically the only way to solve the integral when $T$ gets large is via Monte Carlo methods. A special case amenable for closed-form computation is the linear SSM with Gaussian noise (see the Gaussian random walk example): for this case the Kalman filter can be employed to return exact maximum likelihood inference. In the SSM literature important (Gaussian) approximations are given by the extended and unscented Kalman filters.

However, for nonlinear and non-Gaussian SSM, sequential Monte Carlo methods (or “particle filters”) are presently the state-of-art methodology. Sequential Monte Carlo (SMC) methods will be considered in a future post. SMC is a convenient tool to implement the pseudo-marginal methodology for exact Bayesian inference that I intend to outline in this first series of posts.

### Summary

I have introduced minimal notions to set the inference problem for parameters of state space models (SSM). The final goal is to summarize a methodology for exact Bayesian inference, the pseudo-marginal method. This will be outlined in future posts, with focus on SSM. I have also stated some of the computational issues preventing a straightforward implementation of likelihood based methods for the parameters of SSM. In the next two posts I consider some Monte Carlo strategies for approximating the likelihood function.