Technical Description of the Estimation Algorithms
In this section, we describe the estimation methods currently available in Pumas.jl. Specifically, we present their mathematical description and provide insight into the advantages and disadvantages of each method. A better understanding of the available estimation algorithms will facilitate modeling work by enabling a more informed and suitable choice of algorithm for the task at hand.
1. Categories of Estimation Algorithms
When estimating parameters in nonlinear mixed effects (NLME) models, several algorithmic approaches are available, which can be broadly categorized as follows:
1.1. Marginal vs. Joint Estimation
Marginal Estimation: In this approach, the random effects (subject-specific parameters) are integrated out, and inference is performed on the population parameters only.
Joint Estimation: Here, both the population parameters and the random effects are estimated simultaneously.
1.2. Bayesian Posterior Sampling vs. Bayesian Point Estimation vs. Frequentist Point Estimation
Bayesian Posterior Sampling: This approach samples from the posterior distribution of the model parameters, which can be either the marginal posterior (integrating out random effects) or the joint posterior (including both population parameters and random effects).
Bayesian Point Estimation: The Maximum A Posteriori (MAP) estimate finds the mode of the posterior distribution, providing a single "best" parameter set given the data and prior. As with posterior sampling, the posterior here can be either marginal (integrating out random effects) or joint (including both population parameters and random effects). This is a Bayesian analogue to maximum likelihood, incorporating prior information but not quantifying full uncertainty.
Frequentist Point Estimation: The Maximum Likelihood Estimate (MLE) finds the parameter values that maximize the likelihood function, without incorporating prior information. In NLME modeling, the likelihood can be either marginal (integrating out random effects) or conditional (conditioning on random effects). In practice, the marginal likelihood is almost always used, as the conditional likelihood is rarely appropriate for NLME models. When marginalizing the random effects, this approach is commonly referred to as empirical Bayes, because it treats the random effects in a probabilistic (Bayesian) manner while treating the population parameters in a frequentist (empirical) way. This is the most common approach for fitting NLME models to data.
In summary, estimation algorithms can be distinguished by whether they marginalize or jointly estimate random effects, and whether they provide full posterior distributions (Bayesian), point estimates with prior information (MAP), or point estimates without priors (MLE). The choice among these depends on the modeling goals, computational resources, and the need for uncertainty quantification.
It is important to note that point estimation methods such as MLE and MAP rely on asymptotic, large-sample assumptions and require that the model parameters are identifiable given the data and priors. These methods provide single "best" estimates and typically use the curvature of the likelihood or posterior to quantify uncertainty, which may not be reliable with small sample sizes or poorly identified models. In contrast, posterior sampling does not depend on these asymptotic or identifiability assumptions, as it characterizes the entire posterior distribution of the parameters. However, posterior sampling is generally much more computationally expensive. Bayesian methods such as MAP and posterior sampling also require careful specification of the prior distributions to ensure meaningful and reliable inference. For more details, see [1].
2. Notation
Notation | Definition |
---|---|
Data | |
$N$ | Number of subjects |
$y_i$ | All the observations of subject $i$ |
$m_i$ | $\mathrm{dim}(y_i)$, the number of observations for subject $i$ |
$y_{i,1:j}$ | The first $j$ observations of subject $i$, where $1 \leq j \leq m_i$ |
$y = \{ y_i,\, \forall\, i \in 1 \dots N \}$ | All the observations of all the subjects |
NLME modeling | |
$\theta$ | All the population parameters |
$p(\theta)$ | Prior distribution of the population parameters |
$\eta_i$ | All the random effects of subject $i$ |
$r$ | Number of random effects per subject |
$p(\eta_i \mid \theta)$ | Prior distribution of the random effects $\eta_i$ |
$\eta = \{ \eta_i,\, \forall\, i \in 1 \dots N \}$ | All the random effects of all the subjects |
Likelihoods | |
$l_i(\theta, \eta_i) = p(y_i \mid \theta, \eta_i)$ | Likelihood of $(\theta, \eta_i)$ given subject $i$'s data $y_i$; conditional probability of $y_i$ given $\theta, \eta_i$ |
$p(y_i \mid \theta) = \int p(y_i \mid \theta, \eta_i)\, p(\eta_i \mid \theta)\, d\eta_i$ | Marginal likelihood of $\theta$ for the $i$th subject |
$p(y \mid \theta) = \prod_{i=1}^N p(y_i \mid \theta)$ | Marginal likelihood of $\theta$ for all subjects |
Marginal likelihood approximation | |
$\Gamma_i(\eta_i)$ | Gradient vector of the log conditional likelihood with respect to $\eta_i$ given observations $y_i$ |
$\Delta_i(\eta_i)$ | Hessian matrix of the log conditional likelihood with respect to $\eta_i$ given observations $y_i$ |
$\Gamma_{ij}(\eta_i)$ | Gradient vector of the log conditional likelihood with respect to $\eta_i$ given observation $y_{ij}$ |
3. Marginal Likelihood Approximation
Whether one is maximizing the marginal likelihood, the marginal posterior, or sampling from the marginal posterior, an important step in all three estimation methods is to compute the marginal likelihood by integrating out (marginalizing) the random effects.
\[\begin{equation} \begin{aligned} p(y \mid \theta) & = \prod_{i=1}^N p(y_i \mid \theta) \\ & = \prod_{i=1}^N \int p(y_i \mid \theta, \eta_i) \cdot p(\eta_i \mid \theta) \, d\eta_i \end{aligned} \label{eq:marginal} \end{equation}\]
Note that, due to the exchangeability assumption, the marginal likelihood of the model given the whole population is the product of the marginal likelihoods for each individual. Unfortunately, the individual integrals in Equation \eqref{eq:marginal} are generally difficult to compute exactly. Therefore, a series of assumptions typically has to be made to allow for the approximation of the marginal likelihood for each subject.
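To make the per-subject integral in Equation \eqref{eq:marginal} concrete, the following minimal sketch (plain Julia, not the Pumas API; the model, parameter values, and data are purely illustrative) approximates the integral for a toy model with a single scalar random effect, both by Monte Carlo sampling from the prior of $\eta_i$ and by brute-force quadrature. For realistic models with several random effects per subject, neither approach scales well, which motivates the approximations listed below.

```julia
using Statistics, Random

Random.seed!(1)

# Toy model (not a Pumas model): y_ij ~ Normal(exp(η_i), σ), η_i ~ Normal(0, ω)
σ, ω = 0.5, 0.3                       # residual SD and random-effect SD (population parameters θ)
y_i  = [1.2, 0.9, 1.1]                # observations of a single subject

normpdf(x, μ, s) = exp(-0.5 * ((x - μ) / s)^2) / (s * sqrt(2π))

# Conditional likelihood p(y_i | θ, η_i): the observations are independent given η_i
cond_lik(η) = prod(normpdf(y, exp(η), σ) for y in y_i)

# Monte Carlo estimate of ∫ p(y_i | θ, η_i) p(η_i | θ) dη_i:
# average the conditional likelihood over draws from the prior of η_i
η_draws     = ω .* randn(100_000)
mc_marginal = mean(cond_lik.(η_draws))

# Reference value by brute-force quadrature on a fine grid (feasible only because r = 1)
grid          = range(-5ω, 5ω; length = 20_001)
quad_marginal = sum(cond_lik(η) * normpdf(η, 0.0, ω) for η in grid) * step(grid)

@show mc_marginal quad_marginal       # the two estimates agree closely
```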
The common approximations, listed in order of decreasing accuracy, are:
- Laplace approximation
- First order conditional estimate (FOCE) approximation
- First order (FO) approximation
We will now briefly describe each of these methods. For more details, see [2].
Stochastic approximation expectation-maximization (SAEM) is another commonly used marginal likelihood maximization algorithm that can, in general, be more accurate than maximizing the above three approximations. However, SAEM typically requires certain restrictions on the model structure for efficient implementation, as well as a large number of individual parameter samples to guarantee convergence. We do not cover SAEM further on this page.
3.1. LaplaceI()
The Laplace marginal likelihood approximation, called LaplaceI() in Pumas.jl, is obtained via a second-order Taylor expansion of the log of the integrand in the marginal likelihood integral. The Taylor series expansion is performed around the mode, $\eta_i^*$, of the integrand.
The mode $\eta_i^*$ is called the empirical Bayes estimate (EBE) of the random effects, or the conditional MAP estimate of $\eta_i$ given $\theta$, and it is obtained by optimizing the integrand with respect to $\eta_i$ for a specific value of $\theta$. Note that $\eta_i^*$ is therefore an implicit function of $\theta$.
\[\begin{equation} \text{EBE}_i = \eta_i^* = \arg \max_{\eta_i} \Big( p(y_i \mid \theta, \eta_i) \cdot p(\eta_i \mid \theta) \Big) \label{eq:ebes} \end{equation}\]
When the LaplaceI() method is used to approximate the marginal likelihood, maximizing the marginal likelihood will involve two levels of estimation:
- For each subject $i$, estimating the random effects $\eta_i$ by maximizing the integrand, with the population parameters $\theta$ held constant. This point estimate is used to construct the approximation of the marginal likelihood for each subject.
- Estimating the population parameters $\theta$ that maximize the marginal likelihood approximation given the whole population, using the individual marginal likelihood approximations obtained in step 1.
Let
\[\begin{equation} \begin{aligned} f(\eta_i) &= p(y_i \mid \eta_i, \theta)p(\eta_i \mid \theta) \\ g(\eta_i) &= \log f(\eta_i) \end{aligned} \label{eq:fg} \end{equation}\]
The general form of the second order Taylor expansion of a scalar-valued function $g(\eta_i)$ around a vector $\eta_i^*$ is:
\[\begin{equation} g(\eta_i) \approx g(\eta_i^*) + (\eta_i - \eta_i^*)^T \cdot g'(\eta_i^*) + \frac{1}{2} (\eta_i-\eta_i^*)^T \cdot g''(\eta_i^*) \cdot (\eta_i-\eta_i^*) \label{eq:taylorseries} \end{equation}\]
where $g'$ and $g''$ are the first- and second-order derivatives of $g$, i.e., the gradient vector and the Hessian matrix when $\eta_i$ is a vector. Recall that the goal is to approximate the integrals $p(y_i \mid \theta) = \int f(\eta_i)\,d\eta_i$ for all $i$, so combining Equations \eqref{eq:fg} and \eqref{eq:taylorseries} we obtain:
\[\begin{equation} \begin{aligned} p(y_i \mid \theta) = \int f(\eta_i)\,d\eta_i &= \int \exp(g(\eta_i))\,d\eta_i \\ &\approx \int \exp \Big( g(\eta_i^*) + \tau^T g'(\eta_i^*) + \frac{1}{2}\tau^T g''(\eta_i^*)\tau \Big) \,d\eta_i \end{aligned} \label{eq:use_taylor} \end{equation}\]
where $\tau = \eta_i - \eta_i^*$.
This integral has an analytic solution:
\[\begin{equation} \begin{aligned} p(y_i\mid\theta) &\approx f(\eta_i^*) \sqrt{\frac{(2\pi)^r}{\vert -g''(\eta_i^*)\vert}} \exp \Big( \frac{-g'(\eta_i^*)^T g''(\eta_i^*)^{-1}g'(\eta_i^*)}{2} \Big) \end{aligned} \label{eq:nllmllapprox} \end{equation}\]
Initially, it may seem that the approximated likelihood does not depend on $\theta$, but recall that $\eta_i^*$ is an implicit function of $\theta$, and $\exp g(\eta_i) = f(\eta_i) = p(y_i \mid \eta_i, \theta) p(\eta_i \mid \theta)$ is an explicit function of $\theta$.
Since $\eta_i^*$ is the mode of the integrand, $g'(\eta_i^*)=0$, so the exponential factor in Equation \eqref{eq:nllmllapprox} becomes $\exp(0) = 1$ and drops out. This simplifies the approximation to:
\[\begin{equation} \begin{aligned} p(y_i\mid\theta) &\approx f(\eta_i^*) \sqrt{\frac{(2\pi)^r}{\vert -g''(\eta_i^*)\vert}} \end{aligned} \label{eq:simpler_nllmllapprox} \end{equation}\]
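To illustrate the two levels of computation, the following minimal sketch (plain Julia, toy model and values only; nothing here is Pumas API) finds the EBE $\eta_i^*$ of Equation \eqref{eq:ebes} by Newton's method and evaluates the approximation in Equation \eqref{eq:simpler_nllmllapprox} for a subject with a single scalar random effect, comparing it against brute-force quadrature. Finite differences stand in for the automatic differentiation that Pumas uses.

```julia
# Toy model: y_ij ~ Normal(exp(η_i), σ), η_i ~ Normal(0, ω); values are illustrative only
σ, ω = 0.5, 0.3
y_i  = [1.2, 0.9, 1.1]

normlogpdf(x, μ, s) = -0.5 * log(2π * s^2) - (x - μ)^2 / (2s^2)

# g(η) = log p(y_i | θ, η) + log p(η | θ), the log of the integrand
g(η) = sum(normlogpdf(y, exp(η), σ) for y in y_i) + normlogpdf(η, 0.0, ω)

# Central finite differences for g' and g'' (Pumas would use automatic differentiation)
dg(η; h = 1e-5)  = (g(η + h) - g(η - h)) / (2h)
d2g(η; h = 1e-4) = (g(η + h) - 2g(η) + g(η - h)) / h^2

# Inner level: find the EBE η* (the mode of the integrand) by Newton's method
function newton_mode(dg, d2g; η0 = 0.0, iters = 25)
    η = η0
    for _ in 1:iters
        η -= dg(η) / d2g(η)           # Newton step toward the stationary point of g
    end
    return η
end
η_star = newton_mode(dg, d2g)

# Laplace approximation of the marginal likelihood (r = 1 random effect here)
laplace = exp(g(η_star)) * sqrt(2π / (-d2g(η_star)))

# Reference value by brute-force quadrature on a fine grid
grid = range(-5ω, 5ω; length = 20_001)
quad = sum(exp(g(η)) for η in grid) * step(grid)

@show η_star laplace quad             # the Laplace value is close to the quadrature value
```

The outer level of estimation would then maximize the product of these per-subject approximations over the population parameters $\theta$.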
This completes the brief description of the Laplace approximation. Both subsequent marginal likelihood approximations build on top of it, introducing additional assumptions and approximations.
3.2. FOCE()
Even though the Laplace approximation makes the approximation of the marginal likelihood tractable, it still carries significant computational costs, chiefly in calculating the Hessian matrix $g''(\eta_i^*)$. Therefore, a further assumption can be made, resulting in a lower-accuracy marginal likelihood approximation called the first order conditional estimate (FOCE), or FOCE() in Pumas.jl.
First, let's express $g'(\eta_i^*)$ and $g''(\eta_i^*)$ in terms of the conditional likelihood $p(y_i\mid\eta_i,\theta)$ and the prior $p(\eta_i\mid\theta)$. Let:
- The gradient of the log conditional likelihood be $\Gamma_i(\eta_i) = \frac{\partial \log p(y_i\mid\eta_i,\theta)}{\partial\eta_i}$, and
- The Hessian be $\Delta_i(\eta_i) = \frac{\partial\Gamma_i(\eta_i)}{\partial\eta_i}$.
We can express $g'(\eta_i^*)$ and $g''(\eta_i^*)$ as:
\[\begin{equation} \begin{aligned} g'(\eta_i^*) &= \Gamma_i(\eta_i^*) + \left.\frac{\partial \log p(\eta_i\mid\theta)}{\partial\eta_i}\right\vert_{\eta_i=\eta_i^*} \\ g''(\eta_i^*) &= \Delta_i(\eta_i^*) + \left.\frac{\partial^2 \log p(\eta_i\mid\theta)}{\partial\eta_i \partial\eta_i^T}\right\vert_{\eta_i=\eta_i^*} \end{aligned} \label{eq:g_delta_gamma} \end{equation}\]
The first and second derivatives of the log prior are often tractable and easy to compute, either analytically or via automatic differentiation. Most of the computational cost of the Laplace method is in computing $\Gamma_i(\eta_i^*)$ and $\Delta_i(\eta_i^*)$ using automatic differentiation.
In FOCE, $\Delta_i(\eta_i^*)$ is approximated by its expectation with respect to $y_i$:
\[\begin{equation} \Delta_i(\eta_i^*) \approx \mathbb{E}_{y_i}(\Delta_i(\eta_i^*)) \end{equation}\]
Since the expected value of the Hessian is equal to the negative of the covariance matrix of the score (the Fisher information identity), we have:
\[\begin{equation} \Delta_i(\eta_i^*) \approx -\mathbb{E}_{y_i}(\Gamma_i(\eta_i^*)\Gamma_i(\eta_i^*)^T) \end{equation}\]
Since all observations $y_{ij}$ of subject $i$ are mutually independent conditional on $\theta$ and $\eta_i$:
\[\begin{equation} \begin{aligned} \mathbb{E}_{y_i}(\Gamma_i(\eta_i^*)\Gamma_i(\eta_i^*)^T) &= \sum_{j=1}^{m_i} \mathbb{E}_{y_{ij}}(\Gamma_{ij}(\eta_i^*)\Gamma_{ij}(\eta_i^*)^T)\\ \Delta_i(\eta_i^*) &\approx -\sum_{j=1}^{m_i} \mathbb{E}_{y_{ij}}(\Gamma_{ij}(\eta_i^*)\Gamma_{ij}(\eta_i^*)^T) \end{aligned} \label{eq:hessian_approx} \end{equation}\]
where $m_i$ is the number of observations for subject $i$ and $\Gamma_{ij}(\eta_i^*)$ is the gradient of the log likelihood given only observation $y_{ij}$, the observation of subject $i$ at time $t_j$. The expectation above can be derived analytically for each $i$ and $j$ for a number of distributions of $y_{ij}$, given the Jacobian of the summary statistics of the distribution of $y_{ij}$ with respect to $\eta_i$. In Pumas, this Jacobian is computed using automatic differentiation.
Note that in Pumas, FO and FOCE are "with interaction", which means that the variance of the response variable can be a function of the random effects.
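The following minimal sketch (plain Julia, illustrative model and data only) shows what the approximation in Equation \eqref{eq:hessian_approx} amounts to for a Gaussian observation model with a constant residual standard deviation: the exact Hessian is replaced by $-\frac{1}{\sigma^2}\sum_j J_j J_j^T$, where $J_j$ is the gradient of the mean of $y_{ij}$ with respect to $\eta_i$. Finite differences stand in for the automatic differentiation used by Pumas.

```julia
σ = 0.1                                          # constant residual SD (no interaction, for simplicity)
t = [0.5, 1.0, 2.0, 4.0]                         # observation times
μ(η, tj) = exp(η[1]) * exp(-exp(η[2]) * tj)      # toy nonlinear mean as a function of η = (η₁, η₂)
η = [0.2, -0.1]                                  # point at which to compare (think of it as an EBE)
y = [μ(η, t[j]) + 0.05 * (-1)^j for j in eachindex(t)]   # toy data with small residuals

# Log conditional likelihood log p(y_i | θ, η) with Normal(μ_j(η), σ) observations
loglik(ηv) = sum(-0.5 * log(2π * σ^2) - (y[j] - μ(ηv, t[j]))^2 / (2σ^2) for j in eachindex(t))

# Central finite differences (standing in for automatic differentiation)
unitvec(n, k) = [i == k ? 1.0 : 0.0 for i in 1:n]
fd_grad(f, x; h = 1e-5) = [(f(x .+ h .* unitvec(length(x), k)) - f(x .- h .* unitvec(length(x), k))) / (2h) for k in 1:length(x)]
function fd_hess(f, x; h = 1e-4)
    n = length(x)
    H = zeros(n, n)
    for a in 1:n, b in 1:n
        ea, eb = h .* unitvec(n, a), h .* unitvec(n, b)
        H[a, b] = (f(x .+ ea .+ eb) - f(x .+ ea .- eb) - f(x .- ea .+ eb) + f(x .- ea .- eb)) / (4h^2)
    end
    return H
end

Δ_exact = fd_hess(loglik, η)                      # exact Hessian Δ_i(η) of the log conditional likelihood

# FOCE-style approximation: Δ_i(η) ≈ -(1/σ²) Σ_j J_j J_jᵀ, with J_j = ∂μ_j/∂η
J(tj) = fd_grad(ηv -> μ(ηv, tj), η)
Δ_foce = -sum(J(tj) * J(tj)' / σ^2 for tj in t)

@show Δ_exact Δ_foce
```

The two matrices are close but not identical; the difference is the residual-weighted curvature term that the FOCE approximation drops.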
3.3. FO()
An even lower-accuracy approximation can be achieved by assuming $\eta_i^* \approx 0$ (or any fixed value), which simplifies the calculations even further. This approximation to the marginal likelihood is called the first order (FO) approximation, or FO() in Pumas.jl. Since the random effects are not estimated in FO, maximizing the FO approximation involves a single level of optimization, making it faster than FOCE() and LaplaceI(). However, FO tends to have a large approximation error and should be used with care. Note that when $\eta_i^* = 0$, $g'(\eta_i^*)$ is no longer 0, so the marginal likelihood approximation is the full expression in Equation \eqref{eq:nllmllapprox}. FO also uses a Hessian approximation similar to the one used in FOCE, but evaluated at $\eta_i^* = 0$.
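To give a feel for the mechanics, the sketch below (plain Julia, the same toy model as in the Laplace sketch above, restated so the block is self-contained) evaluates the expansion of Equation \eqref{eq:nllmllapprox} at $\eta_i = 0$, which is the essence of FO, and compares it with the Laplace approximation at the mode and with brute-force quadrature. For brevity, the exact finite-difference second derivative is used here instead of the expected-Hessian approximation described above.

```julia
# Toy model: y_ij ~ Normal(exp(η_i), σ), η_i ~ Normal(0, ω); values are illustrative only
σ, ω = 0.5, 0.3
y_i  = [1.2, 0.9, 1.1]
normlogpdf(x, μ, s) = -0.5 * log(2π * s^2) - (x - μ)^2 / (2s^2)
g(η)             = sum(normlogpdf(y, exp(η), σ) for y in y_i) + normlogpdf(η, 0.0, ω)
dg(η; h = 1e-5)  = (g(η + h) - g(η - h)) / (2h)
d2g(η; h = 1e-4) = (g(η + h) - 2g(η) + g(η - h)) / h^2

# FO-style value: expand around η = 0; the gradient term no longer vanishes, so it is kept
fo = exp(g(0.0)) * sqrt(2π / (-d2g(0.0))) * exp(-dg(0.0)^2 / (2 * d2g(0.0)))

# Laplace value at the mode η*, found by Newton's method inside a let block
η_star = let η = 0.0
    for _ in 1:25
        η -= dg(η) / d2g(η)
    end
    η
end
laplace = exp(g(η_star)) * sqrt(2π / (-d2g(η_star)))

# Reference value by brute-force quadrature
grid = range(-5ω, 5ω; length = 20_001)
quad = sum(exp(g(η)) for η in grid) * step(grid)

@show fo laplace quad                 # the gaps from the quadrature value are the approximation errors
```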
4. Marginal MAP Estimation
One may often want to include pre-existing knowledge about the population parameters as a prior distribution $p(\theta)$ while still marginalizing the random effects in the likelihood. This requires a slight modification of the optimization objective, giving the maximum a posteriori (MAP) estimate $\theta^*$, which is the mode of the marginal posterior:
\[\begin{equation} \begin{aligned} \theta^* & = \arg \max_\theta \Bigg( p(\theta) \cdot \prod_{i=1}^N p(y_i \mid \theta) \Bigg) \\ & = \arg \max_\theta \Bigg( p(\theta) \cdot \prod_{i=1}^N \int p(y_i \mid \theta, \eta_i) \cdot p(\eta_i \mid \theta) \, d\eta_i \Bigg) \end{aligned} \label{eq:map_mll} \end{equation}\]
Note the inclusion of $p(\theta)$. Even though the objective has been slightly modified, the underlying machinery remains the same and all of the above stipulations apply. In Pumas, MAP(LaplaceI()), MAP(FOCE()), and MAP(FO()) can be used to do marginal MAP estimation, where the input algorithm is used to marginalize the random effects.
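A hypothetical usage sketch is shown below; it assumes that a Pumas model with priors on the population parameters, a population, and an initial parameter set (named model, population, and init_params here) are already defined, and the exact call signature should be checked against the Pumas documentation.

```julia
using Pumas

# Assumes `model`, `population`, and `init_params` are defined elsewhere and that the
# model declares priors p(θ) for the population parameters.
map_foce_fit = fit(model, population, init_params, MAP(FOCE()))

# MAP(LaplaceI()) or MAP(FO()) can be substituted to change how the random effects
# are marginalized.
```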
5. JointMAP Estimation
In a few cases, one may want to jointly estimate the population parameters and random effects by maximizing their joint probability. This method is useful when exploring a model's flexibility and its ability to fit the data during the model development phase. However, it is well known that this approach can lead to inflated estimates of the random-effects (prior) variance and to over-fitting if not used carefully. It is therefore only recommended in the model exploration phase or when a large amount of data is available per subject. To maximize the joint probability in Pumas, you can use the JointMAP() algorithm. JointMAP() has a significantly lower computational cost because it does not need to marginalize the random effects; however, it carries a much higher risk of over-fitting than the marginal approaches.
JointMAP() optimizes the population parameters $\theta$ and the random effects $\eta$ jointly:
\[\begin{equation} \theta^*, \eta^* = \arg \max_{\theta, \eta} \Bigg( p(\theta) \cdot \prod_{i=1}^N p(y_i \mid \theta, \eta_i) \cdot p(\eta_i \mid \theta) \Bigg) \label{eq:jointmap} \end{equation}\]
Note that the priors over the population parameters $p(\theta)$ are included in this objective.
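As a concrete illustration of Equation \eqref{eq:jointmap}, the minimal sketch below (plain Julia, a toy random-intercept model with made-up priors and data, not a Pumas model) evaluates the logarithm of the joint objective for given values of $\theta$ and $\eta$; JointMAP() maximizes this quantity over $\theta$ and $\eta$ simultaneously.

```julia
normlogpdf(x, μ, s) = -0.5 * log(2π * s^2) - (x - μ)^2 / (2s^2)

y = [[1.2, 0.9, 1.1], [0.4, 0.6]]                 # observations for N = 2 subjects

# θ = (population mean m, random-effect SD ω, residual SD σ); η holds one deviation per subject
function joint_log_objective(θ, η)
    m, ω, σ = θ
    logp = normlogpdf(m, 0.0, 10.0) +             # toy weakly informative priors p(θ)
           normlogpdf(log(ω), 0.0, 1.0) +
           normlogpdf(log(σ), 0.0, 1.0)
    for i in eachindex(y)
        logp += normlogpdf(η[i], 0.0, ω)          # log p(η_i | θ)
        logp += sum(normlogpdf(yij, m + η[i], σ) for yij in y[i])   # log p(y_i | θ, η_i)
    end
    return logp
end

@show joint_log_objective((1.0, 0.3, 0.5), [0.1, -0.4])
```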
6. Marginal Likelihood and Generalization under Partial Data
One of the key benefits of using the marginal likelihood as the objective for NLME model parameter optimization is its built-in promotion of generalization to unseen data. To showcase this, we express the marginal likelihood in a slightly different way. Recall that for a single subject $i$, the marginal likelihood is defined via the following integral:
\[\begin{equation} p(y_i \mid \theta) = \int p(y_i \mid \eta_i, \theta) \cdot p(\eta_i \mid \theta) \, d\eta_i \end{equation}\]
However, we can also write the marginal likelihood in a different way, viewing it as sequentially conditioning future observations on past observations. Let $y_{i,1:j}$ be the first $j$ longitudinal observations of subject $i$, where $1 \leq j \leq m_i$ and $m_i$ is the number of longitudinal observations for subject $i$:
\[\begin{equation} p(y_i \mid \theta) = \prod_{j=1}^{m_i} p(y_{i,j} \mid y_{i,1:j-1}, \theta) \label{eq:mll_split} \end{equation}\]
We now consider two cases:
No past observations ($j = 1$):
\[\begin{equation} p(y_{i,1} \mid \theta) = \int p(y_{i,1} \mid \eta_i, \theta) \cdot p(\eta_i \mid \theta) \, d\eta_i \label{eq:j_1} \end{equation}\]
With past observations ($j > 1$):
\[\begin{equation} \begin{aligned} p(y_{i,1:j} \mid \theta) &= p(y_{i,j}, y_{i,1:j-1} \mid \theta) = p(y_{i,j} \mid y_{i,1:j-1}, \theta) \cdot p(y_{i,1:j-1} \mid \theta) \\ &= p(y_{i,1:j-1} \mid \theta) \cdot \int p(y_{i,j}, \eta_i \mid y_{i,1:j-1}, \theta) d\eta_i \\ &= p(y_{i,1:j-1} \mid \theta) \cdot \int p(y_{i,j} \mid \eta_i, y_{i,1:j-1}, \theta) \cdot p(\eta_i \mid y_{i,1:j-1}, \theta) d\eta_i \\ &= p(y_{i,1:j-1} \mid \theta) \cdot \int p(y_{i,j} \mid \eta_i, \theta) \cdot p(\eta_i \mid y_{i,1:j-1}, \theta) d\eta_i \end{aligned} \label{eq:j_gt_1} \end{equation}\]
Recursively applying the above formula to $p(y_{i,1:j-1} \mid \theta)$:
\[\begin{equation} p(y_i \mid \theta) = p(y_{i,1:m_i} \mid \theta) = p(y_{i,1} \mid \theta) \cdot \prod_{j=2}^{m_i} \int p(y_{i,j} \mid \eta_i, \theta) \cdot p(\eta_i \mid y_{i,1:j-1}, \theta) \, d\eta_i \label{eq:mll_next_obs} \end{equation}\]
Here $p(\eta_i \mid y_{i,1:j-1}, \theta)$ is the posterior distribution of $\eta_i$ given the partial data $y_{i,1:j-1}$. In other words, maximizing the marginal likelihood maximizes the probability of each next observation given the past observations, accumulated over all observations. Therefore, the marginal likelihood (unlike the joint probability) promotes generalization performance and discourages over-fitting the training data.
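The decomposition in Equation \eqref{eq:mll_next_obs} can be checked numerically. The minimal sketch below (plain Julia, a conjugate normal-normal toy model with illustrative values) computes the marginal likelihood of one subject both directly, by quadrature over $\eta_i$, and as the product of one-step-ahead predictive densities, each averaging the density of the next observation over the posterior of $\eta_i$ given the past observations.

```julia
normpdf(x, μ, s) = exp(-0.5 * ((x - μ) / s)^2) / (s * sqrt(2π))

# Toy conjugate model: y_ij ~ Normal(η_i, σ), η_i ~ Normal(0, ω); values are illustrative only
σ, ω = 0.5, 0.3
y_i  = [1.2, 0.9, 1.1]

# Direct marginal likelihood p(y_i | θ): quadrature over η_i of p(y_i | θ, η) p(η | θ)
grid   = range(-5ω, 5ω; length = 20_001)
direct = sum(prod(normpdf(y, η, σ) for y in y_i) * normpdf(η, 0.0, ω) for η in grid) * step(grid)

# Sequential form: product of one-step-ahead predictive densities, each integrating the
# density of the next observation against the posterior of η_i given the past observations.
# For this conjugate model both the posterior and the predictive are available in closed form.
function one_step_pred(j)
    past   = y_i[1:j-1]
    v_post = 1 / (1 / ω^2 + length(past) / σ^2)         # posterior variance of η_i | y_{i,1:j-1}
    μ_post = v_post * sum(past; init = 0.0) / σ^2       # posterior mean of η_i | y_{i,1:j-1}
    return normpdf(y_i[j], μ_post, sqrt(σ^2 + v_post))  # predictive density of y_{i,j}
end
sequential = prod(one_step_pred(j) for j in eachindex(y_i))

@show direct sequential               # the two values agree up to quadrature error
```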
7. Joint Posterior Sampling
Besides the point estimation methods, Pumas.jl also supports a Bayesian posterior sampling method, which samples from the joint posterior distribution of the population parameters and random effects. This is useful for quantifying uncertainty in parameter estimates and for making probabilistic predictions. This can be done using the BayesMCMC algorithm in Pumas.jl, which implements the No-U-Turn Sampler (NUTS), a variant in the Hamiltonian Monte Carlo (HMC) family of algorithms. NUTS is a powerful Markov chain Monte Carlo (MCMC) algorithm that can scale to models with many parameters and complex posterior distributions.
The joint posterior distribution is given by:
\[\begin{equation} p(\theta, \eta \mid y) = \frac{p(y \mid \theta, \eta) \cdot p(\eta \mid \theta) \cdot p(\theta)}{p(y)} \end{equation}\]
Note that this is the same as the joint MAP estimation objective up to the normalizing constant $p(y)$, but instead of maximizing it, we sample from the full posterior. By sampling from the posterior, we quantify uncertainty and avoid the over-fitting issues associated with joint MAP estimation.
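A hypothetical usage sketch is shown below; as before, model (with priors on the population parameters), population, and init_params are assumed to be defined elsewhere, and sampler options such as the number of samples and adaptation steps are omitted because their exact keyword names should be taken from the Pumas documentation.

```julia
using Pumas

# Assumes `model`, `population`, and `init_params` are defined elsewhere and that the
# model declares priors p(θ). Sampler options are left at their defaults here.
bayes_fit = fit(model, population, init_params, BayesMCMC())   # NUTS sampling of p(θ, η | y)
```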
8. Marginal Posterior Sampling
In addition to joint posterior sampling, Pumas.jl also supports marginal posterior sampling, which samples from the marginal posterior distribution of the population parameters by integrating out the random effects. This is useful when the focus is on estimating the population parameters rather than on individual subject posteriors. This can be done using the MarginalMCMC algorithm in Pumas.jl, which also implements the NUTS algorithm. MarginalMCMC uses a dense mass matrix as part of the NUTS algorithm, as opposed to the diagonal mass matrix used in BayesMCMC, which allows for more efficient sampling. However, each step of the MarginalMCMC algorithm requires marginalizing the random effects, which can be computationally expensive. The marginalization algorithm, such as LaplaceI(), FOCE(), or FO(), is passed as an argument to MarginalMCMC. MarginalMCMC can also be used as an uncertainty quantification method for the population parameters via infer, after a marginal MAP or marginal maximum likelihood estimation step using fit.
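A hypothetical usage sketch, under the same assumptions as the previous sketches and with signatures to be confirmed against the Pumas documentation, is:

```julia
using Pumas

# Sample from the marginal posterior p(θ | y), with FOCE() used to marginalize η at each step.
marginal_fit = fit(model, population, init_params, MarginalMCMC(FOCE()))

# Alternatively, use MarginalMCMC for uncertainty quantification after a point-estimation fit.
foce_fit  = fit(model, population, init_params, MAP(FOCE()))
inference = infer(foce_fit, MarginalMCMC(FOCE()))
```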