# Bayesian Workflow

You can use Pumas to perform a full Bayesian analysis workflow including:

• Complex hierarchical and dynamics-based model definition
• Data wrangling
• Simulation
• Posterior sampling
• Convergence diagnostics
• Posterior predictions and probabilistic queries
• Model selection and plotting

In this section, an overview of the Bayesian workflow in Pumas will be presented. To better understand this section, some familiarity with Bayesian inference is required. You can learn about the basics of Bayesian inference using Jose Storopoli (2021). It may also be useful to read this section more than once to fully understand all the concepts since some concepts are related to each other in a circular fashion.

The Bayesian workflow allows analysts to:

1. Incorporate domain knowledge and insights from previous studies using prior distributions.
2. Quantify the epistemic uncertainty in the model parameters' values. Parameter values can be uncertain in reality due to model non-identifiability or practical non-identifiability issues because of lack of data.

This epistemic uncertainty can be further propagated forward via simulation to get a distribution of predictions which when used to make decisions, e.g. dosing decisions, can make those decisions more robust to changes in the parameter values. The above are advantages of Bayesian analysis which the traditional frequentist workflow typically doesn't have a satisfactory answer for. Using bootstrapping or asymptotic estimates of standard errors can be considered somewhat ad-hoc and assumptious methods to quantify uncertainty in the parameter estimates when Bayesian inference uses the established theory of probability to more rigorously quantify said uncertainty with fewer assumptions about the model.

It is important to note that one can still reap the second benefit of Bayesian analysis due to its flexibility even when little to no domain knowledge is imposed on the model. So the Bayesian workflow doesn't force you to incorporate your domain knowledge but it empowers you to if you want.

## Data wrangling and dose regimens

Data wrangling for Bayesian inference in Pumas is identical to regular data wrangling for frequentist workflows. The data can be read from files to data frames and converted to a Pumas population. For more on data wrangling in Pumas and defining dosage regimens, you can check the Dosage Regimens, Subjects, and Populations section of the documentation.

## Model definition

You can define a probabilistic hierarchical model using the @model macro in Pumas. The @model macro is extensively explained in the Defining NLME models in Pumas section of the documentation. The Bayesian workflow in Pumas works with the full range of models you can write using the @model macro including models with complicated dynamics, dose regimens and a large number of population or subject-specific parameters.

### Prior distributions available

(-∞, ∞)(0, ∞)
(0, ∞)(0, 1)

Prior distributions can be used to encode prior knowledge about the values of the parameters in both the @param and @random blocks in a Pumas model. A number of prior distributions are available for use in Pumas with widely different probability density functions (PDFs) as shown in the figure above. Univariate distributions can be used as priors for scalar parameters. Multivariate distributions can be used as priors for vector parameters. And finally matrix-variate distributions can be used as priors for matrix parameters. The following are some of the most used priors available in Pumas (although there are many more available):

• Normal(μ, σ): univariate normal distributions with support (-∞, ∞), mean μ and standard deviation σ.
• LogNormal(μ, σ): univariate log normal distribution with support (0, ∞) and a base normal distribution of mean μ and standard deviation σ.
• MvNormal(μ, Σ): multivariate normal distribution with mean vector μ and covariance matrix Σ. Σ can also be a diagonal matrix, e.g. Diagonal([1.0, 1.0]). If a scalar is input for Σ, it is treated as the standard deviation of all the independent random variables. You can also pass Σ alone as a matrix, e.g. MvNormal(Σ::AbstractMatrix), and the means will be assumed to be 0.
• MvLogNormal(μ, Σ): a multivariate log normal distribution over positive vectors with a base multivariate normal distribution with parameters μ and Σ. The μ and Σ here are identical to the MvNormal ones above.
• Cauchy(μ, σ): a univariate Cauchy distribution with support (-∞, ∞) location μ and scale σ.
• Constrained(dist; lower, upper): a constrained prior distribution with a fixed support (lower, upper) and a fixed base distribution dist that could be any univariate or multivariate distribution. lower and upper are optional keyword arguments that set the bounds on the random variables' support. When dist is a univariate distribution, lower and upper should be scalars. When constraining multivariate distributions, lower and upper can be vectors or scalars. If scalar, the same bound will be used for all random variables. If unset, lower will be -∞ and upper will be ∞. There is also a truncated distribution which is different from Constrained in that it allows the base distribution to be a function of the model's parameters but truncated only supports univariate base distributions. In general, it's recommended to use Constrained in the @param block and truncated in the @random and @derived blocks. Examples:
• Constrained(Normal(); lower = 0.0) is a half normal distribution.
• Constrained(Cauchy(); lower = 0.0) is a half Cauchy distribution.
• Constrained(MvNormal([0.0, 0.0], [1.0 0.0; 0.0 1.0]); lower = 0.0) is a constrained multivariate normal distribution.
• truncated(dist, lower, upper): similar to Constrained with fixed lower and upper bounds lower and upper respectively and a base distribution dist. In truncated, the base distribution dist is allowed to depend on the model's parameters and the renormalization constant is computed in every log pdf evaluation. However, the lower and upper bounds must be fixed constants and truncated only supports univariate base distribution. In general, it's recommended to use Constrained in the @param block and truncated in the @random and @derived blocks. Examples:
• truncated(Normal(), 0.0, Inf) is a half normal distribution.
• truncated(Cauchy(), 0.0, Inf) is a half Cauchy distribution.
• truncated(Normal(), -Inf, 0.0) is a negative half normal distribution.
• Uniform(l, u): a univariate uniform distribution with lower and upper bounds l and u respectively.
• LKJ(d, η): a matrix-variate LKJ prior over correlation matrices of size d × d. η is the positive shape parameter of the LKJ prior.
• Wishart(ν, S): a matrix-variate Wishart distribution over d × d positive definite matrices with ν degrees of freedom and a positive definite S scale matrix.
• InverseWishart(ν, ψ): a matrix-variate inverse Wishart distribution over d × d positive definite matrices with ν degrees of freedom and a positive definite scale matrix ψ.
• Beta(α, β): a univariate Beta distribution with support from 0 to 1 and shape parameters α and β.
• Gamma(α, θ): a univariate Gamma distribution over positive numbers with shape parameter α and scale θ.
• Logistic(μ, θ): a univariate logistic distribution with support (-∞, ∞), location μ and scale θ.
• LogitNormal(μ, σ): a univariate logit normal distribution with support (0, 1) and a base normal distribution with mean μ and standard deviation σ.
• TDist(ν): a univariate Student's T distribution with support (-∞, ∞), ν degrees of freedom and mean 0. To change the mean of the T distribution, you can use a LocationScale distribution (shown below).
• LocationScale(μ, σ, dist): a scaled and translated univariate distribution with a base distribution dist. The base distribution's random variable is first scaled by σ and then translated by μ. Example: LocationScale(1.0, 2.0, TDist(2)) is a scaled and translated students T distribution. The mean of the LocationScale distribution is μ + σ × mean(dist) and the standard deviation is σ × std(dist).
• Laplace(μ, σ): a univariate Laplace distribution with support (-∞, ∞), location μ and scale θ.
• Exponential(θ): a univariate exponential distribution with support (0, ∞) and scale θ.
• (Improper) flat priors: instead of using a distribution, one can specify a domain instead such as a VectorDomain for vector parameters, PSDDomain for positive definite parameters or CorrDomain for correlation matrix parameters. Those domains are treated in Pumas a flat prior. If the domain is open, this would be an improper prior. For about domains, see the Defining NLME models in Pumas section in the documentation or type ? followed by the domain name in the REPL for help.

More generally, the following table shows all the prior distributions available to use in Pumas. To learn more about the constructors of the individual distributions, you can type ? followed by the distribution name in the REPL for help.

SupportDistributions
(0, 1)Beta, KSOneSided, NoncentralBeta,LogitNormal
(0, ∞)BetaPrime, Chi, Chisq, Erlang, Exponential, FDist, Frechet, Gamma, InverseGamma, InverseGaussian, Kolmogorov, LogNormal, NoncentralChisq, NoncentralF, Rayleigh, Weibull
(-∞, ∞)Cauchy, Gumbel, Laplace, Logistic, Normal, NormalCanon, NormalInverseGaussian, PGeneralizedGaussian, TDist
Real vectorsMvNormal
Positive vectorsMvLogNormal
Positive definite matricesWishart, InverseWishart
Correlation matricesLKJ
OtherConstrained, truncated, LocationScale, Uniform, Arcsine, Biweight, Cosine, Epanechnikov, Semicircle, SymTriangularDist, Triweight, Pareto, GeneralizedPareto, GeneralizedExtremeValue, Levy

### Example 1

The following is an example of a 2 compartment population PK model written in Pumas.

poppk2cpt = @model begin
@param begin
tvcl ~ LogNormal(log(10), 0.25)
tvq ~ LogNormal(log(15), 0.5)
tvvc ~ LogNormal(log(35), 0.25)
tvvp ~ LogNormal(log(105), 0.5)
tvka ~ LogNormal(log(2.5), 1)
σ ~ Constrained(Cauchy(0, 5); lower = 0.0)
C ~ LKJ(5, 1.0)
ω ~ Constrained(
MvNormal(zeros(5), Diagonal(0.4^2 * ones(5))),
lower=zeros(5),
)
end
@random begin
ηstd ~ MvNormal(I(5))
end
@pre begin
η = ω .* (getchol(C).L * ηstd)
CL = tvcl * exp(η[1])
Q = tvq * exp(η[2])
Vc = tvvc * exp(η[3])
Vp = tvvp * exp(η[4])
Ka = tvka * exp(η[5])
end
@dynamics Depots1Central1Periph1
@derived begin
cp := @. Central / Vc
dv ~ @. LogNormal(log(cp), σ)
end
end

In the above example I(5) is a 5 × 5 identity matrix and MvNormal(I(5)) is a standard Gaussian distribution with means 0 and covariance I(5).

### Example 2

The following is an example of a PKPD model for Hepatitis C virus (HCV), where the initial values of the PD dynamics state variables depend on the model's parameters, defined using the @init block.

hcv_model = @model begin
@param begin
logθKa ~ Normal(log(0.8), 1)
logθKe ~ Normal(log(0.15), 1)
logθVd ~ Normal(log(100), 1)
logθn ~ Normal(log(2.0), 1)
logθδ ~ Normal(log(0.20), 1)
logθc ~ Normal(log(7.0), 1)
logθEC50 ~ Normal(log(0.12), 1)
ω²Ka ~ Constrained(Normal(0.25, 1); lower = 0.0)
ω²Ke ~ Constrained(Normal(0.25, 1); lower = 0.0)
ω²Vd ~ Constrained(Normal(0.25, 1); lower = 0.0)
ω²n ~ Constrained(Normal(0.25, 1); lower = 0.0)
ω²δ ~ Constrained(Normal(0.25, 1); lower = 0.0)
ω²c ~ Constrained(Normal(0.25, 1); lower = 0.0)
ω²EC50 ~ Constrained(Normal(0.25, 1); lower = 0.0)
σ²PK ~ Constrained(Normal(0.04, 1); lower = 0.0)
σ²PD ~ Constrained(Normal(0.04, 1); lower = 0.0)
end
@random begin
ηKa ~ Normal(0.0, sqrt(ω²Ka))
ηKe ~ Normal(0.0, sqrt(ω²Ke))
ηVd ~ Normal(0.0, sqrt(ω²Vd))
ηn ~ Normal(0.0, sqrt(ω²n))
ηδ ~ Normal(0.0, sqrt(ω²δ))
ηc ~ Normal(0.0, sqrt(ω²c))
ηEC50 ~ Normal(0.0, sqrt(ω²EC50))
end
@pre begin
p = 100.0
d = 0.001
e = 1e-7
s = 20000.0
logKa = logθKa + ηKa
logKe = logθKe + ηKe
logVd = logθVd + ηVd
logn = logθn + ηn
logδ = logθδ + ηδ
logc = logθc + ηc
logEC50 = logθEC50 + ηEC50
end
@init begin
T = exp(logc + logδ) / (p * e)
I = (s * e * p - d * exp(logc + logδ)) / (p * exp(logδ) * e)
W = (s * e * p - d * exp(logc + logδ)) / (exp(logc + logδ) * e)
end
@dynamics begin
X' = -exp(logKa) * X
A' = exp(logKa) * X - exp(logKe) * A
T' = s - T * (e * W + d)
I' = e * W * T - exp(logδ) * I
W' = p / ((A / exp(logVd) / exp(logEC50) + 1e-100)^exp(logn) + 1) * I - exp(logc) * W
end
@derived begin
conc := @. A / exp(logVd)
log10W := @. log10(W)
yPK ~ @. truncated(Normal(A / exp(logVd), sqrt(σ²PK)), 0)
yPD ~ @. truncated(Normal(log10W, sqrt(σ²PD)), 0)
end
end

### Cholesky factor of positive definite parameters

Sometimes when working with a positive definite matrix parameter or a correlation matrix parameter, the Cholesky factor may be needed in the model, e.g. to re-parameterize the subject-specific parameters in terms of a standard Gaussian as shown in the first example model above. In this case, the get_chol function can be called on the matrix parameter to the Cholesky factor. Let C by the correlation or positive definite matrix. The lower triangular factor of C can be obtained using:

get_chol(C).L

You can then use the transpose ' operator to transpose it if needed, e.g. get_chol(C).L'.

### Prior selection

One of the most important steps of a Bayesian workflow is choosing prior distributions for the model's parameters. The prior should generally reflect the modeler's degrees of belief level in the parameter's value which should be based on previous studies or domain knowledge. There is no one prior that fits all cases and generally it might be a good practice to follow a previous similar study's priors where a good reference can be found.

However, if you have the task of choosing good prior distributions for an all-new model, it will generally be a multi-step process consisting of:

1. Deciding the support of the prior. The support of the prior distribution must match the the domain of the parameter. For example, different priors can be used for positive parameters than those for parameters between 0 and 1. The table above can help narrow down the list of options available based on the domain of the distribution.
2. Deciding the center of the prior, e.g. mean, median or mode.
4. Deciding the shape of the probability density function (PDF) of the prior. Some distributions are left skewed, others are right skewed and some are symmetric. Some have heavier tails than others, e.g. the students T distribution is known for its heavier tail compared to a normal distribution. You should make sure the shape of the PDF reflects your knowledge about the parameter value prior to observing the data.

When choosing priors, it might make sense to simulate from the model with the prior disitributions to inspect and plot the model's predictions next to the observations. Multiple instances of parameter values are sampled from the prior distributions and the model is simulated to get multiple predictions out. Those predictions give a distribution of predictions, known as the prior predictive distribution. If the prior predictive distribution has a good overlap with or coverage of the observations, that's usually a good sign. Visual predictive check (VPC) plots can be useful for this. This will be discussed in the next section.

## Simulating from the prior model

It can be useful sometimes to simulate various quantities from the model propagating the uncertainty from the prior distributions to various intermediate quantities in the model and generating predictions. This is possible using the simobs function. The simobs function can be used to sample the parameter values from the prior and then simulate the model using the sampled values. Predictions from the model can further be plotted against the observations to get a feel for the behavior of the prior model. Let the Pumas model with priors be model and the population be data. To simulate 10 scenarios per subject from the prior you can call:

sims = simobs(model, data; samples = 10, simulate_error = true)

This will simulate 10 scenarios for each subject in data. simulate_error is a keyword argument that allows you to either sample from the error model in the @derived block (simulate_error = true) or to simply return the expected value of the error distribution (simulate_error = false) which gives you a distribution of predictions.

Other keyword arguments you can specify include:

• diffeq_options: a NamedTuple of all the differential equations solver's keyword arguments such as

alg, abstol, and reltol. The default values for these are AutoVern7(Rodas5()), 1e-12, and 1e-8 respectively. See the DifferentialEquations.jl documentation for more details.

• rng: (default is Pumas.default_rng()) the random number generator used during MCMC. This can be used for reproducibility.
• obstimes: The time points to simulate at. This defaults to the observation times of the individual subjects if unset.

After simulating, the result sims can then be inspected using a visual predictive check (VPC) plot as such:

vpc_res = vpc(sims)

using PumasPlots
vpc_plot(vpc_res)

To use vpc_plot, you will need to load the PumasPlots package by calling using PumasPlots once as shown above.

## Sampling from the posterior using MCMC

### Overview

Pumas uses the Hamiltonian Monte Carlo (HMC) based, No-U-Turn Sampler (NUTS) algorithm for Markov Chain Monte Carlo (MCMC) to sample from the posterior distribution of the population and subject-specific parameters simultaneously. The implementation used is the one in AdvancedHMC.jl which has been validated against Stan, that is an established implementation of the NUTS algorithm. The NUTS algorithm samples nchains chains from the posterior where each chain has nsamples samples. During the first nadapts < nsamples samples of each chain, the NUTS algorithm adapts its own hyper-parameters.

To sample from the posterior, you should define an algorithm object that includes all the algorithm's keyword arguments, e.g.:

alg = BayesMCMC(nsamples = 1000, nadapts = 500)

The most important keyword arguments you can set in BayesMCMC are:

• target_accept: (default is 0.8) target acceptance ratio for the NUTS algorithm. More insights about this parameter can be found in the Understanding NUTS section in this document.
• nsamples: (default is 10_000) number of MCMC samples to generate (including the adaptation samples).
• nadapts: (default is 2_000) number of adaptation steps in the NUTS algorithm which must be less than nsamples.
• nchains: (default is 4) number of MCMC chains to sample. It is recommended to not use less than 4 chains for any serious analysis but a single chain (nchains=1) may be enough for an initial experimental run. Multiple chains are important to check if the sampler is sensitive to the initial parameter values.
• diffeq_options: a NamedTuple of all the differential equations solver's keyword arguments such as

alg, abstol, and reltol. The default values for these are AutoVern7(Rodas5()), 1e-12, and 1e-8 respectively. See the DifferentialEquations.jl documentation for more details.

• rng: (default is Pumas.default_rng()) the random number generator used during MCMC. This can be used for reproducibility.

Beside the above keyword arguments, some more advanced keyword arguments will be discussed in the next subsections.

To sample from the posterior, you can call the fit function as follows:

alg = BayesMCMC(nsamples = 1_000, nadapts = 200)
iparams = init_params(model)
res = fit(model, data, iparams, alg)
tres = truncate(res; burnin = 200)

where init_params returns a NamedTuple of the default initial values of the parameters. You can inspect the output of init_params and change the initial parameter values by defining your own NamedTuple.

When sampling using the NUTS algorithm, some or all of the adaptation samples should generally be discarded as "burn-in" because the sampler is usually adapting its hyper-parameters and finding the areas of the posterior's support with a high probability mass. Not discarding at least the first few samples can often cause significant bias in the summary statistics of the chains. truncate(res, burnin = 200) removes the first 200 samples from each chain as burnin. You can also do thinning using the truncate function as follows:

tres = truncate(res; burnin = 200, ratio = 0.2)

The above line will remove 200 samples from each chain in res and then keep only approximately 20% of the remaining samples. It will do so by keeping every 5th sample in each chain and discarding the rest. The number of samples per chain in tres therefore becomes (1000 - 200) / 5 = 160.

### Parallelism

When there are multiple threads available, sampling will be multi-threaded by default. There are 2 modes of parallelism in Pumas:

• Multi-threading: uses shared memory parallelism, ideal for single machines with multiple cores and the interactive mode of PumasEnterprise.
• Multi-processing: uses distributed memory parallelism, ideal for supercomputers and the batch mode of PumasEnterprise.

To switch the type of parallelism or turn it off, you can use the ensemblealg keyword argument in the BayesMCMC constructor, e.g:

alg = BayesMCMC(ensemblealg = EnsembleThreads())

The ensemblealg keyword argument can be set to:

• EnsembleSerial(): no parallelism
• EnsembleThreads(): (default value) multi-threading
• EnsembleDistributed(): multi-processing

When using either multi-threading or multi-processing for parallelism, you can choose to parallelize over chains, over subjects or both. This is controlled using the parallel_chains and parallel_subjects keyword arguments. By default, parallelism will be done over both chains and subjects if enough threads/processes exist. For sufficiently complicated models with complex nonlinear dynamics, parallelizing over subjects (within a single chain) generally leads to a speedup. However, parallelizing over subjects has a significant overhead and for simple models it can lead to a slowdown in the sampling so it is not always recommended. In those cases, you can turn off subject parallelism using:

alg = BayesMCMC(
parallel_subjects = false,
parallel_chains = true,
)

or if you are using multi-processing instead, you can use:

alg = BayesMCMC(
ensemblealg = EnsembleDistributed(),
parallel_subjects = false,
parallel_chains = true,
)

When parallelizing over nchains chains, it's recommended to have at least nchains + 1 threads when using EnsembleThreads() or nchains + 1 processes when using EnsembleDistributed(). When parallelizing over chains and subjects simultaneously, the recommended minimum number of threads/processes is 2 × nchains + 1.

### Termination criteria

The sampling will naturally terminate when nsamples have been sampled. However in some cases, it might be desirable to end the sampling when a time limit is reached or when an approximate effective sample size (ESS) per chain has been reached whichever comes first. In Pumas, you can set time-based or ESS-based termination criteria using the following optional keyword argument:

• time_limit: an approximate time limit for sampling in seconds.
• ess_per_chain: a target (approximate) ESS per chain, calculated as the minimum ESS for all the population and subject-specific parameters in a chain.

Calculating an estimate of the ESS can be an expensive process so one can further control how often the ESS calculation is performed using the check_every keyword argument. The default value of check_every is max(2, ess_per_chain ÷ 5). The ESS-based termination criteria is experimental and only approximate so some chains may have an ESS value slightly less than the target ESS value.

### Fixing some population parameters

Pumas will sample from the posterior of all the population and subject-specific parameters by default when using BayesMCMC for the algorithm. It's possible however and sometimes desirable to fix some population parameter values and sample from the conditional posterior of the remaining parameters. This can be useful when experimenting with the model or diagnosing some failing MCMC runs. Fixing some parameters that are suspected to be problematic to known good values can be a useful strategy to introspect into the model to better understand why MCMC is failing.

You can fix some population parameters using the constantcoef keyword argument. constantcoef takes a NamedTuple of the parameters and their fixed values. For example, you can fix the value of the σ parameter to 0.1 using:

alg = BayesMCMC(constantcoef = (σ = 0.1,))

Multiple parameters can be fixed by adding more parameters to the NamedTuple, e.g. constantcoef = (σ = 0.1, tvcl = 1.0). Note that when a NamedTuple has a single element, e.g. (σ = 0.1,), adding the trailing comma after the value is important.

### Marginal MCMC

In some Bayesian analyses, sometimes the posterior of the population parameters only is needed. In other words, subject-specific parameters can be marginalized out. You can do this by either sampling from the full posterior of the population and subject-specific parameters and then simply ignoring subject-specific parameters. Or you can use marginal MCMC to sample directly from the marginal posterior of the population parameters. In Pumas, you can sample from the (approximate) marginal posterior of the population parameters by marginalizing out the subject-specific parameters in the @random block using any of:

• LaplaceI()
• FOCE()
• FO()

The above 3 estimation methods are used to approximately compute the marginal log joint probability of the population parameters and the observations. There are 2 main advantages to this approach compared to sampling from the full posterior and then ignoring parameters:

• The smaller number of parameters that are sampled in the marginal MCMC algorithm allows us to use a dense mass matrix in the NUTS algorithm which gives better adaptation and often more efficient sampling.
• Subject-level parallelism becomes more efficient because the computational load per subject per MCMC step is now much more significant making the overhead less significant. In other words, computing the marginal likelihood is significantly more expensive than computing the conditional likelihood and so each thread or process will likely be doing enough work to make up for the overhead of coordinating multiple threads/processes.

The main disadvantage of the marginal algorithm compared to the full joint MCMC is that it inherits the limitations of the Laplace and FOCE methods in terms of requiring conditional model identifiability with respect to the subject-specific parameters and requiring enough data points per subject. Not respecting such conditions can cause accuracy loss in the samples from the marginal posterior. The marginal MCMC algorithm can therefore be treated an approximate MCMC method.

You can do marginal MCMC in Pumas by defining alg to be an instance of MarginalMCMC instead of BayesMCMC.

alg = MarginalMCMC(
nsamples = 1_000,
marginal_alg = LaplaceI(),
)

All the keyword arguments discussed above for BayesMCMC can also be used in MarginalMCMC.

For advanced users or complicated models, it may be required to change some advanced keyword arguments in the NUTS algorithm such as the maximum tree depth. The default maximum tree depth in Pumas is 10. To change the value of the maximum tree depth, you can pass an AdvancedHMC NUTS algorithm instance to the BayesMCMC or MarginalMCMC constructors using:

alg = BayesMCMC(
alg = GeneralizedNUTS(max_depth = 15),
)

or

alg = MarginalMCMC(
alg = GeneralizedNUTS(max_depth = 15),
)

### Understanding NUTS

The behavior of the NUTS algorithm can often be difficult to reason about since the hyper-parameters of the algorithm interact with each other and with the model in often non-obvious ways. The goal of this section is to help you understand NUTS at an intuitive level and to give general guidelines on how to set the hyper-parameters of the algorithm and what the effect of increasing or decreasing each hyper-parameter is.

#### MCMC intuition

Let's first try to understand MCMC at an intuitive level. MCMC relies on a propose-and-test algorithm that steps from the current parameter values to new values known as the "proposal". This proposal is then tested to be either accepted as the next sample or rejected. If a proposal is rejected, the next sample will be a mere copy of the current sample (the sampler didn't move). If a proposal is accepted, the proposal becomes the next sample. The bases of this acceptance-rejection test (aka Metropolis-Hastings test) are the prior distribution and the likelihood function. The joint probability of the new parameter values and the observed data is compared to the joint probability of the old parameter values and the observed data. A proposal leading to bad predictions that don't fit the data well compared to the previous sample, or a proposal that is improbable according to the prior is more likely to be rejected. On the other hand, a proposal that fits the data better than the previous sample and/or is more probable according to the prior will be more likely to be accepted.

#### Exploration vs exploitation

The NUTS algorithm adapts its stepping algorithm to encourage a certain fraction of the proposals to get accepted on average. This fraction is the target_accept option in BayesMCMC and MarginalMCMC above. A value of 0.99 means that we want to accept 99% of the proposals the sampler makes. This will generally lead to small step sizes between the proposal and the current sample since this increases the chance of accepting such a proposal. On the other hand, a target acceptance fraction of 0.2 means that we want to only accept 20% of the proposals made on average. The NUTS algorithm will therefore attempt larger step sizes to ensure it rejects 80% of the proposals. In general, a target_accept value of 0.6-0.8 is recommended to use.

In sampling, there is usually a tradeoff between exploration and exploitation. If the sampler is too "adventurous", trying aggressive proposals that are far from the previous sample in each step, the sampler would be more likely to explore the full posterior and not get stuck sampling near a local mode of the posterior. However on the flip side, too much exploration will often lead to many sample rejections due to low joint probability of the data and the adventurous proposals. This can decrease the ratio of the effective sample size (ESS) to the total number of samples (aka relative ESS) since a number of samples will be mere copies of each other due to rejections.

On the other hand if we do less exploration, there are 2 possible scenarios:

• The first scenario is if we initialize the sampler from a mode of the posterior. Making proposals only near the previous sample will ensure that we accept most of the samples since proposals near a mode of the posterior are likely to be good parameter values. This local sampling behavior around known good parameter values is what we call exploitation. While the samples generated via high exploitation around a mode may not be representative of the whole posterior distribution, they might still give a satisfactory approximation of the posterior predictive distributions, which is to be judged with a VPC plot.
• The second scenario is if we initialize the sampler from bad parameter values. Bad parameter values and low exploration often lead to optimization-like behavior where the sampler spends a considerable number of iterations moving towards a mode in a noisy fashion. This optimization-like, mode-seeking behavior causes a high auto-correlation in the samples since the sampler is mostly moving in the same direction (towards the mode). A high auto-correlation means a low ESS because the samples would be less independent from each other. Also until the sampler reaches parameter values that actually fit the data well, it's unlikely these samples will lead to a good posterior predictive distribution. This is a fairly common failure mode of MCMC algorithms when the adaptation algorithm fails to find a good stepping algorithm that properly explores the posterior distribution due to bad initial parameters and the model being too complicated and difficult to optimize, let alone sample from its posterior. In this case, all the samples may look auto-correlated and the step sizes between samples will likely be very small (low exploration). It's often helpful to detect such a failure mode early in the sampling and kill the sampling early.

The NUTS algorithm adapts its proposal mechanism to achieve the target acceptance ratio. However, the definition of an "adventurous" proposal may be different depending on the parameter values of the current sample. For relatively flat regions of the posterior where a lot of parameter values are almost equally likely (i.e. they all the fit the data well and are almost equally probable according to the prior), proposals far away from the current sample may still be accepted most of the time and are therefore not that adventurous. This is especially likely in the parts of the posterior where the model is non-identifiable or there are high parameter correlations, and the prior is indiscriminate (e.g. due to being a weak prior).

On the other hand, regions of the posterior that are heavily concentrated around a mode with a high curvature often require a smaller step size to achieve reasonable acceptance ratios, since proposals that are even slightly far from the current sample may be extremely improbable according to the prior or may lead to very bad predictions. This is especially likely in regions of the posterior where the model is highly sensitive to the parameter values or if the prior is too strongly concentrated around specific parameter values.

To account for such variation in curvature (wrt the same parameter in different regions), the NUTS algorithm uses a multi-step proposal mechanism with a fixed step size (determined during adaptation and then fixed) and a dynamic number of steps (dynamic in both adaptation and regular sampling). In other words, the sampler follows a trajectory of $n \geq 1$ steps before proposing a new sample to move to, where $n$ is different in each proposal. This proposal then gets tested and is either accepted or rejected by comparing it to the previous sample. The exact trajectory taken by the NUTS sampler to propose a new sample is obtained by a binary tree search procedure with the maximum depth of the tree being a hyper-parameter of the algorithm. The goal of the tree search is to look for a point in the trajectory where a U-Turn happens, i.e. the sampler is beginning to trace back its steps. Once a U-turn is found, a point (randomly chosen) before the U-Turn becomes the next proposal and the search is terminated. This is typically considered a sign of successful exploration.

This No-U-Turn criteria for proposing samples is why the algorithm is called the No-U-Turn sampler (NUTS). During the adaptation phase, the sampler chooses a step size to best match the target acceptance ratio set by the user. For some complicated models, the adaptation can often result in a step size that is too small. In such a case, no U-Turn may be found early and the tree search may have to run to completion before making a single proposal. If this happens in almost every iteration of the MCMC sampling, this will be bad for performance because the number of model evaluations required by a complete search tree of depth $d$ is $2^d - 1$. The default maximum tree depth (the maximum $d$) in Pumas is 10 (that means $d^10-1= 1023$ evaluations) but it can be changed as was discussed in the Advanced hyper-parameters section.

The step size and trajectory length are not the only ways which the NUTS algorithm uses to adapt the level of exploration in the sampling. The so-called mass matrix (determined during adaptation and then fixed afterwards) allows the sampler to adapt the amount of exploration differently for different directions. This is especially important for models where parameters are on different scales so variables smaller in magnitude should generally be changing less between samples than variables larger in magnitude. In Pumas, we use a diagonal mass matrix in BayesMCMC where different exploration levels are used for different model parameters. If the mass matrix is dense like in the MarginalMCMC algorithm, the exploration level can be different along arbitrary directions and not just along the parameters' axes. This tends to help when the parameters are heavily correlated in the posterior. The reason why it's called the mass matrix is an analogy to Hamiltonian dynamics that the HMC and NUTS algorithms were inspired from.

#### What to do if MCMC fails or is too slow?

Dynamics-based models with complicated stiff differential equations often suffer from sensitivity to parameter values in 2 ways:

• First, small changes in the parameter values may lead to extremely different dynamics and wrong predictions thus leading to rejections.
• And second, changes in the parameter values may make the differential equation highly stiff thus slowing down convergence or even causing divergence of the solver.

In other words, MCMC for some complicated models can often run slow and fail to give good ESS values at the end. In such cases, NUTS may not always be computationally feasible. But you can try any of the following remedies and workarounds to poke at the model:

• Lower the target_accept option. This may alleviate the need for a small step size and a full tree exploration.
• Re-parameterize your model to have less parameter dependence.
• Fix some parameter values to known good values, e.g. values obtained by maximum-a-posteriori (MAP) optimization.
• Initialize the sampling from good parameter values.
• Use a stronger prior around suspected good parameter values.
• Simplify your model, e.g. using simpler dynamics.
• Try the marginal MCMC algorithm MarginalMCMC instead of the full joint MCMC algorithm BayesMCMC.

If you find the sampler regularly hitting the maximum tree depth of 10 in the initial exploration phase, it might make sense to decrease that initially to have quicker iterations when in the exploration phase of the study. This is effectively limiting the level of exploration in the sampling so it might make sense to use good initial values when doing this. However in the final phase of the study, it is best to make sure that the maximum tree depth is not reached by the sampler (increasing it if necessary). This might also slow down your sampling significantly so there can be a tradeoff here. It's also best to ensure that the sampler converges to the posterior when starting from multiple different random initial points using different chains.

## Convergence and diagnostics

MCMC asymptotically converges to the target distribution. That means, if we have all the time in the world, it is guaranteed, irrelevant of the target distribution posterior geometry, MCMC will give you the right answer. However, we don't have all the time in the world. The NUTS algorithm was developed to reduce the sampling (and warmup) time necessary for convergence to the target distribution.

### Can we prove convergence?

In the ideal scenario, the NUTS sampler converges to the true posterior and doesn't miss on any mode. Unfortunately, this is not easy to prove in general and all the convergence diagnostics are only tests for symptoms of lack of convergence. In other words if all the diagnostics look normal, then we can't prove that the sampler didn't converge but we also can't prove that the sampler actually converged. Some signs of lack of convergence are:

• Any of the moments (e.g. the mean or standard deviation) is changing with time. This is diagnosed using stationarity tests by comparing different parts of a single chain to each other.
• Any of the moments is sensitive to the initial parameter values. This is diagnosed using multiple chains by comparing their summary statistics to each other.

While high auto-correlation is not strictly a sign of lack of convergence, samplers with high auto-correlation will require many more samples to get to the same ESS as another sampler with low auto-correlation. So a low auto-correlation is usually more desirable.

### Is exploring the posterior important?

Since we can't prove that the sampler explored the full posterior in general, is exploring the full posterior always absolutely necessary? That depends on what you want to do. If you are trying to answer questions about the parameters, e.g. estimating the probability that an effect is greater than or less than 0 for a go/no-go decision, then you need your sampler to sample from the true posterior. Of course, we cannot prove this in general anyways but you should generally follow all the best practices and you should not ignore signs of lack of convergence. Some bad signs to watch out for if you want to sample from the true posterior are:

1. Non-stationarity of the samples' distribution
2. Dependence of the samples' distribution on the initial parameters after the adaptation steps
3. High auto-correlation in the samples after the adaptation steps
4. Too many rejections and ODE solver divergences
5. Low ESS values relative to the number of samples
6. Extremely small step sizes and hitting the maximum tree depth often

On the other hand, if your goal is not to answer questions about the parameters but only to make predictions using the posterior predictive distribution as an ensemble of predictions, then sampling from the true posterior may not be strictly necessary in this case. If the posterior predictive distribution gives enough accuracy and uncertainty in the predictions to reflect the uncertainty in the unseen data, then that may suffice and we can live with some imperfections in the sampling. Some imperfections in the sampling include:

1. Having to initialize the sampler from a mode to get the sampler to work
2. Using a low maximum tree depth and allowing the maximum to be reached
3. Using a high target acceptance ratio to decrease exploration and sample around a mode
4. High auto-correlation in the samples even after the adaptation steps and low relative ESS
5. Using a few adaptation steps nadapts

If doing any or all of the above resulted in fast sampling that gives a good enough posterior predictive distribution but potentially bad posterior exploration and if predictions are what you care the most about then perhaps you don't need to sample from the true posterior in your use case. Such an imperfect solution is often satisfactory in the context of Bayesian neural networks for example where parameters are generally meaningless and we only care about predictions. Of course, we don't advocate for doing this in general since this goes against the best practices of MCMC but it's an option you have.

### Effective sample size and $\widehat{R}$

The Effective Sample Size (ESS) is an approximation of the number of independent samples generated by a Markov chain. MCMC samples are typically auto-correlated because each sample is generated from a proposal based on the previous sample. The higher the auto-correlation, the lower the ESS of the samples for the same number of samples. The relative ESS is the ratio of the ESS and the total number of samples. Good ESS values to aim for are 100-200 per chain. The ESS per second of a sampler is roughly a measure of how efficient the sampler is at exploring the posterior, assuming it converges to the posterior.

The $\widehat{R}$ is the so-called potential scale reduction which is approximately $1 + r$ where $r$ is the ratio of the variance of chain means to the mean of chain variances. This measures whether the chains have successfully "mixed". The chains are said to have mixed when they have converged to the same distribution with the same mean and variance. When the chains converge, the value of $r$ tends to 0 and that of $\widehat{R}$ tends to 1 as the number of samples goes to infinity.

### Diagnostic plots

There are a number of diagnostic plots which one can use to gain insights into how the sampler is performing. To use the diagnostic plots, you have to load the PumasPlots package first using:

using PumasPlots

For the examples in this section, we'll use the following population PK model:

poppk2cpt = @model begin
@param begin
tvcl ~ LogNormal(log(10), 0.25)
tvq ~ LogNormal(log(15), 0.5)
tvvc ~ LogNormal(log(35), 0.25)
tvvp ~ LogNormal(log(105), 0.5)
tvka ~ LogNormal(log(2.5), 1)
σ ~ Constrained(Cauchy(0, 5); lower = 0.0)
C ~ LKJ(5, 1.0)
ω ~ Constrained(
MvNormal(zeros(5), Diagonal(0.4^2 * ones(5)));
lower=zeros(5),
)
end
@random begin
ηstd ~ MvNormal(I(5))
end
@pre begin
η = ω .* (getchol(C).L * ηstd)
CL = tvcl * exp(η[1])
Q = tvq * exp(η[2])
Vc = tvvc * exp(η[3])
Vp = tvvp * exp(η[4])
Ka = tvka * exp(η[5])
end
@dynamics Depots1Central1Periph1
@derived begin
cp := @. Central / Vc
dv ~ @. LogNormal(log(cp), σ)
end
end

Additionally, let tres be the output of fit followed by truncate.

#### Trace plot

The trace plot of a parameter shows the value of the parameter in each iteration of the MCMC algorithm. A good trace plot is one that:

• Is noisy, not an increasing or decreasing line for example.
• Has a fixed mean.
• Has a fixed variance.
• Shows all chains overlapping with each other, aka chain mixing.

You can plot a trace plot for the population parameters :tvq and :tvcl using:

trace_plot(tres; parameters = [:tvq, :tvcl])

When using BayesMCMC, you can also plot the trace plots of the subject-specific parameters for a selection of subjects using:

trace_plot(tres; subjects = [1, 2])

The above are examples of well mixed chains that don't indicate non-convergence.

#### Cumulative mean plot

The cumulative mean plot of a parameter shows the mean of the parameter value in each MCMC chain up to a certain iteration. An MCMC chain converging to a stationary posterior distribution should have the cumulative mean of each parameter converge to a fixed value. Furthermore, all the chains should be converging to the same mean for a given parameter, the posterior mean. If the cumulative mean curve is not converging or the chains are converging to different means, this is a sign of non-convergence.

You can plot a cumulative mean plot for the population parameter :tvcl using:

cummean_plot(tres; parameters = [:tvcl])

You can plot the cumulative plot of multiple parameters simultaneously by adding more parameters to the parameter names vector, e.g. [:tvq, :tvcl].

When using BayesMCMC, you can also plot the cumulative mean plots of the subject-specific parameters for a selection of subjects using:

cummean_plot(tres; subjects = [1, 2])

#### Density plot

The density plot of a parameter shows a smoothed version of the histogram of the parameter values, giving an approximate probability density function for the marginal posterior of the parameter considered. This helps us visualize the shape of the marginal posterior of each parameter.

You can plot a density plot for the population parameter :tvcl using:

density_plot(tres; parameters = [:tvcl])

You can plot the density plots of multiple parameters simultaneously by adding more parameters to the parameter names vector, e.g. [:tvq, :tvcl].

When using BayesMCMC, you can also plot the density plots of the subject-specific parameters for a selection of subjects using:

density_plot(tres; subjects = [1, 2])

#### Auto-correlation plot

MCMC chains are prone to auto-correlation between the samples because each sample in the chain is a function of the previous sample. The auto-correlation plot shows the correlation between every sample with index s and the corresponding sample with index s + lag for all s ∈ 1:N-lag where N is the total number of samples. For each value of lag, we can compute a correlation measure between the samples and their lag-steps-ahead counterparts. The correlation is usually a value between 0 and 1 but can sometimes be between -1 and 0 as well. The auto-correlation plot shows the lag on the x-axis and the correlation value on the y-axis. For well behaving MCMC chains when lag increases, the corresponding correlation gets closer to 0. This means that there is less and less correlation between any 2 samples further away from each other. The value of lag where the correlation becomes close to 0 can be used to guide the thinning of the MCMC samples to extract mostly independent samples from the auto-correlated samples. The truncate function can be used to perform thinning with the ratio keyword set to 1 / lag for an appropriate value of lag.

truncate(tres; ratio = 1/lag)

You can plot an auto-correlation plot for the population parameter :tvcl using:

autocor_plot(tres; parameters = [:tvcl])

You can plot the auto-correlation plots of multiple parameters simultaneously by adding more parameters to the parameter names vector, e.g. [:tvq, :tvcl].

When using BayesMCMC, you can also plot the auto-correlation plots of the subject-specific parameters for a selection of subjects using:

autocor_plot(tres; subjects = [1, 2])

#### Ridge line plot

The ridge line plot shows similar information as the density plot in addition to the credible interval and quantile information.

You can plot an ridge line plot for the population parameter :tvcl using:

ridgeline_plot(tres; parameters = [:tvcl])

You can plot the ridge line plots of multiple parameters simultaneously by adding more parameters to the parameter names vector, e.g. [:tvq, :tvcl].

When using BayesMCMC, you can also plot the ridge line plots of the subject-specific parameters for a selection of subjects using:

ridgeline_plot(tres; subjects = [1, 2])

#### Corner plot

You can plot a corner plot for the population parameters tvq and :tvcl using:

corner_plot(tres, parameters = [:tvq, :tvvc])

When using BayesMCMC, you can also plot the corner plot of the subject-specific parameters for a selection of subjects using:

corner_plot(tres; subjects = [1, 2])

### Other diagnostics

• When the MCMC algorithm hasn't converged,
• How many samples to throw away as burn-in, or
• How long to run the MCMC for.

#### Geweke diagnostic

gewekediag(tres::BayesMCMCResults; subject::Union{Nothing,Int} = nothing, first::Real = 0.1, last::Real = 0.5)

The above function computes the Geweke diagnostic for each chain outputting a p-value per parameter. If subject is nothing (default), the chains diagnosed are those of the population parameters. If subject is set to an integer index, the chains diagnosed are those of the subject-specific parameters corresponding to the subject with the input index.

The Geweke diagnostic compares the sample means of two disjoint sub-chains X₁ and X₂ of the entire chain using a normal difference of means hypothesis test where the null and alternative hypotheses are defined as:

• H₀: μ₁ = μ₂
• H₁: μ₁ ≂̸ μ₂

where μ₁ and μ₂ are the population means. The first sub-chain X₁ is taken as the first (first * 100)% of the samples in the chain, where first is a keyword argument defaulting to 0.1. The second sub-chain X₂ is taken as the last (last * 100)% of the samples in the chain, where last is a keyword argument defaulting to 0.5.

The test statistic used is: z₀ = x̅₁ - x̅₂ / √(s₁² + s₂²) where x̅₁ and x̅₂ are the sample means of X₁ and X₂ respectively, and s₁ and s₂ are the Markov Chain standard error (MCSE) estimates of X₁ and X₂ respectively. Auto-correlation is assumed within the samples of each individual sub-chain, but the samples in X₁ are assumed to be independent of the samples in X₂. The p-value output is an estimate of P(|z| > |z₀|), where z is a standard normally distributed random variable.

Low p-values indicate one of the following:

• The first and last parts of the chain are sampled from distributions with different means, i.e. non-convergence,
• The need to discard some initial samples as burn-in, or
• The need to run the sampling for longer due to lack of samples or high auto-correlation.

High p-values (desirable) indicate the inability to conclude that the means of the first and last parts of the chain are different with statistical significance. However, this alone does not guarantee convergence to a fixed posterior distribution because:

• Either the standard deviations or higher moments of X₁ and X₂ may be different, or
• The independence assumption between X₁ and X₂ may not be satisfied when high auto-correlation exists.

#### Heidelberger and Welch diagnostic

heideldiag(tres::BayesMCMCResults; subject::Union{Nothing,Int} = nothing, alpha::Real = 0.05, eps::Real = 0.1, start::Integer = 1)

The above function computes the Heidelberger and Welch diagnostic for each chain. If subject is nothing (default), the chains diagnosed are those of the population parameters. If subject is set to an integer index, the chains diagnosed are those of the subject-specific parameters corresponding to the subject with the input index.

The Heidelberger diagnostic attempts to:

• Identify a cutoff point for the initial transient phase for each parameter, after which the samples can be assumed to come from a steady-state posterior distribution. The initial transient phase can be removed as burn-in. The cutoff point for each parameter is given in the burnin column of the output dataframe.
• Estimate the relative confidence interval (confidence interval divided by the mean) for the mean of the steady-state posterior distribution of each parameter, assuming such steady-state distribution exists in the samples. A large confidence interval implies either the lack of convergence to a stationary distribution or lack of samples. Half the relative confidence interval is given in the halfwidth column of the output dataframe. The test column will be true (1) if the halfwidth is less than the input target eps (default is 0.1) and false (0) otherwise. Note that parameters with mean value close to 0 can have erroneous relative confidence intervals because of the division by the mean. The test value can therefore be expected to be false (0) for those parameters without concluding lack of convergence.
• Quantify the extent to which the distribution of the samples is stationary using statistical testing. The returned p-value, shown in the pvalue column of the output dataframe, can be considered a measure of mean stationarity. A p-value lower than the input threshold alpha (default is 0.05) implies lack of stationarity of the mean, i.e. the posterior samples did not converge to a steady-state distribution with a fixed mean.

The Heidelberger diagnostic only tests for the mean of the distribution. Therefore, it can only be used to detect lack of convergence and not to prove convergence. In other words, even if all the numbers seem normal, one cannot conclude that the chain converged to a stationary distribution or that it converged to the true posterior.

#### Raftery and Lewis diagnostic

rafterydiag(tres::BayesMCMCResults; subject::Union{Nothing,Int} = nothing, q = 0.025, r = 0.005, s = 0.95, eps = 0.001)

The above function computes the Raftery and Lewis diagnostic. This diagnostic is used to determine the number of iterations required to estimate a specified quantile q within a desired degree of accuracy. The diagnostic is designed to determine the number of autocorrelated samples required to estimate a specified quantile \thetaq, such that \Pr(\theta \le \thetaq) = q$, within a desired degree of accuracy. In particular, if$\hat{\theta}_q$is the estimand and$\Pr(\theta \le \hat{\theta}_q) = \hat{P}_q$the estimated cumulative probability, then accuracy is specified in terms of r and s, where$\Pr(q - r < \hat{P}_q < q + r) = s$. Thinning may be employed in the calculation of the diagnostic to satisfy its underlying assumptions. However, users may not want to apply the same (or any) thinning when estimating posterior summary statistics because doing so results in a loss of information. Accordingly, sample sizes estimated by the diagnostic tend to be conservative (too large). Furthermore, the argument r specifies the margin of error for estimated cumulative probabilities and s the probability for the margin of error. eps specifies the tolerance within which the probabilities of transitioning from initial to retained iterations are within the equilibrium probabilities for the chain. This argument determines the number of samples to discard as a burn-in sequence and is typically left at its default value. ## Posterior queries ### Summary statistics There are a number of functions you can call on the output of the fit or truncate function. Assume res is the output from fit. You can use the get_params function to get a vector of posterior samples of the population parameter: get_params(res) When using BayesMCMC, you can further get the posterior samples of the subject-specific parameters using: get_randeffs(res, subject_ind) where subject_ind is an integer of the subject index in the fit training population. You can compute summary statistics of the posterior samples using the summarystats function. summarystats(res) summarystats(res; subject = 1) summarystats(res) computes the summary statistics of the population parameters and summarystats(res, subject = 1) computes the summary statistics of the first subject's subject-specific parameters. This includes information about the mean, standard deviation, Monte Carlo standard error (MCSE), ESS,$\widehat{R}$and ESS per second. You can find the probability of certain binary outcomes of the posterior samples using the mean function. For example, the following computes the probability that the population parameter p.θ[1] is less than 0. mean(res) do p p.θ[1] < 0 end The following can be used to compute the probability that the subject-specific parameter η[2] for subject 1 is less than or equal to 0: mean(res; subject = 1) do p p.η[2] <= 0 end You can replace the binary outcome c.η[2] < 0 with an arbitrary function of the parameters. For example, if C was a correlation matrix parameter and you want the mean of the lower triangular Cholesky factors of the C samples, you can do: mean(res) do p cholesky(p.C).L end You can call the cor function on res to get a correlation matrix of the population parameters. cor(res) To get the correlations of the subject-specific parameters for subject i, you can use: cor(res; subject = i) ### Credible intervals The credible interval with a target probability$p$of a parameter is any interval that has a probability mass of approximately$p$according to the posterior distribution. That is the probability that the parameter value is in this interval according to the posterior distribution should be$\approx p$. There are 2 main ways to define the credible interval. The first is using quantiles. Say the target probability is 95%. The interval from the 2.5% quantile to the 97.5% quantile is an interval where 95% of the posterior samples lie, hence it has approximately 95% probability mass. You can do this in Pumas using: quantile(res, [0.025, 0.975]) to get the 2.5% and 97.5% quantiles of all the population parameters. To do the same for subjecct-specific parameters for subject i, you can do: quantile(res, [0.025, 0.975]; subject = i) However, using quantiles is generally going to lead to a wider than necessary interval unless the posterior probability density function is symmetric around the median. In other words, we can often find smaller intervals that still has the same target probability mass of 95%. Such an interval is often known as the highest probability density (HPD) interval. You can get the HPD interval for the 95% target probability using the hpd function in Pumas: hpd(res; alpha = 1 - 0.95) To do the same for the subject-specific parameters of subject i, you can do: hpd(res; subject = i, alpha = 1 - 0.95) Using the HPD interval as the credible interval is generally preferred over the quantile method. ### Simulate from the posterior You can simulate the model under samples from the posterior using the simobs function as such: sims = simobs(tres; samples = 10, simulate_error = true) where tres is the result from the fit function (and optionally truncate). This generates 10 samples from the posterior and simulates the model for each subject in the training data. The resulting simulations will contain all the simulated intermediate quantities from the model, e.g. the @pre block variables and @dynamics solution. simulate_error is a keyword argument that allows you to either sample from the error model in the @derived block (simulate_error = true) or to simply return the expected value of the error distribution (simulate_error = false) which gives you a distribution of predictions, aka the posterior predictive distribution. To simulate for an individual subject with index subject_ind, you can use the following instead: sims = simobs(tres; subject = subject_ind, samples = 10, simulate_error = true) Other keyword arguments you can specify include: • diffeq_options: a NamedTuple of all the differential equations solver's keyword arguments such as alg, abstol, and reltol. The default values for these are AutoVern7(Rodas5()), 1e-12, and 1e-8 respectively. See the DifferentialEquations.jl documentation for more details. • rng: (default is Pumas.default_rng()) the random number generator used during MCMC. This can be used for reproducibility. • obstimes: The time points to simulate at. This defaults to the observation times of the individual subjects if unset. After simulating, the result sims can then be inspected using a visual predictive check (VPC) plot as such: vpc_res = vpc(sims) using PumasPlots vpc_plot(vpc_res) To use vpc_plot, you will need to load the PumasPlots package by calling using PumasPlots once as shown above. The following is a VPC plot for the first subject, obtained using: using PumasPlots sims = simobs(tres, subject = 1, samples = 10, simulate_error = true) vpc_res = vpc(sims) vpc_plot(vpc_res) The result from simobs can be further queried using the postprocess function which you can learn more about from the Post-processing simulations section in the Simulations documentation documentation. ## Crossvalidation and model selection ### Background In the Bayesian workflow, it is common to evaluate and compare models using their predictive power for out-of-sample data. One popular model evaluation metric for out-of-sample prediction accuracy is the so-called expected log predictive density (ELPD). Let$\mathcal{M}$be a model with parameters$\theta \in \Theta$that describes the data generating process of the observed data$y. The ELPD is defined as: \begin{align} \text{ELPD} = \int_{y_{pred}} \log p(y_{pred} | y, \mathcal{M}) p_t(y_{pred}) dy_{pred} \end{align} wherey_{pred}$is unobserved data,$p_t(y_{pred})$is the true data generating distribution of$y_{pred}$(unknown in practice) and$p(y_{pred} | y, \mathcal{M})is the posterior predictive density defined as: \begin{align} p(y_{pred} | y, \mathcal{M}) = \int_{\Theta} p(y_{pred} | \theta, \mathcal{M}) p(\theta | y, \mathcal{M}) d\theta \end{align} wherep(\theta | y, \mathcal{M})$describes the posterior distribution of$\theta$given the observed data$y$and the model$\mathcal{M}$. Since the true data generating distribution is unknown, it is common to approximate ELPD by an empirical distribution over the observed data. One such estimator is the log pointwise predictive density (lppd). Let$y_i$be the$i^{th}$observation and$\theta^{[j]}$be the$j^{th}$sample draw from the posterior$p(\theta | y, \mathcal{M}). The lppd can be calculated using: \begin{align} \text{lppd} & = \frac{1}{N} \sum_{i=1}^N \log p(y_i | y, \mathcal{M}) \\ & = \frac{1}{N} \sum_{i=1}^N \log \int_{\Theta} p(y_i | \theta, \mathcal{M}) p(\theta | y, \mathcal{M}) d\theta \\ & \approx \frac{1}{N} \sum_{i=1}^N \log \Big( \frac{1}{M} \sum_{j=1}^M p(y_i | \theta^{[j]}, \mathcal{M}) \Big) \end{align} A shortcoming of the lppd is that it is not representative of predictive accuracy on unseen data, sincey_i$is used both for inference on the posterior and to evaluate the model out-of-sample. #### Leave-k-out crossvalidation Crossvalidation overcomes this problem by ensuring that$y_i$is not used for inference on the posterior when evaluating the out-of-sample performance for$y_i. The simplest way to divide the data into in-sample and out-of-sample subsets is the leave-one-out (loo) crossvalidation where in each outer iteration, one data point is considered out-of-sample and the remaining are in-sample. The leave-one-out, log predictive density (loo-lpd) is defined as: \begin{align} \text{loo-lpd} & = \frac{1}{N} \sum_{i=1}^N \log p(y_i | y_{-i}, \mathcal{M}) \\ & = \frac{1}{N} \sum_{i=1}^N \log \int_{\Theta} p(y_i | \theta, \mathcal{M}) p(\theta | y_{-i}, \mathcal{M}) d\theta \\ & \approx \frac{1}{N} \sum_{i=1}^N \log \Big( \frac{1}{M} \sum_{j=1}^M p(y_i | \theta^{[j]}_{\backslash i}, \mathcal{M}) \Big) \end{align} wherey_{-i}$is all data excluding$y_i$and$\theta^{[j]}_{\backslash i}$is the$j^{th}$draw of a sample from the posterior$p(\theta | y_{-i}, \mathcal{M})$. This can be generalised to leave$k$-out cross validation where$y_i$is replaced with a vector of observations of length$k. #### Leave-future-k-out crossvalidation When working with time-series data, it can often be more useful to evaluate models based on their ability to predict future values using nothing but past values for training. This gives rise to another variant of crossvalidation called leave-future-one-out (lfoo) crossvalidation and the lfoo-lpd which is defined as: \begin{align} \text{lfoo-lpd} & = \frac{1}{N - t} \sum_{i=t+1}^N \log p(y_i | y_{1:i-1}, \mathcal{M}) \\ & = \frac{1}{N - t} \sum_{i=t+1}^N \log \int_{\Theta} p(y_i | \theta, \mathcal{M}) p(\theta | y_{1:i-1}, \mathcal{M}) d\theta \\ & \approx \frac{1}{N - t} \sum_{i=t+1}^N \log \Big( \frac{1}{M} \sum_{j=1}^M p(y_i | \theta^{[j]}_{\backslash i:N}, \mathcal{M}) \Big) \end{align} wheret$is the minimum number of data points used for training/inference,$y_{1:i-1}$is the past data up to point$i - 1$and$\theta^{[j]}_{\backslash i:N}$is the$j^{th}$draw of a sample from the posterior$p(\theta | y_{1:i-1}, \mathcal{M})$obtained by excluding the points$y_{i:N}from the inference. #### Crossvalidation for hierarchical models When performing crossvalidation in a hierarchical model, there are multiple ways to measure the predictive power of the model. For instance in hierarchical PKPD modeling, the goal is to learn a population model to make predictions on new patients while simultaneously learning subject-specific models to make future predictions for specific subjects having seen their past response to drugs. These models are useful for dose selection and dose adaptation for new or existing patients with the objective of maximizing the therapeutic effect while avoiding toxicity. Depending on the prediction task of interest, one may choose to treat each time observation as a data point or each entire patient/subject as a data point. If the goal is to evaluate the model's ability to predict responses for new patients, leave-one-subject-out or leave-one-future-subject-out crossvalidation should be used. Alternatively, if the goal is to evaluate the model's ability to predict future drug concentrations or any other observable time-dependent quantity the model predicts, then leaving future observations out for each subject makes more sense. This will be called leave-one-observation-out or leave-one-future-observation-out crossvalidation. The choice of what constitutes a point to leave out when doing crossvalidation affects the way the predictive likelihoods are computed: \begin{align} p(y_i | \theta^{[j]}, \mathcal{M}) \end{align} When leaving subjects out, we are often interested in the marginal likelihood of this subject given a posterior draw of the population parameters, marginalizing out the subject-specific parameters. Alternatively, the conditional likelihood can also be used for some default or typical values of the subject-specific parameters, e.g. the mode of the prior distribution. To marginalize subject-specific parameters, approximate integration methods such as LaplaceI and FOCE can be used to obtain the marginal likelihood. On the other hand when leaving observations out in a single subject, the quantity of interest is often the conditional likelihood given each draw from the joint posterior of population and subject-specific parameters of individual subjects given previous data of the subject. #### Pareto smoothed importance sampling crossvalidation Evaluating the loo-lpd or lfoo-lpd is expensive since one needs to draw samples fromN$or$N - t$different posteriors, e.g. from$p(\theta | y_{-i}, \mathcal{M})$for$i=1,\dots,N\$. Typically this will be done by MCMC, e.g. the NUTS algorithm, which in spite of recent progress, remain computationally expensive when the number of parameters is large and the curvature of the posterior is uneven along one or more dimensions.

One approach to overcome this difficulty, is the Pareto-smoothed importance sampling method for leave-one-out, crossvalidation (PSIS-LOO-CV). In PSIS-LOO-CV, MCMC is run only once on the full data. The same samples are then re-used in each outer iteration of CV but using different weights. The weights are determined using importance sampling (IS) by comparing the likelihood with one data point left out to the likelihood of the full dataset. The raw importance weights are then smoothed by fitting them to a generalized Pareto distribution. The smoothed weights can then be used to estimate the ELPD contribution of each data point.

Beside the ability to approximate the ELPD, PSIS-LOO-CV also provides a useful diagnostic which is the shape parameter of the Pareto distribution fitted to the raw weights when leaving out each data point. Data points that when removed lead to a large shape parameter are more influential than data points which have a low shape parameter. For highly influential points where the Pareto shape parameter is higher than 0.7, the ELPD contribution for this point can be considered unreliable. In those cases, resampling from the posterior after removing the influential point is recommended instead of relying on PSIS-LOO-CV.

### Crossvalidation in Pumas

You can do almost all types of crossvalidation for hierarchical models using Pumas. The main function that performs crossvalidation in Pumas is the crossvalidate function. There are 2 inputs to crossvalidate:

• The MCMC result from fit or truncate. We will call this res.
• The crossvalidation algorithm. Let's call it cv_method.
crossvalidate(res, cv_method)

There are 2 families of the crossvalidation algorithms available in Pumas:

• Resampling-based CV which performs the MCMC again for each data point or subset left out. This algorithm is constructed using the ExactCrossvalidation struct.
• PSIS-based CV which uses the importance sampling and weight smoothing approach to avoid the need for resampling. This algorithm is constructed using the PSISCrossvalidation struct.

The constructor for a resampling-based CV method is

cv_method = ExactCrossvalidation(; split_method, split_by, ensemblealg)

This defines an instance of the ExactCrossvalidation algorithm for crossvalidation. In this algorithm, the fitting is re-run while leaving out a subset of the data each time.

The way by which the data is split between training and validation sets is determined using the keyword arguments split_method and split_by. The split_method argument can be of any of the following types:

• LeaveK for leave-K-out crossvalidation
• KFold for K-fold crossvalidation
• LeaveFutureK for leaving K future points at a time

while split_by can be of any of the following types:

• BySubject for leaving out subjects
• ByObservation for leaving out individual observations per subject

Each of the data splitting methods will be discussed in the next section.

Similar to the resampling-based crossvalidation, the constructor for a PSIS-CV method is:

cv_method = PSISCrossvalidation(; split_method, split_by, filter_nans = true, pareto_shape_threshold = 0.7)

This defines an instance of the PSISCrossvalidation algorithm for crossvalidation. The split_method and split_by keyword arguments are similar to the resampling-based CV case. The filter_nans and pareto_shape_threshold keyword arguments can be used to filter out the runs where the importance sampling weight calculation or smoothing operation failed (filter_nans = true) or where the shape parameter was more than pareto_shape_threshold. Filtering those runs out when computing the ELPD can be useful to avoid a few bad runs rendering the whole PSIS-CV method useless. Ideally in those cases, we should re-run the sampling instead of filtering them out but this is currently not supported in Pumas.

### Data splitting methods

The split_method keyword argument determines the way by which we split the data points. split_method can be an instance of any of the following types:

• LeaveK for leave-K-out crossvalidation
• KFold for K-fold crossvalidation
• LeaveFutureK for leaving K future points at a time

The split_by keyword argument determines what a data point and what likelihood to compute, e.g. marginal or conditional. split_by can be an instance of any of the following types:

• BySubject for leaving out subjects
• ByObservation for leaving out individual observations per subject

These types are all explained below.

#### Leave-K-out

The constructor for the leave-K-out data splitting algorithm is:

split_method = LeaveK(; K = 5, shuffle = false, rng = nothing)

In this algorithm, the data is split multiple times into 2 disjoint groups, each time starting from the full data set. The 2 groups are typically called training and validation subsets, where the validation subset has K data points. In the next iteration, the whole data set is re-split using another disjoint subset of K data points as the validation subset. This process is done repeatedly until almost each data point has shown up in 1 and only 1 validation subset.

The data is typically a vector of some sort, e.g. observations or subjects, and the splittings are order-dependent. Before performing the splittings, you can randomly shuffle the data vector by setting the shuffle keyword argument to true (default is false) getting rid of the sensitivity to the original order of the data. You can additionally pass an optional pseudorandom number generator rng to control the pseudo-randomness for reproducibility.

Assume the original data is ["A", "B", "C", "D"]. Leave-one-out splitting without shuffling results in the following splittings of the data:

Training subsetValidation subset
["A", "B", "C"]["D"]
["A", "B", "D"]["C"]
["A", "C", "D"]["B"]
["B", "C", "D"]["A"]

where each data point showed once and only once in a validation subset.

Leave-2-out splitting without shuffling results in the following splittings:

Training subsetValidation subset
["A", "B"]["C", "D"]
["C", "D"]["A", "B"]

#### K-fold

The constructor for the K-fold data splitting algorithm is:

split_method = KFold(; K = 5, shuffle = false, rng = nothing)

In this algorithm, the data is split K times into 2 disjoint groups, each time starting from the full data set. The 2 groups are typically called training and validation subsets, where the validation subset has floor(N / K) data points, N being the total number of data points. In the next iteration, the whole data set is re-split using another disjoint validation subset of floor(N / K) different points, disjoint from the the previous validation subsets. This process is done iteratively until almost each data point has shown up in 1 and only 1 validation subset. If N is divisible by K, each point will show up in 1 and only 1 validation subset. Otherwise, the remainder points will be part of the training subset for all the splittings and will not show up in any validation subset.

The data is typically a vector of some sort, e.g. observations or subjects, and the splittings are order-dependent. Before performing the splittings, you can randomly shuffle the data vector by setting the shuffle keyword argument to true (default is false) getting rid of the sensitivity to the original order of the data. You can additionally pass an optional pseudorandom number generator rng to control the pseudo-randomness for reproducibility.

Assume the original data is ["A", "B", "C", "D"]. 4-fold splitting without shuffling results in the following splittings of the data:

Training subsetValidation subset
["A", "B", "C"]["D"]
["A", "B", "D"]["C"]
["A", "C", "D"]["B"]
["B", "C", "D"]["A"]

where each data point showed once and only once in a validation subset.

2-fold splitting without shuffling results in the following splittings:

Training subsetValidation subset
["A", "B"]["C", "D"]
["C", "D"]["A", "B"]

#### Leave-future-K

The constructor for the leave-future-K data splitting algorithm is:

split_method = LeaveFutureK(; K = 1, minimum = 2)

In this algorithm, the data is assumed to be a time series. The goal is to split the data into "past" and "future". Using this algorithm, the data is split multiple times into 3 disjoint groups where the third group is discarded, each time starting from the full data set. The first 2 groups are typically called the past/training subset and the future/validation subset, where the future validation subset has K future data points. In the next iteration, the whole data set is then re-split using another disjoint subset of K data points as the future validation subset. This process is done iteratively until the training set has no less than minimum number of points remaining. Using this method, each data point can show up in at most 1 future validation subset. The default values of K and minimum are 1 and 2 respectively.

Assume the original data is ["A", "B", "C", "D", "E", "F"]. Leave-1-future-out splitting with minimum = 2 results in the following splittings of the data where the third column is discarded:

Past training subsetFuture validation subsetDiscarded subset
["A", "B", "C", "D", "E"]["F"][]
["A", "B", "C", "D"]["E"]["F"]
["A", "B", "C"]["D"]["E", "F"]
["A", "B"]["C"]["D", "E", "F"]

Leave-2-future-out splitting with minimum = 2 results in the following splittings:

Past training subsetFuture validation subsetDiscarded subset
["A", "B", "C", "D"]["E", "F"][]
["A", "B"]["C", "D"]["E", "F"]

#### Subject-based splitting

The constructor for the subject-based splitting method is:

split_by = BySubject(; marginal = LaplaceI())

Using this method, each subject is treated as a single data point. The predictive loglikelihood computed for each subject can be either the marginal loglikelihood or conditional loglikelihood.

This method has one keyword argument, marginal. If marginal is set to nothing, the predictive loglikelihood computed for each subject is the conditional loglikelihood using 0s for the subject-specific parameters. Otherwise, the predictive loglikelihood computed for each subject is the marginal loglikelihood using marginal as the marginalization algorithm.

The default value of marginal is LaplaceI() which uses the Laplace method to integrate out the subject-specific parameters. Other options include: FOCE() and FO().

#### Observation-based splitting

The constructor for the observation-based splitting method is:

split_by = ByObservation(; allsubjects = true)

Using this method, each observation or collection of observations is treated as a single data point. When computing predictive loglikelihoods using this method, the predictive loglikelihood computed is the conditional loglikelihood of one or more observations for one or more subjects.

This method has one keyword argument, allsubjects. If allsubjects is set to true (the default value), the ith observation of each subject are all grouped together into a single data point. This assumes all subjects have the same number of observations. If allsubjects is set to false, then each observation for each subject is its individual data point.

Assume there are 2 subjects and 3 observations per subject. When using split_method = LeaveK(K = 1) as the splitting method together with split_by = ByObservation(allsubjects = false), the training and validation splittings are:

Training subsetValidation subset
Subj 1 (obs 1, 2, 3), subj 2 (obs 1, 2)Subj 2 (obs 3)
Subj 1 (obs 1, 2, 3), subj 2 (obs 1, 3)Subj 2 (obs 2)
Subj 1 (obs 1, 2, 3), subj 2 (obs 2, 3)Subj 2 (obs 1)
Subj 1 (obs 1, 2), subj 2 (obs 1, 2, 3)Subj 1 (obs 3)
Subj 1 (obs 1, 3), subj 2 (obs 1, 2, 3)Subj 1 (obs 2)
Subj 1 (obs 2, 3), subj 2 (obs 1, 2, 3)Subj 1 (obs 1)

On the other hand if allsubjects is set to true, the training and validation splittings are:

Training subsetValidation subset
Subj 1 (obs 1, 2), subj 2 (obs 1, 2)Subj 1 (obs 3), subj 2 (obs 3)
Subj 1 (obs 1, 3), subj 2 (obs 1, 3)Subj 1 (obs 2), subj 2 (obs 2)
Subj 1 (obs 2, 3), subj 2 (obs 2, 3)Subj 1 (obs 1), subj 2 (obs 1)

### Examples

Assume there are 5 subjects and 10 observations per subject and that res is the result of the fit or truncate functions. The following are some of the combinations in which the above inputs can be used:

• Leave-one-observation-out cross-validation, leaving 1 observation for all the subjects at a time. allsubjects = true means that the same observation index is removed for all the subjects, e.g. the 10th observation for all the subjects is used for validation in the first run, then the 9th observation is used for validation in the second run, etc.
split_method = LeaveK(K = 1)
split_by = ByObservation(allsubjects = true)
cv_method = ExactCrossvalidation(; split_method = split_method, split_by = split_by, ensemblealg = EnsembleThreads())
cv_res = crossvalidate(res, cv_method)
• Leave-two-future-observations-out cross-validation, leaving 2 future observations per subject such that no less than 4 observations are used in training. allsubjects = false means that only the data of one subject at a time will be split. So in the first run, only observations 1 to 4 of subject 1 are used for training and observations 5 and 6 of subject 1 are used for validation. In the second run, observations 1 to 6 of subject 1 are used for training and observations 7 and 8 are used for validation. In the third run, observations 1 to 8 of subject 1 get used for training and observations 9 and 10 are used for validation. For all 3 runs, all the observations of subjects 2 to 5 are used in training. Then in the fourth to sixth runs, subject 2's data gets split. In the seventh to ninth runs, subject 3's data gets split, etc.
split_method = LeaveFutureK(K = 2, minimum = 4)
split_by = ByObservation(allsubjects = false)
cv_method = ExactCrossvalidation(; split_method = split_method, split_by = split_by, ensemblealg = EnsembleThreads())
cv_res = crossvalidate(res, cv_method)
• Leave-one-subject-out cross-validation, marginalizing the subject-specific parameters using FOCE when computing predictive likelihoods. Using this method, subjects {1, 2, 3, 4} are used for training in the first run and subject {5} is used for validation. In the second run, subjects {1, 2, 3, 5} are used for training and subject {4} is used for validation, etc. The predictive likelihoods computed are the marginal likelihood of the subject computed using FOCE for marginalizing the subject-specific parameters. LaplaceI or FO could have also been used instead.
split_method = LeaveK(K = 1)
split_by = BySubject(marginal = FOCE())
cv_method = ExactCrossvalidation(; split_method = split_method, split_by = split_by, ensemblealg = EnsembleThreads())
cv_res = crossvalidate(res, cv_method)
• Leave-one-future-subject-out cross-validation, where the predictive likelihoods are conditional likelihoods computed using 0s for the subject-specific parameters instead of marginalizing. The minimum number of subjects is 3 so there are only 2 runs. In the first run, the training subjects are {1, 2, 3} and the validation subject is {4}. In the second run, the training subjects are {1, 2, 3, 4} and the validation subject is {5}.
split_method = LeaveFutureK(K = 1; minimum = 3)
split_by = BySubject(marginal = nothing)
cv_method = ExactCrossvalidation(; split_method = split_method, split_by = split_by, ensemblealg = EnsembleThreads())
cv_res = crossvalidate(res, cv_method)

### Expected log predictive density (ELPD)

To estimate the ELPD using the output cv_res from the crossvalidate function, you can use:

elpd(cv_res)

This can be used to compare models based on their predictive accuracy on unseen data.