Estimating Parameters of Pumas Models
Pumas can use the observational data of a Subject
or Population
to estimate the parameters in many types of models. This is done by two classes of methods. First, maximum likelihood methods find the parameters such that the observational data has the highest probability of occurring according to the chosen error distributions. Second, Bayesian methods find a posterior probability distribution for the parameters to describe the chance that a parameter has a given value given the data. The following section describes how to fit an NLME model in Pumas via the two methods.
Maximum likelihood methods find the parameters such that the observational data has the highest probability of occurring according to the chosen error distributions.
Bayesian methods find a posterior probability distribution for the parameters to describe the chance that a parameter has a given value given the data.
Maximum A Posteriori (MAP) can be considered as something in between maximum likelihood estimation and Bayesian inference. It incorporates priors of some or all variables and can be interpreted as penalized maximum likelihood or finding the mode of the posterior distribution in a Bayesian context.
The following sections describe how to fit a NLME model in Pumas via the two methods.
Defining Data for Estimation
The observed data should be parsed using the same names as those found in the model. For example, if subject.observations
is a NamedTuple
with names dv
and resp
, the @derived
block or function in the PumasModel
should define distributions with matching names. Similarly, the error
block should define distributions with matching names if using a PumasEMModel
instead. If dv
is a scalar in the observation data, then dv
from @derived
should also be a scalar. Likewise, if dv
is an array like a time series, then dv
should be a size-matching time series when returned from @derived
. The likelihood of observing multiple dependent variables is calculated under the assumption of independence between the two.
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is performed using the fit
function:
StatsAPI.fit
— Methodfit(
model::PumasModel,
population::Population,
param::NamedTuple,
approx::Union{LikelihoodApproximation, MAP};
optim_alg = Optim.BFGS(linesearch = Optim.LineSearches.BackTracking(), initial_stepnorm = 1.0),
optim_options::NamedTuple = NamedTuple(),
optimize_fn = nothing,
constantcoef::Tuple = (),
ensemblealg::SciMLBase.EnsembleAlgorithm = EnsembleThreads(),
checkidentification = true,
diffeq_options = NamedTuple(),
verbose = true,
ignore_numerical_error = true,
)
Fit the Pumas model model
to the dataset population
with starting values param
using the estimation method approx
. Currently supported values for the approx
argument are FO
, FOCE
, LaplaceI
, NaivePooled
, and BayesMCMC
. See the online documentation for more details about the different methods.
To control the optimization procedure for finding population parameters (fixed effects), use the optim_alg
and optim_options
keyword arguments. In previous versions of Pumas the argument optimize_fn
was used, but is now discouraged and will be removed in a later version of Pumas. These options control the optimization all approx
methods except BayesMCMC
. The default optimization function uses the quasi-Newton routine BFGS
method from the Optim
package. It can be changed by setting the optim_alg
to an algorithm implemented in Optim.jl as long as it does not use second order derivatives. Optimization specific options can be passed in using the optim_options
keyword and has to be a NamedTuple with names and values that match the Optim.Options
type. For example, the optimization trace can be disabled and the algorithm can be changed to L-BFGS by passing optim_alg=Optim.LBFGS(), optim_options = ;(show_trace=false)
to fit
. See Optim
for more defails.
It is possible to fix one or more parameters of the fit by passing a Tuple of Symbols as the constantcoef
argument with elements corresponding to the names of the fixed parameters, e.g. constantcoef=(:σ,)
.
When models include an @random
block and fitting with NaivePooled
is requested, it is required that the user sets all variability parameters to zero with constantcoef
such that these can be ignored in the optimization, e.g. constantcoef = (:Ω,)
while overwriting the corresponding values in param
with (; init_params..., Ω = zeros(3, 3))
.
Parallelization of the optimization is supported for most estimation methods via the ensemble interface of DifferentialEquations.jl. Currently, the only supported options are:
EnsembleThreads()
: the default. Accelerate fits by using multiple threads.EnsembleSerial()
: fit using a single thread.EnsembleDistributed()
: fit by using multiple worker processes.
The fit
function will check if any gradients and throw an exception if any of the elements are exactly zero unless checkidentification
is set to false
.
Further keyword arguments can be passed via the diffeq_options
argument. This allows for passing arguments to the differential equations solver such as alg
, abstol
, and reltol
. The default values for these are AutoVern7(Rodas5P(autodiff=true))
, 1e-12
, and 1e-8
respectively. See the DifferentialEquations.jl documentation for more details.
The keyword verbose
controls if info statements about initial evaluations of the loglikelihood function and gradient should be printed or not. Defaults to true.
Since the numerical optimization routine can sometimes visit extreme regions of the parameter space during it's exploration we automatically handle the situation where the input parameters result in errors when solving the dynamical system. This allows the algorithm to recover and continue in many situations that would have otherwise stalled early. Sometimes, it is useful to turn this error handling off when debugging a model fit, and this can be done by setting ignore_numerical_error = false
.
The return type of fit
is a FittedPumasModel
.
Optimization option
As can be seen from the docstring
above, we can set options related to optimization of the parameters using optim_alg
and optim_options
for the fitting methods that are based on optimization of fixed effects (population parameters). Methods that do not require such maximization over fixed effects would be BayesMCMC
and SAEM
. Additionally, there is an inner level of optimization for the methods based on Laplace's method such as FOCE
and LaplaceI
.
We will expose the details of these two levels of optimization below.
Fixed Effects Optimization Options
When running fit
we usually see information about the progress of the optimization procedure as it moves along. This information is called the trace
and we print it by default. The trace can be used to monitor that the objective function is decreasing and the inf-norm of the gradient is also printed (that is, the absolute value of the largest gradient element). In our case, the objective is the negative log-likelihood and we want to minimize it.
Mathematically speaking there is no difference between maximizing the log-likelihood and minimizing the negative log-likelihood. Most optimization theory and software follows the convention of approaching optimization from a minimization stand point.
When running a fit we see the trace first:
[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
Iter Function value Gradient norm
0 4.658424e+01 1.432126e+01
* time: 0.016541004180908203
1 4.588237e+01 7.059232e+00
* time: 0.24493002891540527
2 4.562448e+01 3.745015e-01
* time: 0.24531102180480957
3 4.562379e+01 1.612342e-02
* time: 0.24547505378723145
4 4.562379e+01 2.656493e-05
* time: 0.24556899070739746
and then we see the printout of the results based on the FittedPumasModel
FittedPumasModel
Successful minimization: true
Likelihood approximation: FOCE
Likelihood Optimizer: BFGS
Dynamical system type: Closed form
Log-likelihood value: -45.623789
Number of subjects: 10
Number of parameters: Fixed Optimized
2 1
Observation records: Active Missing
dv: 20 0
Total: 20 0
------------------
Estimate
------------------
θ₁ 0.36476
Ω₁,₁ 0.04
σ 0.1
------------------
As we can see, the final value of the objective in the trace is 4.562379e+01
and the printed log-likelihood in the fit results is -45.623789
. This shows the point that we minimize the negative log-likelihood and later print the actual log-likelihood.
The solution to our optimization problem is a parameter vector such that in the neighborhood of that vector there are no other parameter vectors that give us a lower objective value. For a differentiable objective function the minimum is found at a place where the gradient is zero and the function is locally convex. This is why we monitor that the objective is decreasing and we also look at the gradient elements to see that they are getting small. There is a big caveat to mention and that is that the objective and gradients do not necessarily have a natural scale by which we can measure it for convergence. For a population of ten subjects we can evaluate the log-likelihood and the gradient of the log-likelihood. If we take ten identical subjects and add them to the population we now have a log-likelihood that is twice as high and the gradient elements are doubled as well. This tells us that we should in principle adjust our tolerances to the problem even if it can be quite hard to know what the acceptable values of the stopping tolerances are from any given problem. We can only base it off of experience and trial-and-error.
Even if it can be hard to know what the tolerances should be it is important to know how to do so. If the optimization fails or stalls it is also important to know what options are available to progress.
Setting optim_options
Any mention of change in, or values of, parameters are referring to the transformed variables. Since it is possible to put lower and upper bounds on parameters and restrict other parameters to be positive definite we need to transform the fixed effects into a domain that can be used with unconstrained optimization.
The most useful settings are generally found in the optim_options
. This is where we control what information is printed, what the stopping tolerances are, or add constraints on the number of iterations we allow if we are afraid the fit might otherwise run for too long. Many of these parameters are directly related to the output that can be found by accessing the optim
field of a fit
result. The full list of options can be found in the Optim documentation. The ones that are relevant to a Pumas user calling are (with default values in square brackets []) listed below. The possible termination criteria are:
x_abstol [= 0.0]
terminate if the largest parameter change is less thanx_abstol
x_reltol [= 0.0]
terminate if the largest change relative to the previous parameter value is less thanx_reltol
f_abstol [= 0.0]
terminate if the objective change is less thanf_abstol
f_reltol [= 0.0]
terminate if the objective change relative to the previous objective value is less thanf_reltol
g_abstol [= 1e-5]
terminate if the largest absolute gradient element is less thang_abstol
iterations [= 1_000]
terminate if the number of iterations exceediterations
time_limit [= NaN]
terminate if the time since starting the optimization exceedstime_limit
at the start of an iteration (no termination during iterations)
It is possible to control what is printed as part of the trace with the following settings:
show_trace [= false]
whether to show the trace during fittingstore_trace [= false]
whether to store the values printed during fitting in an object to be fetched laterextended_trace [= false]
whether to show and store more than just the objective function and gradient norm. This includes the full gradient vector, the Hessian approximation, as well as the step length for line search based methods.show_warnings [= true]
whether to print warnings about NaNs in gradientsshow_every [= 1]
controls how often to print the information
Setting optim_alg
Setting the optimization algorithm can affect the final parameters in datasets and with models that may be challenging due to poor scaling of the dynamical model, hard to identify parameters, and so on. If there are multiple minima one algorithm may by chance go to one minimum and another might go to another minimum depending on how they generate their search direction and step proposals. The standard algorithm in Pumas is the BFGS
implementation in Optim.jl
. To be completely specific, the default is:
Optim.BFGS(
# Use a simple line search strategy
linesearch = Optim.LineSearches.BackTracking(),
# Make sure that first step isn't too large
initial_stepnorm = 1.0,
)
The two default settings for linesearch
and initial_stepnorm
are chosen according to practical experience.
A numerical optimization algorithm based on a line search will perform a sequence of steps that are out of scope for this Pumas documentation. However, there is something we can say on a high level about the choice of line search. BFGS has been shown to have good properties in a wide range of models theoretically and in practice. Based on the current iterate (parameter vector), objective value, gradient and the history of gradients and iterates it proposes a search direction and then performs what is called a line search. This means that it searches in a straight line defined by the current value and the search direction. A step length of 1 would then make the new proposal equal to the current iterate plus the search direction (sum of two vectors). However, this step may overstep the minimum such that the objective value is actually higher at this new proposal or the decrease may not be as small as we may have expected given our gradient. The point of the line search procedure is then to explore what improvements we may get if we replace x + s
where x
is the current iterate and s
is the search direction with x + α * s
. The line search algorithm controls how α
is varied to search for a good next iterate x'
. BackTracking
is a simple strategy that works well in practice and is relatively cheap per iteration because it does not require additional gradients to be calculated. However, theoretically speaking BFGS requires a more strict line search algorithm to work. If you experience issues with convergence it may be worth setting the linesearch
keyword to something like StrongWolfe()
:
Optim.BFGS(
# Use a simple line search strategy
linesearch = Optim.LineSearches.StrongWolfe(),
# Make sure that first step isn't too large
initial_stepnorm = 1.0,
)
The other option that we set by default is initial_stepnorm = 1.0
. This is because the objective value and gradient elements can often be quite large in NLME models unless the starting value is close to the optimum. This sometimes leads to very extreme first steps that can lead to failure or that the optimization algorithm moves to bad regions of the parameter domain that can be hard to get out of again. After the Hessian is updated we get more information about the scale of the problem and the steps should become less extreme.
Stalled or failed optimizations
Sometimes we may observe that our fit stopped with large gradient elements or with several iterations in a row that produce only very small improvements to the objective value. The returned solution may be the best solution we can find in numerical terms if the problem is close to being (practically) unidentifiable, if there is numerical noise in the solution, or other properties that makes the minimum hard to pin down. Sometimes the optimization procedure can simply have ended up with a bad approximation to the inverse Hessian through the BFGS
updates.
No matter what the reason is for the early stopping or stalling of the fit we can try some different things to see if our model cannot converge anyway. Before proceeding we should of course always make sure that our data is correctly parsed and that our model is correctly coded up. If we are confident that we have no mistakes in the data or model we can try the following.
Assume that the original, failing fit was performed as follows
result1 = fit(model, population, param0, FOCE())
Then, we can proceed in a few different ways depending on what might be wrong. It may be possible that the starting values should be changed. There are various ways to change the starting values. If the new guess is called param1
we simply try:
result2 = fit(model, population, param1, FOCE())
This time we may see the results we want, we may see the same issue we originally saw, or something else might go wrong. If the new starting value did not improve the situation, we may go back to our original fit and try another approach.
If we had originally observed that the progress in the trace stalled we may try to simply restart the optimization at the final parameters. If the reference is result1
from above this can be achieved as follows:
result3 = fit(model, population, coef(result1), FOCE())
This will start at the final fixed effects of the previous fit, but it will reset the optimization state. This may help the optimization procedure escape a bad state. Restarting the fit this way also restarts the empirical bayes estimates. Depending on the situation this may or may not be what you want. The alternative way of restarting is to use the following signature that will reuse both final fixed effects and empirical bayes estimates but still restarts the optimization related state:
result4 = fit(results1; reset_optim = true)
If the optimization still does not terminate to your satisfaction it may be that there are some issues with the model and/or data or it may simply not be possible to optimize the parameters to the requested precision.
Empirical Bayes Estimates Optimization Options
In practice, setting good options for the inner optimization of the empirical bayes estimate (EBE) optimization problem can be hard. Since we perform one optimization per subject per outer iteration the information is simply hard to process and analyze when considering a full fit. Consider turning on the inner trace and imagine how much information will be printed to the screen.
The inner optimization algorithm is a trust region based algorithm that is different to the outer optimization default. In the inner optimization we have access to a relatively cheap positive definite approximation to the Hessian that we use in the trust region based algorithm. If this approximation is not too far from the true Hessian we can get rapid convergence in the inner optimizations and that is what we often see. Sometimes, the inner optimizations can spend an excess amount of time compared to what we might expect. This can happen if the positive definiteness of the Hessian does not match the properties of the real Hessian. It is not always easy to diagnose this issue, but if each iteration is taking longer than expected it is possible to temporarily turn on the inner trace and see if the iteration numbers are getting quite high. Since we're using (approximately) Newton's method we often expect quite few iterations especially as the outer optimization algorithm converges. If that is not the case it may sometimes help to set the inner optimization to Optim.BFGS
or Optim.LBFGS
instead after using Optim
.
The inner optimization is done to be able to compute a cheap integral over the random effects using Laplace's method and the various formulas and calculations depend on the solution to be found in the interior, but often we will just assume that the optimum is indeed found as long as our model is relatively well-behaved.
There are many known pitfalls to the interior optimization. For example, the interior optimization is not necessarily continuous or differentiable if the model contains things such as lag
s. The interior problem may also have several local-minima when lag
s or other "time altering" parameters are part of the specification.
It is generally not advised to tweak the inner optimization options too much, but if there is a need to do it it is possible. The interface for controlling these options is the same as for the outer layer of optimization, but instead of giving the options to fit
the optim_alg
and optim_options
can be given to the approximation method constructor:
fit(
model,
population,
parameters,
FOCE(; optim_options = (; g_tol = 1e-8));
optim_options = (; g_tol = 1e-2),
)
The above would set the inner gradient tolerance to 1e-8 and the outer gradient tolerance to 1e-2. Notice, that in general the inner optimization problem should not be solved too loosely as it will introduce further numerical error (in addition to the existing approximation error) in the calculations of the log-likelihood and derived quantities.
For the fixed effects optimization we suggested starting from multiple starting values. This is hard in the inner optimization problem even if the same logic could be applied here. Currently, it is only possible to change the path of EBEs by changing the starting parameters to fit
as the sequence of inner optimizations will also change because it is conditional on the fixed effects parameters.
Mapping NONMEM $ESTIMATION
concepts to Pumas options
If you are moving from NONMEM to Pumas or if you are comparing models it may be useful to know how to set similar settings when fitting. Before we move to the more granular settings we can consider the METHOD
option.
In NONMEM, the METHOD=
options are used to control the approximation method. In Pumas, we write:
fit(model, population, parameters, FO())
fit(model, population, parameters, FOCE())
fit(model, population, parameters, LaplaceI())
To compare this with NONMEM consider the list of METHOD
or METH
settings:
The INTERACTION
and NOINTERACTION
option is not applicable in Pumas. In Pumas, the "interaction" is always included. I.e. if the variability in the error model depends on random effects then this effect is taken into account in the approximation of the marginal likelihood.
0
orZERO
is the same asFO()
in Pumas.METHOD=1
orMETHOD=CONDITIONAL
orMETH=COND
orMETH=COND NOLAPLACIAN
isFOCE()
.METHOD=CONDITIONAL LAPLACE
orMETH=COND LAPLACE
isLaplaceI()
. Alternative toMETH=COND LAPLACE
users may writeLAPLACIAN
in NONMEM.
The setting of CENTERING
is not supported in Pumas. This is supposed to be a method that penalizes eta
s to be close to 0 which is suggested in the NONMEM documentation to be useful in some situations where there is model misspecification and the mean of the empirical bayes estimates are far from zero.
Further settings that users may be familiar with from NONMEM include:
NOABORT
is a NONMEM setting that allows the optimization to recover from problematic theta values. Pumas has a similar handling of parameters that lead to model solution errors. Infit
it is possible to setignore_numerical_error = false
and you will see the error message instead of having Pumas try to recover from the bad parameters.MAXEVALS
is a NONMEM setting that allows the user to restrict the number of objective function evaluations to allow. In Pumas, there is not a directly comparable version of this because of the way function calls are counted. It is possible to constrain the number of evaluations through theiterations
counter that will restrict how many times we allow for an update of the search direction and a search for the next solution candidate. Similar toMAXEVALS=0
we can setoptim_options = (;iterations=0,)
and then only the empirical bayes estimates will be calculated, the fixed effects will not be optimized. A more direct approach is to work with functions such asloglikelihood
,empirical_bayes
, andpredict
.LNTWOPI
Is an option that allows for the inclusion fo the constant termN*LOG(2pi)
where N is the number of normally distributed data. In Pumas we always include the constant.PARAFILE=filename
orPARAFILE=ON
/PARAFILE=OFF
turns parallelization on and off or set according to a specification infilename
. In Pumas, we control the parallelization through theensemblealg
keyword tofit
. The default isEnsembleThreads()
butEnsembleSerial()
(no parallelization) orEnsembleDistributed()
(distributed computing) can also be set.PRINT=n
is similar to theshow_every
option mentioned earlier in this document.SIGDIGITS=n
is described as "Number of significant digits required in the final parameter estimate." in NONMEM. Defaults to3
. In Pumas, a similar setting isoptim_options=(; x_reltol=1e-3,)
.
A setting that is set in $ESTIMATION
in NONMEM that does not belong in optim_options
in Pumas is ATOL
. ATOL
controls the absolute tolerance of the solution of the dynamical system. In NONMEM, the ATOL
is deprecated $ESTIMATION
as well and is meant to be set in $SUBROUTINE
or $TOL
instead. In Pumas, the place to set this setting is also in the differential equation options. This means that it should be set in diffeq_options
and the default value is abstol=1e-12
which corresponds exactly to the default NONMEM setting of 12
(note, in NONMEM the exponent is set while in Pumas the number itself is set).
Other options to $ESTIMATION
do not belong in the fit
step of Pumas.
POSTHOC
controls if empirical bayes estimates /ETA
s are estimated or not after the fit is complete. In Pumas, useempirical_bayes
andicoef
functionality for this instead. These functions can be called in isolation or as part of the diagnostic functioninspect
that calculates and collects many quantities that are relevant to model diagnostics.LIKELIHOOD
allows the user to communicate thatY
orF
is to be interpreted in a special way, mostly for non-continuous observed responses. In Pumas, all aspect of the the likelihood contributions happen through the@derived
error model specification, so theLIKELIHOOD
option does not apply in Pumas. If an appropriate distribution does not exist it is possible to ask for it to be provided in Pumas or to implement it yourself as a distribution rather than just writing up some likelihood contribution.-2LOGLIKELIHOOD
as above.NOCOV=[0|1]
in$ESTIMATION
and$COV
settings that control covariance matrix calculations in general. In Pumas, this is not part iffit
under any circumstance as we delegate this to a later part of the workflow using theinfer
function that often follows theinspect
aspect of the workflow.
Marginal Likelihood Approximations
The following choices are available for the likelihood approximations:
FO()
: first order approximation.FOCE()
: first order conditional estimation (with automatic detection for when to include the interaction terms).LaplaceI()
: second order Laplace approximation with interaction.
Only LaplaceI()
is supported when using a PumasEMModel
:
StatsAPI.fit
— Methodfit(
model::PumasEMModel,
population::Population,
param::NamedTuple,
approx::Union{SAEM,LaplaceI};
ensemblealg = EnsembleThreads(),
rng = default_rng()
)
Fit the PumasEMModel
to the dataset population
using the initial values specified in param
. The supported methods for the approx
argument are currently SAEM
and LaplaceI
. See the online documentation for more details about these two methods.
When fitting with SAEM
the variance terms of the random effects (Ω
) and the dispersion parameters for the error model (σ
) are initialized to the identity matrix or 1
as appropriate. They may also be specified. With SAEM, it is recommended to choose an init larger than the true values to facilitate exploration of the parameter space and avoid getting trapped in local optima early. Currently LaplaceI
fits with a PumasEMModel
require Ω
to be diagonal.
Options for ensemblealg
are EnsembleSerial()
and EnsembleThreads()
(the default). The fit will be parallel if ensemblealg == EnsembleThreads()
. SAEM imposes the additional requirement for threaded fits that either the rng
used supports Future.randjump
or that rng
be a vector of rngs, one per thread. The default_rng()
supports randjump()
.
FittedPumasModel
The relevant fields of a FittedPumasModel
are:
model
: themodel
used in the estimation process.data
: thePopulation
that was estimated.optim
: the result returned by the optimizer.approx
: the marginal likelihood approximation that was used.param
: the optimal parameters.
Bayesian Estimation
Bayesian parameter estimation is performed by using the fit
function as follows:
StatsAPI.fit
— MethodDistributions.fit(
model::PumasModel,
data::Population,
param::NamedTuple,
alg::BayesMCMC,
)
Samples from the posterior distribution of the model
's population and subject-specific parameters given the observations in data
using the Hamiltonian Monte Carlo (HMC) based No-U-Turn Sampler (NUTS) algorithm. The MCMC sampler is initialized starting from param
for the population parameters. The subject-specific parameters are also apprioriately initialized. The Bayesian inference algorithm's options are passed in the alg
struct. Run ?BayesMCMC
for more details on the options that can be set.
We use a NUTS
sampler with the generalized no-U-turn termination criterion and multinomial sampling on an Hamiltonian system whose kinetic energy is specified with a diagonal metric (diagonal matrix with positive diagonal entries). For numerical integration of the Hamiltonian system, we use the ordinary leapfrog integrator. The adaptation is done using the Stan’s windowed adaptation routine with a default target acceptance ratio of 0.8
.
The result is a BayesMCMCResults
type.
BayesMCMCResults
The MCMC chain is stored in the chain
field of the returned BayesMCMCResults
. Additionally, the result can be converted into a Chains
object from MCMCChains
, allowing utilization of diagnostics and visualization tooling. This is discussed further in the Bayesian Workflow tutorial.
Maximum A Posteriori (MAP) Estimation
MAP estimation is carried out using the fit
function and by wrapping the chosen likelihood approximation in the MAP
type. The fit
signature is the same as for maximum likelihood. If we have a model (model
), a Population
(data
), and an initial parameter specification (param
) we can carry out the MAP estimation with the FOCE approximation of the log-likelihood using
result = fit(model, data, param, MAP(FOCE()))
The usual post-processing and diagnostics are available and are calculated according to the approximation wrapped in MAP
.