Estimating Parameters of Pumas Models

Pumas can use the observational data of a Subject or Population to estimate the parameters in many types of models. This is done by two classes of methods. First, maximum likelihood methods find the parameters such that the observational data has the highest probability of occurring according to the chosen error distributions. Second, Bayesian methods find a posterior probability distribution for the parameters to describe the chance that a parameter has a given value given the data. The following section describes how to fit an NLME model in Pumas via the two methods.

Maximum likelihood methods find the parameters such that the observational data has the highest probability of occurring according to the chosen error distributions.
Bayesian methods find a posterior probability distribution for the parameters to describe the chance that a parameter has a given value given the data.
Maximum A Posteriori (MAP) can be considered as something in between maximum likelihood estimation and Bayesian inference. It incorporates priors of some or all variables and can be interpreted as penalized maximum likelihood or finding the mode of the posterior distribution in a Bayesian context.

The following sections describe how to fit a NLME model in Pumas via the two methods.

Defining Data for Estimation

The observed data should be parsed using the same names as those found in the model. For example, if subject.observations is a NamedTuple with names dv and resp, the @derived block or function in the PumasModel should define distributions with matching names. Similarly, the error block should define distributions with matching names if using a PumasEMModel instead. If dv is a scalar in the observation data, then dv from @derived should also be a scalar. Likewise, if dv is an array like a time series, then dv should be a size-matching time series when returned from @derived. The likelihood of observing multiple dependent variables is calculated under the assumption of independence between the two.

Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is performed using the fit function:

StatsAPI.fit — Method

fit(
  model::PumasModel,
  population::Population,
  param::NamedTuple,
  approx::Union{LikelihoodApproximation, MAP};
  optim_alg = Optim.BFGS(linesearch = Optim.LineSearches.BackTracking(), initial_stepnorm = 1.0),
  optim_options::NamedTuple = NamedTuple(),
  optimize_fn = nothing,
  constantcoef::Tuple = (),
  ensemblealg::SciMLBase.EnsembleAlgorithm = EnsembleThreads(),
  checkidentification = true,
  diffeq_options = NamedTuple(),
  verbose = true,
  ignore_numerical_error = true,
  )

Fit the Pumas model model to the dataset population with starting values param using the estimation method approx. Currently supported values for the approx argument are FO, FOCE, LaplaceI, NaivePooled, and BayesMCMC. See the online documentation for more details about the different methods.

To control the optimization procedure for finding population parameters (fixed effects), use the optim_alg and optim_options keyword arguments. In previous versions of Pumas the argument optimize_fn was used, but is now discouraged and will be removed in a later version of Pumas. These options control the optimization all approx methods except BayesMCMC. The default optimization function uses the quasi-Newton routine BFGS method from the Optim package. It can be changed by setting the optim_alg to an algorithm implemented in Optim.jl as long as it does not use second order derivatives. Optimization specific options can be passed in using the optim_options keyword and has to be a NamedTuple with names and values that match the Optim.Options type. For example, the optimization trace can be disabled and the algorithm can be changed to L-BFGS by passing optim_alg=Optim.LBFGS(), optim_options = ;(show_trace=false) to fit. See Optim for more defails.

It is possible to fix one or more parameters of the fit by passing a Tuple of Symbols as the constantcoef argument with elements corresponding to the names of the fixed parameters, e.g. constantcoef=(:σ,).

When models include an @random block and fitting with NaivePooled is requested, it is required that the user sets all variability parameters to zero with constantcoef such that these can be ignored in the optimization, e.g. constantcoef = (:Ω,) while overwriting the corresponding values in param with (; init_params..., Ω = zeros(3, 3)).

Parallelization of the optimization is supported for most estimation methods via the ensemble interface of DifferentialEquations.jl. Currently, the only supported options are:

EnsembleThreads(): the default. Accelerate fits by using multiple threads.
EnsembleSerial(): fit using a single thread.
EnsembleDistributed(): fit by using multiple worker processes.

The fit function will check if any gradients and throw an exception if any of the elements are exactly zero unless checkidentification is set to false.

Further keyword arguments can be passed via the diffeq_options argument. This allows for passing arguments to the differential equations solver such as alg, abstol, and reltol. The default values for these are AutoVern7(Rodas5P(autodiff=true)), 1e-12, and 1e-8 respectively. See the DifferentialEquations.jl documentation for more details.

The keyword verbose controls if info statements about initial evaluations of the loglikelihood function and gradient should be printed or not. Defaults to true.

Since the numerical optimization routine can sometimes visit extreme regions of the parameter space during it's exploration we automatically handle the situation where the input parameters result in errors when solving the dynamical system. This allows the algorithm to recover and continue in many situations that would have otherwise stalled early. Sometimes, it is useful to turn this error handling off when debugging a model fit, and this can be done by setting ignore_numerical_error = false.

The return type of fit is a FittedPumasModel.

Optimization option

As can be seen from the docstring above, we can set options related to optimization of the parameters using optim_alg and optim_options for the fitting methods that are based on optimization of fixed effects (population parameters). Methods that do not require such maximization over fixed effects would be BayesMCMC and SAEM. Additionally, there is an inner level of optimization for the methods based on Laplace's method such as FOCE and LaplaceI.

We will expose the details of these two levels of optimization below.

Fixed Effects Optimization Options

When running fit we usually see information about the progress of the optimization procedure as it moves along. This information is called the trace and we print it by default. The trace can be used to monitor that the objective function is decreasing and the inf-norm of the gradient is also printed (that is, the absolute value of the largest gradient element). In our case, the objective is the negative log-likelihood and we want to minimize it.

Mathematically speaking there is no difference between maximizing the log-likelihood and minimizing the negative log-likelihood. Most optimization theory and software follows the convention of approaching optimization from a minimization stand point.

When running a fit we see the trace first:

[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
Iter     Function value   Gradient norm 
    0     4.658424e+01     1.432126e+01
* time: 0.016541004180908203
    1     4.588237e+01     7.059232e+00
* time: 0.24493002891540527
    2     4.562448e+01     3.745015e-01
* time: 0.24531102180480957
    3     4.562379e+01     1.612342e-02
* time: 0.24547505378723145
    4     4.562379e+01     2.656493e-05
* time: 0.24556899070739746

and then we see the printout of the results based on the FittedPumasModel

FittedPumasModel

Successful minimization:                      true

Likelihood approximation:                     FOCE
Likelihood Optimizer:                         BFGS
Dynamical system type:                 Closed form

Log-likelihood value:                   -45.623789
Number of subjects:                             10
Number of parameters:         Fixed      Optimized
                                  2              1
Observation records:         Active        Missing
    dv:                          20              0
    Total:                       20              0

------------------
        Estimate
------------------
θ₁        0.36476
Ω₁,₁      0.04
σ         0.1
------------------

As we can see, the final value of the objective in the trace is 4.562379e+01 and the printed log-likelihood in the fit results is -45.623789. This shows the point that we minimize the negative log-likelihood and later print the actual log-likelihood.

The solution to our optimization problem is a parameter vector such that in the neighborhood of that vector there are no other parameter vectors that give us a lower objective value. For a differentiable objective function the minimum is found at a place where the gradient is zero and the function is locally convex. This is why we monitor that the objective is decreasing and we also look at the gradient elements to see that they are getting small. There is a big caveat to mention and that is that the objective and gradients do not necessarily have a natural scale by which we can measure it for convergence. For a population of ten subjects we can evaluate the log-likelihood and the gradient of the log-likelihood. If we take ten identical subjects and add them to the population we now have a log-likelihood that is twice as high and the gradient elements are doubled as well. This tells us that we should in principle adjust our tolerances to the problem even if it can be quite hard to know what the acceptable values of the stopping tolerances are from any given problem. We can only base it off of experience and trial-and-error.

Even if it can be hard to know what the tolerances should be it is important to know how to do so. If the optimization fails or stalls it is also important to know what options are available to progress.

Setting `optim_options`

Note

Any mention of change in, or values of, parameters are referring to the transformed variables. Since it is possible to put lower and upper bounds on parameters and restrict other parameters to be positive definite we need to transform the fixed effects into a domain that can be used with unconstrained optimization.

The most useful settings are generally found in the optim_options. This is where we control what information is printed, what the stopping tolerances are, or add constraints on the number of iterations we allow if we are afraid the fit might otherwise run for too long. Many of these parameters are directly related to the output that can be found by accessing the optim field of a fit result. The full list of options can be found in the Optim documentation. The ones that are relevant to a Pumas user calling are (with default values in square brackets []) listed below. The possible termination criteria are:

x_abstol [= 0.0] terminate if the largest parameter change is less than x_abstol
x_reltol [= 0.0] terminate if the largest change relative to the previous parameter value is less than x_reltol
f_abstol [= 0.0] terminate if the objective change is less than f_abstol
f_reltol [= 0.0] terminate if the objective change relative to the previous objective value is less than f_reltol
g_abstol [= 1e-5] terminate if the largest absolute gradient element is less than g_abstol
iterations [= 1_000] terminate if the number of iterations exceed iterations
time_limit [= NaN] terminate if the time since starting the optimization exceeds time_limit at the start of an iteration (no termination during iterations)

It is possible to control what is printed as part of the trace with the following settings:

show_trace [= false] whether to show the trace during fitting
store_trace [= false] whether to store the values printed during fitting in an object to be fetched later
extended_trace [= false] whether to show and store more than just the objective function and gradient norm. This includes the full gradient vector, the Hessian approximation, as well as the step length for line search based methods.
show_warnings [= true] whether to print warnings about NaNs in gradients
show_every [= 1] controls how often to print the information

Setting `optim_alg`

Setting the optimization algorithm can affect the final parameters in datasets and with models that may be challenging due to poor scaling of the dynamical model, hard to identify parameters, and so on. If there are multiple minima one algorithm may by chance go to one minimum and another might go to another minimum depending on how they generate their search direction and step proposals. The standard algorithm in Pumas is the BFGS implementation in Optim.jl. To be completely specific, the default is:

Optim.BFGS(
    # Use a simple line search strategy
    linesearch = Optim.LineSearches.BackTracking(),
    # Make sure that first step isn't too large
    initial_stepnorm = 1.0,
)

The two default settings for linesearch and initial_stepnorm are chosen according to practical experience.

A numerical optimization algorithm based on a line search will perform a sequence of steps that are out of scope for this Pumas documentation. However, there is something we can say on a high level about the choice of line search. BFGS has been shown to have good properties in a wide range of models theoretically and in practice. Based on the current iterate (parameter vector), objective value, gradient and the history of gradients and iterates it proposes a search direction and then performs what is called a line search. This means that it searches in a straight line defined by the current value and the search direction. A step length of 1 would then make the new proposal equal to the current iterate plus the search direction (sum of two vectors). However, this step may overstep the minimum such that the objective value is actually higher at this new proposal or the decrease may not be as small as we may have expected given our gradient. The point of the line search procedure is then to explore what improvements we may get if we replace x + s where x is the current iterate and s is the search direction with x + α * s. The line search algorithm controls how α is varied to search for a good next iterate x'. BackTracking is a simple strategy that works well in practice and is relatively cheap per iteration because it does not require additional gradients to be calculated. However, theoretically speaking BFGS requires a more strict line search algorithm to work. If you experience issues with convergence it may be worth setting the linesearch keyword to something like StrongWolfe():

Optim.BFGS(
    # Use a simple line search strategy
    linesearch = Optim.LineSearches.StrongWolfe(),
    # Make sure that first step isn't too large
    initial_stepnorm = 1.0,
)

The other option that we set by default is initial_stepnorm = 1.0. This is because the objective value and gradient elements can often be quite large in NLME models unless the starting value is close to the optimum. This sometimes leads to very extreme first steps that can lead to failure or that the optimization algorithm moves to bad regions of the parameter domain that can be hard to get out of again. After the Hessian is updated we get more information about the scale of the problem and the steps should become less extreme.

Stalled or failed optimizations

Sometimes we may observe that our fit stopped with large gradient elements or with several iterations in a row that produce only very small improvements to the objective value. The returned solution may be the best solution we can find in numerical terms if the problem is close to being (practically) unidentifiable, if there is numerical noise in the solution, or other properties that makes the minimum hard to pin down. Sometimes the optimization procedure can simply have ended up with a bad approximation to the inverse Hessian through the BFGS updates.

No matter what the reason is for the early stopping or stalling of the fit we can try some different things to see if our model cannot converge anyway. Before proceeding we should of course always make sure that our data is correctly parsed and that our model is correctly coded up. If we are confident that we have no mistakes in the data or model we can try the following.

Assume that the original, failing fit was performed as follows

result1 = fit(model, population, param0, FOCE())

Then, we can proceed in a few different ways depending on what might be wrong. It may be possible that the starting values should be changed. There are various ways to change the starting values. If the new guess is called param1 we simply try:

result2 = fit(model, population, param1, FOCE())

This time we may see the results we want, we may see the same issue we originally saw, or something else might go wrong. If the new starting value did not improve the situation, we may go back to our original fit and try another approach.

If we had originally observed that the progress in the trace stalled we may try to simply restart the optimization at the final parameters. If the reference is result1 from above this can be achieved as follows:

result3 = fit(model, population, coef(result1), FOCE())

This will start at the final fixed effects of the previous fit, but it will reset the optimization state. This may help the optimization procedure escape a bad state. Restarting the fit this way also restarts the empirical bayes estimates. Depending on the situation this may or may not be what you want. The alternative way of restarting is to use the following signature that will reuse both final fixed effects and empirical bayes estimates but still restarts the optimization related state:

result4 = fit(results1; reset_optim = true)

If the optimization still does not terminate to your satisfaction it may be that there are some issues with the model and/or data or it may simply not be possible to optimize the parameters to the requested precision.

Empirical Bayes Estimates Optimization Options

In practice, setting good options for the inner optimization of the empirical bayes estimate (EBE) optimization problem can be hard. Since we perform one optimization per subject per outer iteration the information is simply hard to process and analyze when considering a full fit. Consider turning on the inner trace and imagine how much information will be printed to the screen.

Note

The inner optimization algorithm is a trust region based algorithm that is different to the outer optimization default. In the inner optimization we have access to a relatively cheap positive definite approximation to the Hessian that we use in the trust region based algorithm. If this approximation is not too far from the true Hessian we can get rapid convergence in the inner optimizations and that is what we often see. Sometimes, the inner optimizations can spend an excess amount of time compared to what we might expect. This can happen if the positive definiteness of the Hessian does not match the properties of the real Hessian. It is not always easy to diagnose this issue, but if each iteration is taking longer than expected it is possible to temporarily turn on the inner trace and see if the iteration numbers are getting quite high. Since we're using (approximately) Newton's method we often expect quite few iterations especially as the outer optimization algorithm converges. If that is not the case it may sometimes help to set the inner optimization to Optim.BFGS or Optim.LBFGS instead after using Optim.

The inner optimization is done to be able to compute a cheap integral over the random effects using Laplace's method and the various formulas and calculations depend on the solution to be found in the interior, but often we will just assume that the optimum is indeed found as long as our model is relatively well-behaved.

Note

There are many known pitfalls to the interior optimization. For example, the interior optimization is not necessarily continuous or differentiable if the model contains things such as lags. The interior problem may also have several local-minima when lags or other "time altering" parameters are part of the specification.

It is generally not advised to tweak the inner optimization options too much, but if there is a need to do it it is possible. The interface for controlling these options is the same as for the outer layer of optimization, but instead of giving the options to fit the optim_alg and optim_options can be given to the approximation method constructor:

fit(
    model,
    population,
    parameters,
    FOCE(; optim_options = (; g_tol = 1e-8));
    optim_options = (; g_tol = 1e-2),
)

The above would set the inner gradient tolerance to 1e-8 and the outer gradient tolerance to 1e-2. Notice, that in general the inner optimization problem should not be solved too loosely as it will introduce further numerical error (in addition to the existing approximation error) in the calculations of the log-likelihood and derived quantities.

For the fixed effects optimization we suggested starting from multiple starting values. This is hard in the inner optimization problem even if the same logic could be applied here. Currently, it is only possible to change the path of EBEs by changing the starting parameters to fit as the sequence of inner optimizations will also change because it is conditional on the fixed effects parameters.

Mapping NONMEM `$ESTIMATION` concepts to Pumas options

If you are moving from NONMEM to Pumas or if you are comparing models it may be useful to know how to set similar settings when fitting. Before we move to the more granular settings we can consider the METHOD option.

In NONMEM, the METHOD= options are used to control the approximation method. In Pumas, we write:

fit(model, population, parameters, FO())
fit(model, population, parameters, FOCE())
fit(model, population, parameters, LaplaceI())

To compare this with NONMEM consider the list of METHOD or METH settings:

Note

The INTERACTION and NOINTERACTION option is not applicable in Pumas. In Pumas, the "interaction" is always included. I.e. if the variability in the error model depends on random effects then this effect is taken into account in the approximation of the marginal likelihood.

0 or ZERO is the same as FO() in Pumas.
METHOD=1 or METHOD=CONDITIONAL or METH=COND or METH=COND NOLAPLACIAN is FOCE().
METHOD=CONDITIONAL LAPLACE or METH=COND LAPLACE is LaplaceI(). Alternative to METH=COND LAPLACE users may write LAPLACIAN in NONMEM.

The setting of CENTERING is not supported in Pumas. This is supposed to be a method that penalizes etas to be close to 0 which is suggested in the NONMEM documentation to be useful in some situations where there is model misspecification and the mean of the empirical bayes estimates are far from zero.

Further settings that users may be familiar with from NONMEM include:

NOABORT is a NONMEM setting that allows the optimization to recover from problematic theta values. Pumas has a similar handling of parameters that lead to model solution errors. In fit it is possible to set ignore_numerical_error = false and you will see the error message instead of having Pumas try to recover from the bad parameters.
MAXEVAL is a NONMEM setting that allows the user to restrict the number of objective function evaluations to allow. In Pumas, there is not a directly comparable version of this because of the way function calls are counted. It is possible to constrain the number of evaluations through the iterations counter that will restrict how many times we allow for an update of the search direction and a search for the next solution candidate. Similar to MAXEVAL=0 we can set optim_options = (;iterations=0,) and then only the empirical bayes estimates will be calculated, the fixed effects will not be optimized. A more direct approach is to work with functions such as loglikelihood, empirical_bayes, and predict.
LNTWOPI Is an option that allows for the inclusion fo the constant term N*LOG(2pi) where N is the number of normally distributed data. In Pumas we always include the constant.
PARAFILE=filename or PARAFILE=ON/PARAFILE=OFF turns parallelization on and off or set according to a specification in filename. In Pumas, we control the parallelization through the ensemblealg keyword to fit. The default is EnsembleThreads() but EnsembleSerial() (no parallelization) or EnsembleDistributed() (distributed computing) can also be set.
PRINT=n is similar to the show_every option mentioned earlier in this document.
SIGDIGITS=n is described as "Number of significant digits required in the final parameter estimate." in NONMEM. Defaults to 3. In Pumas, a similar setting is optim_options=(; x_reltol=1e-3,).

A setting that is set in $ESTIMATION in NONMEM that does not belong in optim_options in Pumas is ATOL. ATOL controls the absolute tolerance of the solution of the dynamical system. In NONMEM, the ATOL is deprecated $ESTIMATION as well and is meant to be set in $SUBROUTINE or $TOL instead. In Pumas, the place to set this setting is also in the differential equation options. This means that it should be set in diffeq_options and the default value is abstol=1e-12 which corresponds exactly to the default NONMEM setting of 12 (note, in NONMEM the exponent is set while in Pumas the number itself is set).

Other options to $ESTIMATION do not belong in the fit step of Pumas.

POSTHOC controls if empirical bayes estimates / ETAs are estimated or not after the fit is complete. In Pumas, use empirical_bayes and icoef functionality for this instead. These functions can be called in isolation or as part of the diagnostic function inspect that calculates and collects many quantities that are relevant to model diagnostics.
LIKELIHOOD allows the user to communicate that Y or F is to be interpreted in a special way, mostly for non-continuous observed responses. In Pumas, all aspect of the the likelihood contributions happen through the @derived error model specification, so the LIKELIHOOD option does not apply in Pumas. If an appropriate distribution does not exist it is possible to ask for it to be provided in Pumas or to implement it yourself as a distribution rather than just writing up some likelihood contribution.
-2LOGLIKELIHOOD as above.
NOCOV=[0|1] in $ESTIMATION and $COV settings that control covariance matrix calculations in general. In Pumas, this is not part if fit under any circumstance as we delegate this to a later part of the workflow using the infer function that often follows the inspect aspect of the workflow.

Marginal Likelihood Approximations

The following choices are available for the likelihood approximations:

FO(): first order approximation.
FOCE(): first order conditional estimation (with automatic detection for when to include the interaction terms).
LaplaceI(): second order Laplace approximation with interaction.

Only LaplaceI() is supported when using a PumasEMModel:

StatsAPI.fit — Method

fit(
  model::PumasEMModel,
  population::Population,
  param::NamedTuple,
  approx::Union{SAEM,LaplaceI};
  ensemblealg = EnsembleThreads(),
  rng = default_rng()
)

Fit the PumasEMModel to the dataset population using the initial values specified in param. The supported methods for the approx argument are currently SAEM and LaplaceI. See the online documentation for more details about these two methods.

When fitting with SAEM the variance terms of the random effects (Ω) and the dispersion parameters for the error model (σ) are initialized to the identity matrix or 1 as appropriate. They may also be specified. With SAEM, it is recommended to choose an init larger than the true values to facilitate exploration of the parameter space and avoid getting trapped in local optima early. Currently LaplaceI fits with a PumasEMModel require Ω to be diagonal.

Options for ensemblealg are EnsembleSerial() and EnsembleThreads() (the default). The fit will be parallel if ensemblealg == EnsembleThreads(). SAEM imposes the additional requirement for threaded fits that either the rng used supports Future.randjump or that rng be a vector of rngs, one per thread. The default_rng() supports randjump().

FittedPumasModel

The relevant fields of a FittedPumasModel are:

model: the model used in the estimation process.
data: the Population that was estimated.
optim: the result returned by the optimizer.
approx: the marginal likelihood approximation that was used.
param: the optimal parameters.

Bayesian Estimation

Bayesian parameter estimation is performed by using the fit function as follows:

StatsAPI.fit — Method

Distributions.fit(
  model::PumasModel,
  data::Population,
  param::NamedTuple,
  alg::BayesMCMC,
)

Samples from the posterior distribution of the model's population and subject-specific parameters given the observations in data using the Hamiltonian Monte Carlo (HMC) based No-U-Turn Sampler (NUTS) algorithm. The MCMC sampler is initialized starting from param for the population parameters. The subject-specific parameters are also apprioriately initialized. The Bayesian inference algorithm's options are passed in the alg struct. Run ?BayesMCMC for more details on the options that can be set.

Info

We use a NUTS sampler with the generalized no-U-turn termination criterion and multinomial sampling on an Hamiltonian system whose kinetic energy is specified with a diagonal metric (diagonal matrix with positive diagonal entries). For numerical integration of the Hamiltonian system, we use the ordinary leapfrog integrator. The adaptation is done using the Stan’s windowed adaptation routine with a default target acceptance ratio of 0.8.

The result is a BayesMCMCResults type.

`BayesMCMCResults`

The MCMC chain is stored in the chain field of the returned BayesMCMCResults. Additionally, the result can be converted into a Chains object from MCMCChains, allowing utilization of diagnostics and visualization tooling. This is discussed further in the Bayesian Workflow tutorial.

Maximum A Posteriori (MAP) Estimation

MAP estimation is carried out using the fit function and by wrapping the chosen likelihood approximation in the MAP type. The fit signature is the same as for maximum likelihood. If we have a model (model), a Population (data), and an initial parameter specification (param) we can carry out the MAP estimation with the FOCE approximation of the log-likelihood using

result = fit(model, data, param, MAP(FOCE()))

The usual post-processing and diagnostics are available and are calculated according to the approximation wrapped in MAP.