Data Representation

Pumas datasets are convenient to work with; for example, they allow character values in fields, which can make the dataset content clearer. Like NM-TRAN formatted datasets, they are organized as rows of event records, with dose and observation events on separate records that are sorted by time within each individual. Pumas supports most variables and other aspects of NM-TRAN formatted datasets.

An important difference exists for datasets with multiple types of dependent variables (DV; e.g., a dataset with two analytes, or with both PK and PD observations). NM-TRAN formatted datasets with multiple DV types adopt a long format, where all dependent variable values are stored in a single column. In contrast, Pumas datasets require the DV values of different types to be in separate columns, i.e., a wide format. If there is a single DV type in the dataset, the NM-TRAN formatted data should generally work in Pumas without modification.

An NM-TRAN dataset with multiple DV types uses two columns to identify each observation: the DV value and its type. The type can be defined by the CMT variable or by a user-provided variable such as DVID. In this NM-TRAN format, DV is the observation measurement value and DVID is the type of the observation measurement, often an integer. A Pumas dataset does not require a variable like DVID, since the DV values of each type occur in different columns.

Pumas supports multiple DVs, but it does not accept them in a single column: each DV must have its own column. Converting the NM-TRAN format to a Pumas-compatible format is a common operation in data wrangling workflows and is usually called pivot wider (the reverse operation, from wide to long, is pivot longer). When we have a dataset where one column holds several observation types, i.e. "long" format, we need to pivot it so that each observation type has its own column, i.e. "wide" format.
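As a minimal sketch of the reshaping itself, assuming only DataFrames.jl, the following uses a small table with made-up values. The names long_demo and wide_demo are hypothetical and exist only for this illustration.

```julia
using DataFrames

# Hypothetical long-format records: a single DV column, with DVID
# naming the observation type of each row.
long_demo = DataFrame(
    ID = [1, 1, 1, 1],
    TIME = [1, 1, 2, 2],
    DVID = [1, 2, 1, 2],
    DV = [9.2, 49.0, 8.5, 32.0],
)

# Long to wide: each distinct DVID value becomes its own column
# (DV_1 and DV_2); rows that share ID and TIME collapse into one row.
wide_demo = unstack(long_demo, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))
```

Here the four long records become two wide rows, one per time point, each holding both observation types.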

To illustrate how to convert an NM-TRAN dataset with multiple DVs into a Pumas dataset, we will use the following NM-TRAN formatted example, which has no DVID column but distinguishes the DV types by the compartment in the CMT column:

using DataFramesMeta

df = DataFrame(;
    ID = 1,
    TIME = repeat([0; 24:12:48; 72:24:120]; inner = 2),
    DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
    CMT = repeat([1, 2]; outer = 7),
    EVID = [1; repeat([0], 13)],
    AMT = [100; repeat([missing], 13)],
    WT = 66.7,
    AGE = 50,
    SEX = 1,
)
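14×9 DataFrame
 Row │ ID     TIME   DV        CMT    EVID   AMT      WT       AGE    SEX
     │ Int64  Int64  Float64?  Int64  Int64  Int64?   Float64  Int64  Int64
─────┼─────────────────────────────────────────────────────────────────────
   1 │     1      0   missing      1      1      100     66.7     50      1
   2 │     1      0     100.0      2      0  missing     66.7     50      1
   3 │     1     24       9.2      1      0  missing     66.7     50      1
   4 │     1     24      49.0      2      0  missing     66.7     50      1
   5 │     1     36       8.5      1      0  missing     66.7     50      1
   6 │     1     36      32.0      2      0  missing     66.7     50      1
   7 │     1     48       6.4      1      0  missing     66.7     50      1
   8 │     1     48      26.0      2      0  missing     66.7     50      1
   9 │     1     72       4.8      1      0  missing     66.7     50      1
  10 │     1     72      22.0      2      0  missing     66.7     50      1
  11 │     1     96       3.1      1      0  missing     66.7     50      1
  12 │     1     96      28.0      2      0  missing     66.7     50      1
  13 │     1    120       2.5      1      0  missing     66.7     50      1
  14 │     1    120      33.0      2      0  missing     66.7     50      1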

We need to do some data wrangling before we convert the DataFrame into a wide format. The :CMT column needs to hold the dosing compartment (a positive value) for dosing events (EVID == 1) and missing values for measurement events (EVID == 0). But before that, let's duplicate :CMT as a new :DVID column, since it tells us which DV type each row belongs to:

@transform! df :DVID = :CMT
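14×10 DataFrame
 Row │ ID     TIME   DV        CMT    EVID   AMT      WT       AGE    SEX    DVID
     │ Int64  Int64  Float64?  Int64  Int64  Int64?   Float64  Int64  Int64  Int64
─────┼────────────────────────────────────────────────────────────────────────────
   1 │     1      0   missing      1      1      100     66.7     50      1      1
   2 │     1      0     100.0      2      0  missing     66.7     50      1      2
   3 │     1     24       9.2      1      0  missing     66.7     50      1      1
   4 │     1     24      49.0      2      0  missing     66.7     50      1      2
   5 │     1     36       8.5      1      0  missing     66.7     50      1      1
   6 │     1     36      32.0      2      0  missing     66.7     50      1      2
   7 │     1     48       6.4      1      0  missing     66.7     50      1      1
   8 │     1     48      26.0      2      0  missing     66.7     50      1      2
   9 │     1     72       4.8      1      0  missing     66.7     50      1      1
  10 │     1     72      22.0      2      0  missing     66.7     50      1      2
  11 │     1     96       3.1      1      0  missing     66.7     50      1      1
  12 │     1     96      28.0      2      0  missing     66.7     50      1      2
  13 │     1    120       2.5      1      0  missing     66.7     50      1      1
  14 │     1    120      33.0      2      0  missing     66.7     50      1      2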

And we reassign the :CMT column values:

@rtransform! df :CMT = :EVID == 1 ? :CMT : missing
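14×10 DataFrame
 Row │ ID     TIME   DV        CMT      EVID   AMT      WT       AGE    SEX    DVID
     │ Int64  Int64  Float64?  Int64?   Int64  Int64?   Float64  Int64  Int64  Int64
─────┼──────────────────────────────────────────────────────────────────────────────
   1 │     1      0   missing        1      1      100     66.7     50      1      1
   2 │     1      0     100.0  missing      0  missing     66.7     50      1      2
   3 │     1     24       9.2  missing      0  missing     66.7     50      1      1
   4 │     1     24      49.0  missing      0  missing     66.7     50      1      2
   5 │     1     36       8.5  missing      0  missing     66.7     50      1      1
   6 │     1     36      32.0  missing      0  missing     66.7     50      1      2
   7 │     1     48       6.4  missing      0  missing     66.7     50      1      1
   8 │     1     48      26.0  missing      0  missing     66.7     50      1      2
   9 │     1     72       4.8  missing      0  missing     66.7     50      1      1
  10 │     1     72      22.0  missing      0  missing     66.7     50      1      2
  11 │     1     96       3.1  missing      0  missing     66.7     50      1      1
  12 │     1     96      28.0  missing      0  missing     66.7     50      1      2
  13 │     1    120       2.5  missing      0  missing     66.7     50      1      1
  14 │     1    120      33.0  missing      0  missing     66.7     50      1      2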
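The same two steps can also be written without the DataFramesMeta macros. The following is a sketch under the assumption that only DataFrames.jl is loaded; it uses a shortened, illustrative copy of the dataset, and demo is a hypothetical name.

```julia
using DataFrames

# Shortened stand-in for the dataset above (first four records only).
demo = DataFrame(
    ID = 1,
    TIME = [0, 0, 24, 24],
    DV = [missing, 100.0, 9.2, 49.0],
    CMT = [1, 2, 1, 2],
    EVID = [1, 0, 0, 0],
    AMT = [100, missing, missing, missing],
)

# Step 1: keep the observation type in its own column before CMT changes.
demo.DVID = copy(demo.CMT)

# Step 2: keep CMT on dosing rows only; observation rows become missing.
# Broadcasting ifelse widens the column element type to allow missing.
demo.CMT = ifelse.(demo.EVID .== 1, demo.CMT, missing)
```

The macro version reads more like the prose description, while the broadcast version avoids the extra dependency; both are intended to produce the same columns.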

To convert this DataFrame into the wide format that Pumas expects, call the unstack function from the DataFrames.jl package. The first positional argument of unstack is the DataFrame that you want to "unstack" (make wider). The second positional argument is the column key (colkey): the column whose unique values become the new columns. In our case this is :DVID. The third positional argument is the value column, which stores the values that fill the new columns. In our case this is :DV. In this three-argument form, all remaining columns act as row keys, so records that agree on every other column are merged into a single row. Finally, you can pass an anonymous function to the keyword argument renamecols to specify how unstack names the new columns when converting your data to a wide format:

wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))
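8×10 DataFrame
 Row │ ID     TIME   CMT      EVID   AMT      WT       AGE    SEX    DV_1      DV_2
     │ Int64  Int64  Int64?   Int64  Int64?   Float64  Int64  Int64  Float64?  Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │     1      0        1      1      100     66.7     50      1   missing   missing
   2 │     1      0  missing      0  missing     66.7     50      1   missing     100.0
   3 │     1     24  missing      0  missing     66.7     50      1       9.2      49.0
   4 │     1     36  missing      0  missing     66.7     50      1       8.5      32.0
   5 │     1     48  missing      0  missing     66.7     50      1       6.4      26.0
   6 │     1     72  missing      0  missing     66.7     50      1       4.8      22.0
   7 │     1     96  missing      0  missing     66.7     50      1       3.1      28.0
   8 │     1    120  missing      0  missing     66.7     50      1       2.5      33.0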
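To make the role of each argument concrete, the sketch below compares the three-argument call with the four-argument form of unstack, in which the row keys are written out explicitly. It assumes only DataFrames.jl and uses a shortened, illustrative dataset; demo, rename_dv, implicit, and explicit are hypothetical names.

```julia
using DataFrames

# Shortened stand-in for the prepared dataset (CMT already set to missing
# on observation rows, DVID already added).
demo = DataFrame(
    ID = 1,
    TIME = [0, 0, 24, 24],
    DV = [missing, 100.0, 9.2, 49.0],
    CMT = [1, missing, missing, missing],
    EVID = [1, 0, 0, 0],
    AMT = [100, missing, missing, missing],
    DVID = [1, 2, 1, 2],
)

rename_dv = x -> Symbol(:DV_, x)

# Three-argument form: every column except :DVID and :DV is a row key.
implicit = unstack(demo, :DVID, :DV; renamecols = rename_dv)

# Four-argument form: the same row keys, stated explicitly.
explicit = unstack(demo, Not([:DVID, :DV]), :DVID, :DV; renamecols = rename_dv)

# The two calls are expected to give the same table.
isequal(implicit, explicit)
```

The dosing record and the first observation share TIME = 0 but differ in CMT, EVID, and AMT, so they stay on separate rows, while the two records at TIME = 24 are merged.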

This also works if the measurement times of the different DVs do not match. See this altered example, and notice that the time values for the different DVs no longer coincide:

df = DataFrame(;
    ID = 1,
    TIME = [0, 0, 12, 24, 32, 36, 44, 48, 66, 72, 90, 96, 112, 120],
    DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
    CMT = repeat([1, 2]; outer = 7),
    EVID = [1; repeat([0], 13)],
    AMT = [100; repeat([missing], 13)],
    WT = 66.7,
    AGE = 50,
    SEX = 1,
)
14×9 DataFrame
 Row │ ID     TIME   DV        CMT    EVID   AMT      WT       AGE    SEX
     │ Int64  Int64  Float64?  Int64  Int64  Int64?   Float64  Int64  Int64
─────┼─────────────────────────────────────────────────────────────────────
   1 │     1      0   missing      1      1      100     66.7     50      1
   2 │     1      0     100.0      2      0  missing     66.7     50      1
   3 │     1     12       9.2      1      0  missing     66.7     50      1
   4 │     1     24      49.0      2      0  missing     66.7     50      1
   5 │     1     32       8.5      1      0  missing     66.7     50      1
   6 │     1     36      32.0      2      0  missing     66.7     50      1
   7 │     1     44       6.4      1      0  missing     66.7     50      1
   8 │     1     48      26.0      2      0  missing     66.7     50      1
   9 │     1     66       4.8      1      0  missing     66.7     50      1
  10 │     1     72      22.0      2      0  missing     66.7     50      1
  11 │     1     90       3.1      1      0  missing     66.7     50      1
  12 │     1     96      28.0      2      0  missing     66.7     50      1
  13 │     1    112       2.5      1      0  missing     66.7     50      1
  14 │     1    120      33.0      2      0  missing     66.7     50      1

@chain df begin
    @transform! :DVID = :CMT
    @rtransform! :CMT = :EVID == 1 ? :CMT : missing
end
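14×10 DataFrame
 Row │ ID     TIME   DV        CMT      EVID   AMT      WT       AGE    SEX    DVID
     │ Int64  Int64  Float64?  Int64?   Int64  Int64?   Float64  Int64  Int64  Int64
─────┼──────────────────────────────────────────────────────────────────────────────
   1 │     1      0   missing        1      1      100     66.7     50      1      1
   2 │     1      0     100.0  missing      0  missing     66.7     50      1      2
   3 │     1     12       9.2  missing      0  missing     66.7     50      1      1
   4 │     1     24      49.0  missing      0  missing     66.7     50      1      2
   5 │     1     32       8.5  missing      0  missing     66.7     50      1      1
   6 │     1     36      32.0  missing      0  missing     66.7     50      1      2
   7 │     1     44       6.4  missing      0  missing     66.7     50      1      1
   8 │     1     48      26.0  missing      0  missing     66.7     50      1      2
   9 │     1     66       4.8  missing      0  missing     66.7     50      1      1
  10 │     1     72      22.0  missing      0  missing     66.7     50      1      2
  11 │     1     90       3.1  missing      0  missing     66.7     50      1      1
  12 │     1     96      28.0  missing      0  missing     66.7     50      1      2
  13 │     1    112       2.5  missing      0  missing     66.7     50      1      1
  14 │     1    120      33.0  missing      0  missing     66.7     50      1      2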

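Now we use the same unstack call as above. Because no two observation records share the same time, no rows are merged: the result keeps all 14 rows, each with at most one non-missing DV value: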

wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))
14×10 DataFrame
 Row │ ID     TIME   CMT      EVID   AMT      WT       AGE    SEX    DV_1      DV_2
     │ Int64  Int64  Int64?   Int64  Int64?   Float64  Int64  Int64  Float64?  Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │     1      0        1      1      100     66.7     50      1   missing   missing
   2 │     1      0  missing      0  missing     66.7     50      1   missing     100.0
   3 │     1     12  missing      0  missing     66.7     50      1       9.2   missing
   4 │     1     24  missing      0  missing     66.7     50      1   missing      49.0
   5 │     1     32  missing      0  missing     66.7     50      1       8.5   missing
   6 │     1     36  missing      0  missing     66.7     50      1   missing      32.0
   7 │     1     44  missing      0  missing     66.7     50      1       6.4   missing
   8 │     1     48  missing      0  missing     66.7     50      1   missing      26.0
   9 │     1     66  missing      0  missing     66.7     50      1       4.8   missing
  10 │     1     72  missing      0  missing     66.7     50      1   missing      22.0
  11 │     1     90  missing      0  missing     66.7     50      1       3.1   missing
  12 │     1     96  missing      0  missing     66.7     50      1   missing      28.0
  13 │     1    112  missing      0  missing     66.7     50      1       2.5   missing
  14 │     1    120  missing      0  missing     66.7     50      1   missing      33.0
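Before passing a wide table like this on, it can help to confirm that it follows the conventions used in this section: dosing rows carry a compartment and an amount, and observation rows carry neither. The following is a minimal sketch assuming only DataFrames.jl; events_consistent is a hypothetical helper written for this illustration, not a Pumas function, and demo_wide is a shortened, illustrative table.

```julia
using DataFrames

# Shortened stand-in shaped like the wide table above.
demo_wide = DataFrame(
    ID = 1,
    TIME = [0, 0, 12, 24],
    CMT = [1, missing, missing, missing],
    EVID = [1, 0, 0, 0],
    AMT = [100, missing, missing, missing],
    DV_1 = [missing, missing, 9.2, missing],
    DV_2 = [missing, 100.0, missing, 49.0],
)

# Hypothetical helper (not part of Pumas): returns true when rows with
# EVID == 1 have non-missing CMT and AMT, and all other rows have both missing.
function events_consistent(df)
    dose = df.EVID .== 1
    doses_ok = all(!ismissing, df.CMT[dose]) && all(!ismissing, df.AMT[dose])
    obs_ok = all(ismissing, df.CMT[.!dose]) && all(ismissing, df.AMT[.!dose])
    return doses_ok && obs_ok
end

events_consistent(demo_wide)
```

A check like this only covers the conventions described here; it does not replace the validation that the dataset parser performs itself.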

This wide_df can be easily parsed into a Population using read_pumas:

using Pumas

pop = read_pumas(
    wide_df;
    id = :ID,
    time = :TIME,
    evid = :EVID,
    amt = :AMT,
    cmt = :CMT,
    observations = [:DV_1, :DV_2],
)
Population
  Subjects: 1
  Observations: DV_1, DV_2