Data Representation
Pumas datasets are convenient, for example, allowing character values in fields which can add clarity to dataset content. Like NM-TRAN formatted datasets, they are organized with rows of data representing event records, with dose and observation events occurring on different records that are sorted in sequence of time within the records of an individual. Pumas supports most variables and other aspects of NM-TRAN formatted datasets. An important difference exists for datasets with multiple types of dependent variables (DV; e.g. a dataset with two analytes or both PK and PD observation types). NM-TRAN formatted datasets with multiple DV types adopt a long format, where the dependent variable values are stored in a single column. In contrast, Pumas datasets require the DV values of different data types to be in separate columns, i.e, a wide format. If there is a singe DV type in the dataset, the NM-TRAN formatted data should work quite easily in Pumas without modification. In the case of an NM-TRAN dataset with multiple DV types, the formatted dataset would have two columns to identify the 'DV' value, and it's type. The type can be defined by the CMT variable, or by a user provided variable such as DVID. In this NM-TRAN format: DV is the observation measurement value and DVID is type of the observation measurement, often an integer. For the Pumas dataset, a variable like DVID is not required since the DV values occur in different columns.
In Pumas, we can support multiple DVs, but it does not accept multiple DVs in a single column. Pumas expects that each DV has its own column. The operation of converting the NM-TRAN format to a Pumas compatible format is common in data wrangling workflows and is named pivot longer. When we have a dataset where one column has multiple observations, i.e. "long" format, we need to pivot it to a state where each observation has its own column, i.e. "wide" format.
To illustrate how to convert multiple DVs NM-TRAN dataset into Pumas, we will use the following example of a NM-TRAN formatted dataset which doesn't have DVID, but different compartments in the CMT column:
using DataFramesMeta
df = DataFrame(;
ID = 1,
TIME = repeat([0; 24:12:48; 72:24:120]; inner = 2),
DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
CMT = repeat([1, 2]; outer = 7),
EVID = [1; repeat([0], 13)],
AMT = [100; repeat([missing], 13)],
WT = 66.7,
AGE = 50,
SEX = 1,
)| Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX |
|---|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | Float64? | Int64 | Int64 | Int64? | Float64 | Int64 | Int64 | |
| 1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 |
| 2 | 1 | 0 | 100.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 3 | 1 | 24 | 9.2 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 4 | 1 | 24 | 49.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 5 | 1 | 36 | 8.5 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 6 | 1 | 36 | 32.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 7 | 1 | 48 | 6.4 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 8 | 1 | 48 | 26.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 9 | 1 | 72 | 4.8 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 10 | 1 | 72 | 22.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 11 | 1 | 96 | 3.1 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 12 | 1 | 96 | 28.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 13 | 1 | 120 | 2.5 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 14 | 1 | 120 | 33.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
We need to do some data wrangling before we convert the DataFrame into a wide format. The :CMT column needs to have positive amounts for dosing events (EVID == 1) and missing values for measurements events (EVID == 0). But before that, let's duplicate the :CMT as :DVID column, since it has the information of which DV type this row belongs to:
@transform! df :DVID = :CMT| Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX | DVID |
|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | Float64? | Int64 | Int64 | Int64? | Float64 | Int64 | Int64 | Int64 | |
| 1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 | 1 |
| 2 | 1 | 0 | 100.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 3 | 1 | 24 | 9.2 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 4 | 1 | 24 | 49.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 5 | 1 | 36 | 8.5 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 6 | 1 | 36 | 32.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 7 | 1 | 48 | 6.4 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 8 | 1 | 48 | 26.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 9 | 1 | 72 | 4.8 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 10 | 1 | 72 | 22.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 11 | 1 | 96 | 3.1 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 12 | 1 | 96 | 28.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 13 | 1 | 120 | 2.5 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 14 | 1 | 120 | 33.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
And we reassign the :CMT column values:
@rtransform! df :CMT = :EVID == 1 ? :CMT : missing| Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX | DVID |
|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | Float64? | Int64? | Int64 | Int64? | Float64 | Int64 | Int64 | Int64 | |
| 1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 | 1 |
| 2 | 1 | 0 | 100.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 3 | 1 | 24 | 9.2 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 4 | 1 | 24 | 49.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 5 | 1 | 36 | 8.5 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 6 | 1 | 36 | 32.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 7 | 1 | 48 | 6.4 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 8 | 1 | 48 | 26.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 9 | 1 | 72 | 4.8 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 10 | 1 | 72 | 22.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 11 | 1 | 96 | 3.1 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 12 | 1 | 96 | 28.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 13 | 1 | 120 | 2.5 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 14 | 1 | 120 | 33.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
To convert this DataFrame into a wide format that Pumas expects, you'll need to call the unstack function from the DataFrames.jl package. The first positional argument of unstack is the DataFrame that you'd want to "unstack" (make it wider). The second position argument is the rowkey columns. These are the columns with a unique key for each row. In our case this is just one column: :DVID. The third positional argument is the value columns. Analogously, these are the columns where the values of the variable to unstack are stored. In our case again this is just one column: :DV. Finally, you can also use an anonymous function to the keyword argument renamecols to specify how unstack will rename the new columns when converting your data to a wide format:
wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))| Row | ID | TIME | CMT | EVID | AMT | WT | AGE | SEX | DV_1 | DV_2 |
|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | Int64? | Int64 | Int64? | Float64 | Int64 | Int64 | Float64? | Float64? | |
| 1 | 1 | 0 | 1 | 1 | 100 | 66.7 | 50 | 1 | missing | missing |
| 2 | 1 | 0 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 100.0 |
| 3 | 1 | 24 | missing | 0 | missing | 66.7 | 50 | 1 | 9.2 | 49.0 |
| 4 | 1 | 36 | missing | 0 | missing | 66.7 | 50 | 1 | 8.5 | 32.0 |
| 5 | 1 | 48 | missing | 0 | missing | 66.7 | 50 | 1 | 6.4 | 26.0 |
| 6 | 1 | 72 | missing | 0 | missing | 66.7 | 50 | 1 | 4.8 | 22.0 |
| 7 | 1 | 96 | missing | 0 | missing | 66.7 | 50 | 1 | 3.1 | 28.0 |
| 8 | 1 | 120 | missing | 0 | missing | 66.7 | 50 | 1 | 2.5 | 33.0 |
This would also work if the measurement times between the multiple DVs were mismatching. See this altered example and notice that the time value for the different DVs do not match anymore:
df = DataFrame(;
ID = 1,
TIME = [0, 0, 12, 24, 32, 36, 44, 48, 66, 72, 90, 96, 112, 120],
DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
CMT = repeat([1, 2]; outer = 7),
EVID = [1; repeat([0], 13)],
AMT = [100; repeat([missing], 13)],
WT = 66.7,
AGE = 50,
SEX = 1,
)| Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX |
|---|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | Float64? | Int64 | Int64 | Int64? | Float64 | Int64 | Int64 | |
| 1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 |
| 2 | 1 | 0 | 100.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 3 | 1 | 12 | 9.2 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 4 | 1 | 24 | 49.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 5 | 1 | 32 | 8.5 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 6 | 1 | 36 | 32.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 7 | 1 | 44 | 6.4 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 8 | 1 | 48 | 26.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 9 | 1 | 66 | 4.8 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 10 | 1 | 72 | 22.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 11 | 1 | 90 | 3.1 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 12 | 1 | 96 | 28.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 13 | 1 | 112 | 2.5 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 14 | 1 | 120 | 33.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
@chain df begin
@transform! :DVID = :CMT
@rtransform! :CMT = :EVID == 1 ? :CMT : missing
end| Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX | DVID |
|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | Float64? | Int64? | Int64 | Int64? | Float64 | Int64 | Int64 | Int64 | |
| 1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 | 1 |
| 2 | 1 | 0 | 100.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 3 | 1 | 12 | 9.2 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 4 | 1 | 24 | 49.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 5 | 1 | 32 | 8.5 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 6 | 1 | 36 | 32.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 7 | 1 | 44 | 6.4 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 8 | 1 | 48 | 26.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 9 | 1 | 66 | 4.8 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 10 | 1 | 72 | 22.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 11 | 1 | 90 | 3.1 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 12 | 1 | 96 | 28.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 13 | 1 | 112 | 2.5 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 14 | 1 | 120 | 33.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
Now we use the same unstack call as above:
wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))| Row | ID | TIME | CMT | EVID | AMT | WT | AGE | SEX | DV_1 | DV_2 |
|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | Int64? | Int64 | Int64? | Float64 | Int64 | Int64 | Float64? | Float64? | |
| 1 | 1 | 0 | 1 | 1 | 100 | 66.7 | 50 | 1 | missing | missing |
| 2 | 1 | 0 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 100.0 |
| 3 | 1 | 12 | missing | 0 | missing | 66.7 | 50 | 1 | 9.2 | missing |
| 4 | 1 | 24 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 49.0 |
| 5 | 1 | 32 | missing | 0 | missing | 66.7 | 50 | 1 | 8.5 | missing |
| 6 | 1 | 36 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 32.0 |
| 7 | 1 | 44 | missing | 0 | missing | 66.7 | 50 | 1 | 6.4 | missing |
| 8 | 1 | 48 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 26.0 |
| 9 | 1 | 66 | missing | 0 | missing | 66.7 | 50 | 1 | 4.8 | missing |
| 10 | 1 | 72 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 22.0 |
| 11 | 1 | 90 | missing | 0 | missing | 66.7 | 50 | 1 | 3.1 | missing |
| 12 | 1 | 96 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 28.0 |
| 13 | 1 | 112 | missing | 0 | missing | 66.7 | 50 | 1 | 2.5 | missing |
| 14 | 1 | 120 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 33.0 |
This wide_df can be easily parsed into a Population using read_pumas:
using Pumas
pop = read_pumas(
wide_df;
id = :ID,
time = :TIME,
evid = :EVID,
amt = :AMT,
cmt = :CMT,
observations = [:DV_1, :DV_2],
)Population
Subjects: 1
Observations: DV_1, DV_2