Data Representation
Pumas datasets are convenient, for example, allowing character values in fields which can add clarity to dataset content. Like NM-TRAN formatted datasets, they are organized with rows of data representing event records, with dose and observation events occurring on different records that are sorted in sequence of time within the records of an individual. Pumas supports most variables and other aspects of NM-TRAN formatted datasets. An important difference exists for datasets with multiple types of dependent variables (DV; e.g. a dataset with two analytes or both PK and PD observation types). NM-TRAN formatted datasets with multiple DV types adopt a long format, where the dependent variable values are stored in a single column. In contrast, Pumas datasets require the DV values of different data types to be in separate columns, i.e, a wide format. If there is a singe DV type in the dataset, the NM-TRAN formatted data should work quite easily in Pumas without modification. In the case of an NM-TRAN dataset with multiple DV types, the formatted dataset would have two columns to identify the 'DV' value, and it's type. The type can be defined by the CMT variable, or by a user provided variable such as DVID
. In this NM-TRAN format: DV
is the observation measurement value and DVID
is type of the observation measurement, often an integer. For the Pumas dataset, a variable like DVID is not required since the DV values occur in different columns.
In Pumas, we can support multiple DV
s, but it does not accept multiple DV
s in a single column. Pumas expects that each DV
has its own column. The operation of converting the NM-TRAN format to a Pumas compatible format is common in data wrangling workflows and is named pivot longer. When we have a dataset where one column has multiple observations, i.e. "long" format, we need to pivot it to a state where each observation has its own column, i.e. "wide" format.
To illustrate how to convert multiple DV
s NM-TRAN dataset into Pumas, we will use the following example of a NM-TRAN formatted dataset which doesn't have DVID
, but different compartments in the CMT
column:
using DataFramesMeta
df = DataFrame(;
ID = 1,
TIME = repeat([0; 24:12:48; 72:24:120]; inner = 2),
DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
CMT = repeat([1, 2]; outer = 7),
EVID = [1; repeat([0], 13)],
AMT = [100; repeat([missing], 13)],
WT = 66.7,
AGE = 50,
SEX = 1,
)
Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX |
---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64? | Int64 | Int64 | Int64? | Float64 | Int64 | Int64 | |
1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 |
2 | 1 | 0 | 100.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
3 | 1 | 24 | 9.2 | 1 | 0 | missing | 66.7 | 50 | 1 |
4 | 1 | 24 | 49.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
5 | 1 | 36 | 8.5 | 1 | 0 | missing | 66.7 | 50 | 1 |
6 | 1 | 36 | 32.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
7 | 1 | 48 | 6.4 | 1 | 0 | missing | 66.7 | 50 | 1 |
8 | 1 | 48 | 26.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
9 | 1 | 72 | 4.8 | 1 | 0 | missing | 66.7 | 50 | 1 |
10 | 1 | 72 | 22.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
11 | 1 | 96 | 3.1 | 1 | 0 | missing | 66.7 | 50 | 1 |
12 | 1 | 96 | 28.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
13 | 1 | 120 | 2.5 | 1 | 0 | missing | 66.7 | 50 | 1 |
14 | 1 | 120 | 33.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
We need to do some data wrangling before we convert the DataFrame
into a wide format. The :CMT
column needs to have positive amounts for dosing events (EVID == 1
) and missing
values for measurements events (EVID == 0
). But before that, let's duplicate the :CMT
as :DVID
column, since it has the information of which DV
type this row belongs to:
@transform! df :DVID = :CMT
Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX | DVID |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64? | Int64 | Int64 | Int64? | Float64 | Int64 | Int64 | Int64 | |
1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 | 1 |
2 | 1 | 0 | 100.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
3 | 1 | 24 | 9.2 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
4 | 1 | 24 | 49.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
5 | 1 | 36 | 8.5 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
6 | 1 | 36 | 32.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
7 | 1 | 48 | 6.4 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
8 | 1 | 48 | 26.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
9 | 1 | 72 | 4.8 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
10 | 1 | 72 | 22.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
11 | 1 | 96 | 3.1 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
12 | 1 | 96 | 28.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
13 | 1 | 120 | 2.5 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
14 | 1 | 120 | 33.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
And we reassign the :CMT
column values:
@rtransform! df :CMT = :EVID == 1 ? :CMT : missing
Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX | DVID |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64? | Int64? | Int64 | Int64? | Float64 | Int64 | Int64 | Int64 | |
1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 | 1 |
2 | 1 | 0 | 100.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
3 | 1 | 24 | 9.2 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
4 | 1 | 24 | 49.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
5 | 1 | 36 | 8.5 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
6 | 1 | 36 | 32.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
7 | 1 | 48 | 6.4 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
8 | 1 | 48 | 26.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
9 | 1 | 72 | 4.8 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
10 | 1 | 72 | 22.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
11 | 1 | 96 | 3.1 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
12 | 1 | 96 | 28.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
13 | 1 | 120 | 2.5 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
14 | 1 | 120 | 33.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
To convert this DataFrame
into a wide format that Pumas expects, you'll need to call the unstack
function from the DataFrames.jl
package. The first positional argument of unstack
is the DataFrame
that you'd want to "unstack" (make it wider). The second position argument is the rowkey columns. These are the columns with a unique key for each row. In our case this is just one column: :DVID
. The third positional argument is the value columns. Analogously, these are the columns where the values of the variable to unstack are stored. In our case again this is just one column: :DV
. Finally, you can also use an anonymous function to the keyword argument renamecols
to specify how unstack
will rename the new columns when converting your data to a wide format:
wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))
Row | ID | TIME | CMT | EVID | AMT | WT | AGE | SEX | DV_1 | DV_2 |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Int64? | Int64 | Int64? | Float64 | Int64 | Int64 | Float64? | Float64? | |
1 | 1 | 0 | 1 | 1 | 100 | 66.7 | 50 | 1 | missing | missing |
2 | 1 | 0 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 100.0 |
3 | 1 | 24 | missing | 0 | missing | 66.7 | 50 | 1 | 9.2 | 49.0 |
4 | 1 | 36 | missing | 0 | missing | 66.7 | 50 | 1 | 8.5 | 32.0 |
5 | 1 | 48 | missing | 0 | missing | 66.7 | 50 | 1 | 6.4 | 26.0 |
6 | 1 | 72 | missing | 0 | missing | 66.7 | 50 | 1 | 4.8 | 22.0 |
7 | 1 | 96 | missing | 0 | missing | 66.7 | 50 | 1 | 3.1 | 28.0 |
8 | 1 | 120 | missing | 0 | missing | 66.7 | 50 | 1 | 2.5 | 33.0 |
This would also work if the measurement times between the multiple DV
s were mismatching. See this altered example and notice that the time value for the different DV
s do not match anymore:
df = DataFrame(;
ID = 1,
TIME = [0, 0, 12, 24, 32, 36, 44, 48, 66, 72, 90, 96, 112, 120],
DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
CMT = repeat([1, 2]; outer = 7),
EVID = [1; repeat([0], 13)],
AMT = [100; repeat([missing], 13)],
WT = 66.7,
AGE = 50,
SEX = 1,
)
Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX |
---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64? | Int64 | Int64 | Int64? | Float64 | Int64 | Int64 | |
1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 |
2 | 1 | 0 | 100.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
3 | 1 | 12 | 9.2 | 1 | 0 | missing | 66.7 | 50 | 1 |
4 | 1 | 24 | 49.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
5 | 1 | 32 | 8.5 | 1 | 0 | missing | 66.7 | 50 | 1 |
6 | 1 | 36 | 32.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
7 | 1 | 44 | 6.4 | 1 | 0 | missing | 66.7 | 50 | 1 |
8 | 1 | 48 | 26.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
9 | 1 | 66 | 4.8 | 1 | 0 | missing | 66.7 | 50 | 1 |
10 | 1 | 72 | 22.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
11 | 1 | 90 | 3.1 | 1 | 0 | missing | 66.7 | 50 | 1 |
12 | 1 | 96 | 28.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
13 | 1 | 112 | 2.5 | 1 | 0 | missing | 66.7 | 50 | 1 |
14 | 1 | 120 | 33.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
@chain df begin
@transform! :DVID = :CMT
@rtransform! :CMT = :EVID == 1 ? :CMT : missing
end
Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX | DVID |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64? | Int64? | Int64 | Int64? | Float64 | Int64 | Int64 | Int64 | |
1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 | 1 |
2 | 1 | 0 | 100.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
3 | 1 | 12 | 9.2 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
4 | 1 | 24 | 49.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
5 | 1 | 32 | 8.5 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
6 | 1 | 36 | 32.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
7 | 1 | 44 | 6.4 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
8 | 1 | 48 | 26.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
9 | 1 | 66 | 4.8 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
10 | 1 | 72 | 22.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
11 | 1 | 90 | 3.1 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
12 | 1 | 96 | 28.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
13 | 1 | 112 | 2.5 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
14 | 1 | 120 | 33.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
Now we use the same unstack
call as above:
wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))
Row | ID | TIME | CMT | EVID | AMT | WT | AGE | SEX | DV_1 | DV_2 |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Int64? | Int64 | Int64? | Float64 | Int64 | Int64 | Float64? | Float64? | |
1 | 1 | 0 | 1 | 1 | 100 | 66.7 | 50 | 1 | missing | missing |
2 | 1 | 0 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 100.0 |
3 | 1 | 12 | missing | 0 | missing | 66.7 | 50 | 1 | 9.2 | missing |
4 | 1 | 24 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 49.0 |
5 | 1 | 32 | missing | 0 | missing | 66.7 | 50 | 1 | 8.5 | missing |
6 | 1 | 36 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 32.0 |
7 | 1 | 44 | missing | 0 | missing | 66.7 | 50 | 1 | 6.4 | missing |
8 | 1 | 48 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 26.0 |
9 | 1 | 66 | missing | 0 | missing | 66.7 | 50 | 1 | 4.8 | missing |
10 | 1 | 72 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 22.0 |
11 | 1 | 90 | missing | 0 | missing | 66.7 | 50 | 1 | 3.1 | missing |
12 | 1 | 96 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 28.0 |
13 | 1 | 112 | missing | 0 | missing | 66.7 | 50 | 1 | 2.5 | missing |
14 | 1 | 120 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 33.0 |
This wide_df
can be easily parsed into a Population
using read_pumas
:
using Pumas
pop = read_pumas(
wide_df;
id = :ID,
time = :TIME,
evid = :EVID,
amt = :AMT,
cmt = :CMT,
observations = [:DV_1, :DV_2],
)
Population
Subjects: 1
Observations: DV_1, DV_2