Data Representation

Pumas datasets are convenient, for example, allowing character values in fields which can add clarity to dataset content. Like NM-TRAN formatted datasets, they are organized with rows of data representing event records, with dose and observation events occurring on different records that are sorted in sequence of time within the records of an individual. Pumas supports most variables and other aspects of NM-TRAN formatted datasets. An important difference exists for datasets with multiple types of dependent variables (DV; e.g. a dataset with two analytes or both PK and PD observation types). NM-TRAN formatted datasets with multiple DV types adopt a long format, where the dependent variable values are stored in a single column. In contrast, Pumas datasets require the DV values of different data types to be in separate columns, i.e, a wide format. If there is a singe DV type in the dataset, the NM-TRAN formatted data should work quite easily in Pumas without modification. In the case of an NM-TRAN dataset with multiple DV types, the formatted dataset would have two columns to identify the 'DV' value, and it's type. The type can be defined by the CMT variable, or by a user provided variable such as DVID. In this NM-TRAN format: DV is the observation measurement value and DVID is type of the observation measurement, often an integer. For the Pumas dataset, a variable like DVID is not required since the DV values occur in different columns.

In Pumas, we can support multiple DVs, but it does not accept multiple DVs in a single column. Pumas expects that each DV has its own column. The operation of converting the NM-TRAN format to a Pumas compatible format is common in data wrangling workflows and is named pivot longer. When we have a dataset where one column has multiple observations, i.e. "long" format, we need to pivot it to a state where each observation has its own column, i.e. "wide" format.

To illustrate how to convert multiple DVs NM-TRAN dataset into Pumas, we will use the following example of a NM-TRAN formatted dataset which doesn't have DVID, but different compartments in the CMT column:

using DataFramesMeta

df = DataFrame(;
    ID = 1,
    TIME = repeat([0; 24:12:48; 72:24:120]; inner = 2),
    DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
    CMT = repeat([1, 2]; outer = 7),
    EVID = [1; repeat([0], 13)],
    AMT = [100; repeat([missing], 13)],
    WT = 66.7,
    AGE = 50,
    SEX = 1,
)

14×9 DataFrame

Row	ID	TIME	DV	CMT	EVID	AMT	WT	AGE	SEX
	Int64	Int64	Float64?	Int64	Int64	Int64?	Float64	Int64	Int64
1	1	0	missing	1	1	100	66.7	50	1
2	1	0	100.0	2	0	missing	66.7	50	1
3	1	24	9.2	1	0	missing	66.7	50	1
4	1	24	49.0	2	0	missing	66.7	50	1
5	1	36	8.5	1	0	missing	66.7	50	1
6	1	36	32.0	2	0	missing	66.7	50	1
7	1	48	6.4	1	0	missing	66.7	50	1
8	1	48	26.0	2	0	missing	66.7	50	1
9	1	72	4.8	1	0	missing	66.7	50	1
10	1	72	22.0	2	0	missing	66.7	50	1
11	1	96	3.1	1	0	missing	66.7	50	1
12	1	96	28.0	2	0	missing	66.7	50	1
13	1	120	2.5	1	0	missing	66.7	50	1
14	1	120	33.0	2	0	missing	66.7	50	1

We need to do some data wrangling before we convert the DataFrame into a wide format. The :CMT column needs to have positive amounts for dosing events (EVID == 1) and missing values for measurements events (EVID == 0). But before that, let's duplicate the :CMT as :DVID column, since it has the information of which DV type this row belongs to:

@transform! df :DVID = :CMT

14×10 DataFrame

Row	ID	TIME	DV	CMT	EVID	AMT	WT	AGE	SEX	DVID
	Int64	Int64	Float64?	Int64	Int64	Int64?	Float64	Int64	Int64	Int64
1	1	0	missing	1	1	100	66.7	50	1	1
2	1	0	100.0	2	0	missing	66.7	50	1	2
3	1	24	9.2	1	0	missing	66.7	50	1	1
4	1	24	49.0	2	0	missing	66.7	50	1	2
5	1	36	8.5	1	0	missing	66.7	50	1	1
6	1	36	32.0	2	0	missing	66.7	50	1	2
7	1	48	6.4	1	0	missing	66.7	50	1	1
8	1	48	26.0	2	0	missing	66.7	50	1	2
9	1	72	4.8	1	0	missing	66.7	50	1	1
10	1	72	22.0	2	0	missing	66.7	50	1	2
11	1	96	3.1	1	0	missing	66.7	50	1	1
12	1	96	28.0	2	0	missing	66.7	50	1	2
13	1	120	2.5	1	0	missing	66.7	50	1	1
14	1	120	33.0	2	0	missing	66.7	50	1	2

And we reassign the :CMT column values:

@rtransform! df :CMT = :EVID == 1 ? :CMT : missing

14×10 DataFrame

Row	ID	TIME	DV	CMT	EVID	AMT	WT	AGE	SEX	DVID
	Int64	Int64	Float64?	Int64?	Int64	Int64?	Float64	Int64	Int64	Int64
1	1	0	missing	1	1	100	66.7	50	1	1
2	1	0	100.0	missing	0	missing	66.7	50	1	2
3	1	24	9.2	missing	0	missing	66.7	50	1	1
4	1	24	49.0	missing	0	missing	66.7	50	1	2
5	1	36	8.5	missing	0	missing	66.7	50	1	1
6	1	36	32.0	missing	0	missing	66.7	50	1	2
7	1	48	6.4	missing	0	missing	66.7	50	1	1
8	1	48	26.0	missing	0	missing	66.7	50	1	2
9	1	72	4.8	missing	0	missing	66.7	50	1	1
10	1	72	22.0	missing	0	missing	66.7	50	1	2
11	1	96	3.1	missing	0	missing	66.7	50	1	1
12	1	96	28.0	missing	0	missing	66.7	50	1	2
13	1	120	2.5	missing	0	missing	66.7	50	1	1
14	1	120	33.0	missing	0	missing	66.7	50	1	2

To convert this DataFrame into a wide format that Pumas expects, you'll need to call the unstack function from the DataFrames.jl package. The first positional argument of unstack is the DataFrame that you'd want to "unstack" (make it wider). The second position argument is the rowkey columns. These are the columns with a unique key for each row. In our case this is just one column: :DVID. The third positional argument is the value columns. Analogously, these are the columns where the values of the variable to unstack are stored. In our case again this is just one column: :DV. Finally, you can also use an anonymous function to the keyword argument renamecols to specify how unstack will rename the new columns when converting your data to a wide format:

wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))

8×10 DataFrame

Row	ID	TIME	CMT	EVID	AMT	WT	AGE	SEX	DV_1	DV_2
	Int64	Int64	Int64?	Int64	Int64?	Float64	Int64	Int64	Float64?	Float64?
1	1	0	1	1	100	66.7	50	1	missing	missing
2	1	0	missing	0	missing	66.7	50	1	missing	100.0
3	1	24	missing	0	missing	66.7	50	1	9.2	49.0
4	1	36	missing	0	missing	66.7	50	1	8.5	32.0
5	1	48	missing	0	missing	66.7	50	1	6.4	26.0
6	1	72	missing	0	missing	66.7	50	1	4.8	22.0
7	1	96	missing	0	missing	66.7	50	1	3.1	28.0
8	1	120	missing	0	missing	66.7	50	1	2.5	33.0

This would also work if the measurement times between the multiple DVs were mismatching. See this altered example and notice that the time value for the different DVs do not match anymore:

df = DataFrame(;
    ID = 1,
    TIME = [0, 0, 12, 24, 32, 36, 44, 48, 66, 72, 90, 96, 112, 120],
    DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
    CMT = repeat([1, 2]; outer = 7),
    EVID = [1; repeat([0], 13)],
    AMT = [100; repeat([missing], 13)],
    WT = 66.7,
    AGE = 50,
    SEX = 1,
)

14×9 DataFrame

Row	ID	TIME	DV	CMT	EVID	AMT	WT	AGE	SEX
	Int64	Int64	Float64?	Int64	Int64	Int64?	Float64	Int64	Int64
1	1	0	missing	1	1	100	66.7	50	1
2	1	0	100.0	2	0	missing	66.7	50	1
3	1	12	9.2	1	0	missing	66.7	50	1
4	1	24	49.0	2	0	missing	66.7	50	1
5	1	32	8.5	1	0	missing	66.7	50	1
6	1	36	32.0	2	0	missing	66.7	50	1
7	1	44	6.4	1	0	missing	66.7	50	1
8	1	48	26.0	2	0	missing	66.7	50	1
9	1	66	4.8	1	0	missing	66.7	50	1
10	1	72	22.0	2	0	missing	66.7	50	1
11	1	90	3.1	1	0	missing	66.7	50	1
12	1	96	28.0	2	0	missing	66.7	50	1
13	1	112	2.5	1	0	missing	66.7	50	1
14	1	120	33.0	2	0	missing	66.7	50	1

@chain df begin
    @transform! :DVID = :CMT
    @rtransform! :CMT = :EVID == 1 ? :CMT : missing
end

14×10 DataFrame

Row	ID	TIME	DV	CMT	EVID	AMT	WT	AGE	SEX	DVID
	Int64	Int64	Float64?	Int64?	Int64	Int64?	Float64	Int64	Int64	Int64
1	1	0	missing	1	1	100	66.7	50	1	1
2	1	0	100.0	missing	0	missing	66.7	50	1	2
3	1	12	9.2	missing	0	missing	66.7	50	1	1
4	1	24	49.0	missing	0	missing	66.7	50	1	2
5	1	32	8.5	missing	0	missing	66.7	50	1	1
6	1	36	32.0	missing	0	missing	66.7	50	1	2
7	1	44	6.4	missing	0	missing	66.7	50	1	1
8	1	48	26.0	missing	0	missing	66.7	50	1	2
9	1	66	4.8	missing	0	missing	66.7	50	1	1
10	1	72	22.0	missing	0	missing	66.7	50	1	2
11	1	90	3.1	missing	0	missing	66.7	50	1	1
12	1	96	28.0	missing	0	missing	66.7	50	1	2
13	1	112	2.5	missing	0	missing	66.7	50	1	1
14	1	120	33.0	missing	0	missing	66.7	50	1	2

Now we use the same unstack call as above:

wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))

14×10 DataFrame

Row	ID	TIME	CMT	EVID	AMT	WT	AGE	SEX	DV_1	DV_2
	Int64	Int64	Int64?	Int64	Int64?	Float64	Int64	Int64	Float64?	Float64?
1	1	0	1	1	100	66.7	50	1	missing	missing
2	1	0	missing	0	missing	66.7	50	1	missing	100.0
3	1	12	missing	0	missing	66.7	50	1	9.2	missing
4	1	24	missing	0	missing	66.7	50	1	missing	49.0
5	1	32	missing	0	missing	66.7	50	1	8.5	missing
6	1	36	missing	0	missing	66.7	50	1	missing	32.0
7	1	44	missing	0	missing	66.7	50	1	6.4	missing
8	1	48	missing	0	missing	66.7	50	1	missing	26.0
9	1	66	missing	0	missing	66.7	50	1	4.8	missing
10	1	72	missing	0	missing	66.7	50	1	missing	22.0
11	1	90	missing	0	missing	66.7	50	1	3.1	missing
12	1	96	missing	0	missing	66.7	50	1	missing	28.0
13	1	112	missing	0	missing	66.7	50	1	2.5	missing
14	1	120	missing	0	missing	66.7	50	1	missing	33.0

This wide_df can be easily parsed into a Population using read_pumas:

using Pumas

pop = read_pumas(
    wide_df;
    id = :ID,
    time = :TIME,
    evid = :EVID,
    amt = :AMT,
    cmt = :CMT,
    observations = [:DV_1, :DV_2],
)

Population
  Subjects: 1
  Observations: DV_1, DV_2