Tplyr
Initial tlang compatibility framework
This PR is the first step in updating Tplyr's numeric summary framework to output a dataframe directly compatible with {tlang}.
There are two major components of this update that impact Tplyr's design:
- A new layer-level interface, set_summaries()
- An overhaul of the functionality of get_numeric_data(), which will break backwards compatibility
The general output format we're chasing here loosely resembles the planned CDISC Analysis Results Data standard, following a long dataset format with one result value per row. Furthermore, the focus of {tlang} is to handle the formatting, presentation, and display of data from a consistent input format. As such, using f_str() objects from Tplyr becomes redundant and unnecessary when coupled with {tlang}. To eliminate this redundancy, Tplyr requires a new interface to specify which summaries should be returned.
set_summaries()
This function will be a common interface across layers, with the general parameter format <row label> = vars(var1, var2). For each layer type, in the output dataframe provided by get_numeric_data(), the row label would become a param column, and the variables specified would appear individually on separate rows.
The targeted syntax looks as follows:
t <- tplyr_table(adsl, TRT01P) %>%
  add_layer(
    group_desc(AGE, by = "Age (years)", where = SAFFL == "Y") %>%
      set_summaries(
        "n"         = vars(n),
        "Mean (SD)" = vars(mean, sd),
        "Median"    = vars(median),
        "Q1, Q3"    = vars(q1, q3),
        "Min, Max"  = vars(min, max),
        "Missing"   = vars(missing)
      )
  )
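To make the "one result value per row" idea concrete, here is a minimal base R sketch of how named summaries could expand into long format. It does not use Tplyr at all: summaries_to_long() is a hypothetical helper, the column names row_label, param, and value are assumptions for illustration, and the age values are toy data rather than anything from adsl.

```r
# Hypothetical helper, not part of Tplyr: expands a named list of summary
# names into long format, with one result value per row.
summaries_to_long <- function(x, summaries) {
  # Summary functions supported by this sketch (an assumed subset)
  funs <- list(
    n      = function(v) as.numeric(sum(!is.na(v))),
    mean   = function(v) base::mean(v, na.rm = TRUE),
    sd     = function(v) stats::sd(v, na.rm = TRUE),
    median = function(v) stats::median(v, na.rm = TRUE),
    min    = function(v) base::min(v, na.rm = TRUE),
    max    = function(v) base::max(v, na.rm = TRUE)
  )
  # One small data frame per row label, one row per requested summary
  rows <- lapply(names(summaries), function(label) {
    params <- summaries[[label]]
    data.frame(
      row_label = label,
      param     = params,
      value     = vapply(params, function(p) funs[[p]](x), numeric(1),
                         USE.NAMES = FALSE),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

# Toy input, not real study data
age <- c(63, 64, 71, 74, 77, NA)
res <- summaries_to_long(age, list(
  "n"         = "n",
  "Mean (SD)" = c("mean", "sd"),
  "Min, Max"  = c("min", "max")
))
print(res)
```

Each label contributes as many rows as it has summaries, so "Mean (SD)" yields two rows (mean and sd) that a downstream formatter can recombine however it likes.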
This PR focuses on desc layers, but the interface should follow for count or shift layers:
t <- tplyr_table(adsl, TRT01P) %>%
  add_layer(
    group_desc(AGE, by = "Age (years)", where = SAFFL == "Y") %>%
      set_summaries(
        "n (%)" = vars(n, pct),
        "Total" = vars(denom)
      )
  )
get_numeric_data()
The get_numeric_data() function has always been a bit disjointed. Numeric data structures have not been standardized, and the function currently returns each layer's numeric results as a list of data frames. Given that this has little value to the user as it currently stands, get_numeric_data() will be updated to return a single, standardized dataframe which unifies the return format across all provided layers.
This will be a backwards compatibility breaking change. As such the API of get_numeric_data() is subject to change, and the output data format will definitely be changed to return a single dataframe instead of a list of dataframes.
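As a rough illustration of the list-to-single-dataframe change, the base R sketch below stacks per-layer results into one data frame. This is not how Tplyr does it: bind_layers() is a hypothetical helper, the column names are assumptions modeled on the long format described above, and the values are illustrative rather than real results.

```r
# Hypothetical per-layer results, shaped like a list-style return.
# Values are illustrative only.
layer_results <- list(
  data.frame(row_label1 = "Age (years)", col1 = "Placebo",
             param = c("mean", "sd"), value = c(70, 8),
             stringsAsFactors = FALSE),
  data.frame(row_label1 = "Sex", row_label2 = c("F", "M"), col1 = "Placebo",
             param = "n", value = c(50, 30),
             stringsAsFactors = FALSE)
)

# Hypothetical helper: pad columns a layer lacks with NA, then stack the layers
bind_layers <- function(layers) {
  all_cols <- unique(unlist(lapply(layers, names)))
  padded <- lapply(layers, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA_character_
    }
    df[all_cols]  # same column order for every layer
  })
  out <- do.call(rbind, padded)
  rownames(out) <- NULL
  out
}

numeric_data <- bind_layers(layer_results)
print(numeric_data)
```

The point of the sketch is only that every layer ends up with the same columns, so downstream code can filter on param instead of walking a list whose shape depends on the layer type.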
@statasaurus tagging you to put this on your radar
@elimillera I want to keep the development of this functionality off of devel until we're really ready. So the tlang_compatibility branch was created off of devel, and this development branch was then created off of tlang_compatibility. We can PR into tlang_compatibility until the functionality is ready, and then we'll merge that into devel.
I'm currently not targeting completion of this feature in the v1.0.0 release.
Ok - I have a working proof of concept here. Without modifying the get_numeric_data() API, this has count layers and desc layers returning a standardized dataframe with renamed row labels. A new addition is that it will also rename column variables to coln (for example, col1). So @elimillera we need shift layers to finish the basic POC.
@statasaurus and @thebioengineer, here's some test code working on the current branch to do a demographics summary and a nested adverse events table. Can you give some feedback?
Note - sorting is not yet handled with the numeric data. This is difficult because it's basically a total refactor of how sorting variables are handled by Tplyr, so it's going to take time to figure out a) how we want to approach it and b) whether we'd want to tie it in with a larger refactoring of order variables in Tplyr.
# Requires the development branch of Tplyr described above
library(Tplyr)
library(dplyr)

# Read in data
adsl <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adsl.xpt"))
adae <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adae.xpt"))

# Demographics basics
demog <- tplyr_table(adsl, TRT01P) %>%
  add_layer(
    group_desc(AGE, by = "Age (years)") %>%
      set_summaries(
        "Mean (SD)" = vars(mean, sd),
        "Min, Max"  = vars(min, max),
        "IQR"       = vars(iqr)
      )
  ) %>%
  add_layer(
    group_count(AGEGR1, by = "Age Groups") %>%
      add_total_row() %>%
      set_summaries(
        "n (%)" = vars(n, pct)
      )
  ) %>%
  add_layer(
    group_count(RACE, by = "Race") %>%
      add_total_row() %>%
      set_summaries(
        "n (%)" = vars(n, pct)
      )
  ) %>%
  add_layer(
    group_desc(WEIGHTBL, by = "Weight at Baseline") %>%
      set_summaries(
        "Mean (SD)" = vars(mean, sd),
        "Min, Max"  = vars(min, max),
        "IQR"       = vars(iqr)
      )
  ) %>%
  get_numeric_data()

# Nested Adverse Events
ae <- tplyr_table(adae, TRTA) %>%
  set_pop_data(adsl) %>%
  set_pop_treat_var(TRT01A) %>%
  add_layer(
    group_count(vars(AEBODSYS, AEDECOD)) %>%
      set_indentation("") %>%
      set_summaries(
        "n (%)" = vars(distinct_n, distinct_pct, n)
      )
  ) %>%
  get_numeric_data()
So I know you guys are still working on this. But when I try to run this code and build, I get an error:
t <- tplyr_table(adsl, TRT01P) %>%
  add_layer(
    group_desc(AGE, by = "Age (years)", where = SAFFL == "Y") %>%
      set_summaries(
        "n"         = vars(n),
        "Mean (SD)" = vars(mean, sd),
        "Median"    = vars(median),
        "Q1, Q3"    = vars(q1, q3),
        "Min, Max"  = vars(min, max),
        "Missing"   = vars(missing)
      )
  )
Error in env_get():
! Can't find max_length in environment.
Run rlang::last_error() to see where the error occurred.
When I run more or less the same code but with set_format_strings() and f_str(), it works just fine.
Can you give the rlang::last_error() output?
Oh - wait. Run get_numeric_data() instead of build. The tlang compatibility is meant to work with that function instead of build()
You were right, I think I used build. Question: would it be possible to make this work?
tplyr_table(adsl, TRT01P, where = SAFFL == "Y") %>%
  add_layer(
    group_count(SEX, by = "Sex n (%)")
  ) %>%
  add_layer(
    group_desc(AGE, by = "Age (years)")
  ) %>%
  get_numeric_data()
At the moment, if you do this you get 10 columns with a mishmash of names.
@statasaurus pull my latest push and try now.
Originally, I only made the numeric data work properly when set_summaries() was used. Now I've updated it to work with either format strings or set_summaries(). One note though: you would need to create the table like this:
tplyr_table(adsl, TRT01P, where = SAFFL == "Y") %>%
  add_layer(
    group_count(SEX, by = vars("Sex", "n (%)"))
  ) %>%
  add_layer(
    group_desc(AGE, by = "Age (years)")
  ) %>%
  get_numeric_data()
The reason is that count layers have no row label assigned by default for the layer, the way descriptive statistic layers do. That's unique to the way set_summaries() builds it, because count layers only output one row per categorical value. So basically you can work around it by providing an extra row label to the by parameter. I can't assume this value, because if you change the format string, that would change the row label. So this is the easiest way to make it work by default.
Let me know if you have any more comments!
So this used to work for me, but now doesn't:
ae <- tplyr_table(adae, TRT01A) %>%
  add_total_group() %>%
  add_layer(
    group_count(vars(AEBODSYS, AETERM)) %>%
      set_summaries("n (%)" = vars(n, pct)) %>%
      set_distinct_by(USUBJID)
  ) %>%
  get_numeric_data()
This is the error I get
Error in group_by():
! Must group by variables found in .data.
✖ Column AEBODSYS is not found.
Run rlang::last_error() to see where the error occurred.
rlang::last_error()
<error/rlang_error>
Error in group_by():
! Must group by variables found in .data.
✖ Column AEBODSYS is not found.
Backtrace:
- ... %>% get_numeric_data()
- dplyr:::group_by.data.frame(., !!target_var[[1]])
Run rlang::last_trace() to see the full context.
If I keep everything else the same and just remove AETERM, it works, but it makes the wrong table.
So this isn't super urgent for me, but just so you know: build() no longer works now, only get_numeric_data() does.
@statasaurus pushed fixes for the build
When using set_pop_data() the totals are off
@statasaurus fix should now be available for distinct counts with pop data:
load(test_path('adae.Rdata'))
tplyr_table(adae, TRTA) %>%
  set_pop_data(adsl) %>%
  set_pop_treat_var(TRT01A) %>%
  add_layer(
    group_count("Any Body System") %>%
      set_distinct_by(USUBJID)
  ) %>%
  get_numeric_data()
# A tibble: 6 × 4
  row_label1      col1                 param         value
  <chr>           <chr>                <chr>         <dbl>
1 Any Body System Placebo              distinct_n   32
2 Any Body System Placebo              distinct_pct  0.372
3 Any Body System Xanomeline High Dose distinct_n   43
4 Any Body System Xanomeline High Dose distinct_pct  0.512
5 Any Body System Xanomeline Low Dose  distinct_n   50
6 Any Body System Xanomeline Low Dose  distinct_pct  0.595
tplyr_table(adae, TRTA) %>%
  set_pop_data(adsl) %>%
  set_pop_treat_var(TRT01A) %>%
  add_layer(
    group_count("Any Body System") %>%
      set_distinct_by(USUBJID)
  ) %>%
  build() %>%
  select(var1_Placebo)
# A tibble: 1 × 1
var1_Placebo
<chr>
1 32 ( 37.2%)
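As a quick cross-check that the two outputs agree, the Placebo values from get_numeric_data() can be formatted by hand and compared with the build() string. This is only a sketch: the sprintf() format is an assumption that happens to reproduce the displayed string, not necessarily the format string Tplyr uses internally, and 0.372 is the rounded value as printed in the tibble.

```r
# Placebo values as printed by get_numeric_data() above
distinct_n   <- 32
distinct_pct <- 0.372  # rounded as displayed in the tibble

# Assumed display format: count, then the percentage padded to width 5
# with one decimal place
formatted <- sprintf("%d (%5.1f%%)", as.integer(distinct_n), distinct_pct * 100)
print(formatted)
```

If the formatted string differs from what build() shows, that points to a mismatch in the denominator used by one of the two code paths, which is the kind of problem set_pop_data() was causing before the fix.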