Initial tlang compatibility framework

Open mstackhouse opened this issue 3 years ago • 13 comments

This is the initial introduction of updating Tplyr's numeric summary framework to output a dataframe directly compatible with {tlang}.

There are two major components of this update that impact Tplyr's design:

  1. A new layer level interface set_summaries()
  2. An overhaul of the functionality of get_numeric_data() which will be backwards compatibility breaking

The general output format we're chasing here loosely resembles the planned CDISC Analysis Results Data standard, following a long dataset format with one result value per row. Furthermore, the focus of {tlang} is to handle the formatting, presentation, and display of data from a consistent input format. As such, using f_str() objects from Tplyr becomes redundant and unnecessary when coupled with {tlang}. To eliminate this redundancy, Tplyr requires a new interface to specify which summaries should be returned.

set_summaries()

This function will be a common interface across layers, with the general parameter format <row label> = vars(var1, var2). For each layer type, in the output dataframe provided by get_numeric_data(), the row label would become a param column, and the variables specified would appear individually on separate rows.

The targeted syntax looks as follows:

t <- tplyr_table(adsl, TRT01P) %>%
  add_layer(
    group_desc(AGE, by = "Age (years)", where= SAFFL=="Y") %>%
      set_summaries(
        "n"        = vars(n),
        "Mean (SD)"= vars(mean, sd),
        "Median"   = vars(median),
        "Q1, Q3"   = vars(q1, q3),
        "Min, Max" = vars(min, max),
        "Missing"  = vars(missing)
      )
  )
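To make the "one result value per row" shape concrete, here is a minimal base R sketch of how a spec like the one above could map to a long dataframe. This is not Tplyr's implementation: summaries_to_long() is a hypothetical helper, the spec is written as a plain list instead of vars(), and how the label and the statistic name are split across the row_label and param columns is an assumption.

```r
# Illustrative only: base R, not Tplyr's implementation.
# summaries_to_long() is a hypothetical helper for this sketch.
summaries_to_long <- function(x, spec) {
  # Statistics the spec can refer to; names mirror the example above
  stats <- c(
    n       = sum(!is.na(x)),
    mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    median  = median(x, na.rm = TRUE),
    min     = min(x, na.rm = TRUE),
    max     = max(x, na.rm = TRUE),
    missing = sum(is.na(x))
  )
  # One data frame per row label, one row per requested statistic
  rows <- lapply(names(spec), function(label) {
    data.frame(
      row_label = label,
      param     = spec[[label]],
      value     = unname(stats[spec[[label]]]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy input (made up), with one missing value
age <- c(63, 71, 58, NA, 80)
spec <- list(
  "n"         = "n",
  "Mean (SD)" = c("mean", "sd"),
  "Min, Max"  = c("min", "max"),
  "Missing"   = "missing"
)
long <- summaries_to_long(age, spec)
print(long)
```

Each label that names two statistics, such as "Mean (SD)", contributes two rows, which is the property a downstream formatter would rely on.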

This PR focuses on desc layers, but the interface should follow for count or shift layers:

t <- tplyr_table(adsl, TRT01P) %>%
  add_layer(
    group_count(AGE, by = "Age (years)", where= SAFFL=="Y") %>%
      set_summaries(
        "n (%)"        = vars(n, pct),
        "Total"        = vars(denom)
      )
  )

get_numeric_data()

The get_numeric_data() function has always been a bit disjointed. Numeric data structures have not been standardized, and the function currently returns each layer's numeric results as a list of data frames. Given that this has little value to the user as it currently stands, get_numeric_data() will be updated to return a single, standardized dataframe which unifies the return format across all provided layers.

This will be a backwards compatibility breaking change. As such the API of get_numeric_data() is subject to change, and the output data format will definitely be changed to return a single dataframe instead of a list of dataframes.
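As a rough picture of the change from a list of dataframes to a single dataframe, the base R sketch below stacks per-layer results after padding them to a shared set of columns. The data are made up, unify_layers() is a hypothetical helper, and the column names are assumptions; the real function may well do this differently.

```r
# Illustrative only: toy per-layer results, not real Tplyr output.
layer_results <- list(
  data.frame(row_label1 = "Age (years)", col1 = "Placebo",
             param = c("mean", "sd"), value = c(70, 8),
             stringsAsFactors = FALSE),
  data.frame(row_label1 = "Sex", row_label2 = "F", col1 = "Placebo",
             param = c("n", "pct"), value = c(5, 0.5),
             stringsAsFactors = FALSE)
)

# Hypothetical helper: pad every data frame to the union of columns, then stack
unify_layers <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA_character_  # label columns a layer does not use stay NA
    }
    df[all_cols]
  })
  do.call(rbind, padded)
}

numeric_data <- unify_layers(layer_results)
print(numeric_data)
```

The point of the sketch is only that every layer ends up with the same columns, so consumers no longer need to know which list element came from which layer type.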

mstackhouse avatar Jun 13 '22 15:06 mstackhouse

@statasaurus tagging you to put this on your radar

mstackhouse avatar Jun 13 '22 15:06 mstackhouse

@elimillera I want to keep the development of this functionality off of devel until we're really ready. So the tlang_compatibility branch was created off of devel, and then this development branch is created off of tlang_compatibility. We can PR into tlang_compatibility until the functionality is ready and then we'll merge that into devel.

I'm currently not targeting completion of this feature in the v1.0.0 release.

mstackhouse avatar Jun 13 '22 15:06 mstackhouse

Ok - I have a working proof of concept here. Without modifying the get_numeric_data() API, this has count layers and desc layers returning a standardized dataframe with renamed row labels - and a new addition here is that it'll rename column variables to coln (e.g. col1). So @elimillera we need shift layers to finish the basic POC.

@statasaurus and @thebioengineer, here's some test code working on the current branch to do a demographics summary and a nested adverse events table. Can you give some feedback?

Note - sorting is not yet handled with the numeric data. This is difficult because it's basically a total refactor of how sorting variables are handled by Tplyr, so it's going to take time to figure out a) how we want to approach it and b) whether we'd want to tie it in with a larger refactoring of order variables in Tplyr.

# Read in data
adsl <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adsl.xpt"))
adae <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adae.xpt"))

# Demographics basics
demog <- tplyr_table(adsl, TRT01P) %>% 
  add_layer(
    group_desc(AGE, by = "Age (years)") %>% 
      set_summaries(
        "Mean (SD)" = vars(mean, sd),
        "Min, Max" = vars(min, max), 
        "IQR" = vars(iqr)
      )
  ) %>% 
  add_layer(
    group_count(AGEGR1, by = "Age Groups") %>% 
      add_total_row() %>% 
      set_summaries(
        "n (%)" = vars(n, pct)
      )
  ) %>% 
  add_layer(
    group_count(RACE, by = "Race") %>% 
      add_total_row() %>% 
      set_summaries(
        "n (%)" = vars(n, pct)
      )
  ) %>% 
  add_layer(
    group_desc(WEIGHTBL, by = "Weight at Baseline") %>% 
      set_summaries(
        "Mean (SD)" = vars(mean, sd),
        "Min, Max" = vars(min, max), 
        "IQR" = vars(iqr)
      )
  ) %>% 
  get_numeric_data()

# Nested Adverse Events
ae <- tplyr_table(adae, TRTA) %>% 
  set_pop_data(adsl) %>% 
  set_pop_treat_var(TRT01A) %>% 
  add_layer(
    group_count(vars(AEBODSYS, AEDECOD)) %>% 
      set_indentation("") %>% 
      set_summaries(
        "n (%)" = vars(distinct_n, distinct_pct, n)
      )
  ) %>% 
  get_numeric_data()
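For anyone reviewing the AE numbers, here is a small base R sketch of what I'd assume distinct_n, distinct_pct, and n mean when population data supplies the denominator. The data are made up, the choice of USUBJID as the distinct-by variable is an assumption, and this is not the code Tplyr runs.

```r
# Toy event data (made up): a subject can have several events
adae_toy <- data.frame(
  USUBJID = c("01", "01", "02", "03", "03", "03"),
  TRTA    = c("A",  "A",  "A",  "B",  "B",  "B"),
  stringsAsFactors = FALSE
)
# Toy population data: includes subjects with no events at all
adsl_toy <- data.frame(
  USUBJID = c("01", "02", "03", "04", "05"),
  TRT01A  = c("A",  "A",  "B",  "B",  "A"),
  stringsAsFactors = FALSE
)

# n counts event records; distinct_n counts unique subjects with an event
n          <- tapply(adae_toy$USUBJID, adae_toy$TRTA, length)
distinct_n <- tapply(adae_toy$USUBJID, adae_toy$TRTA,
                     function(x) length(unique(x)))

# Denominator comes from the population data, not the event data
denom        <- table(adsl_toy$TRT01A)
distinct_pct <- distinct_n / as.numeric(denom[names(distinct_n)])

print(rbind(n = n, distinct_n = distinct_n, distinct_pct = distinct_pct))
```

Treatment A has three subjects in the population but only two with events, so the percentage uses 3 as the denominator instead of 2.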

mstackhouse avatar Jun 18 '22 18:06 mstackhouse

So I know you guys are still working on this. But, when I try to run this code and build I get an error

t <- tplyr_table(adsl, TRT01P) %>%
  add_layer(
    group_desc(AGE, by = "Age (years)", where= SAFFL=="Y") %>%
      set_summaries(
        "n"        = vars(n),
        "Mean (SD)"= vars(mean, sd),
        "Median"   = vars(median),
        "Q1, Q3"   = vars(q1, q3),
        "Min, Max" = vars(min, max),
        "Missing"  = vars(missing)
      )
  )

Error in env_get():
! Can't find max_length in environment.
Run rlang::last_error() to see where the error occurred.

When I run more or less the same code but with set_format_strings() and f_str it works just fine

statasaurus avatar Jun 27 '22 07:06 statasaurus

Can you give the rlang::last_error() output?

mstackhouse avatar Jun 27 '22 11:06 mstackhouse

Oh - wait. Run get_numeric_data() instead of build. The tlang compatibility is meant to work with that function instead of build()

mstackhouse avatar Jun 27 '22 11:06 mstackhouse

You were right I think I used build. Question, would it be possible to make this work

tplyr_table(adsl, TRT01P, where= SAFFL =="Y") %>% 
add_layer(
  group_count(SEX, by = "Sex n (%)") 
) %>% 
  add_layer(
    group_desc(AGE, by = "Age (years)") 
  ) %>% 
  get_numeric_data()

At the moment if you do this you get 10 columns with a mishmash of names

statasaurus avatar Jun 27 '22 14:06 statasaurus

@statasaurus pull my latest push and try now.

Originally, I only made the numeric data work properly when set_summaries() was used. Now I updated it to use either format strings or set_summaries(). One note though, you would need to create the table like this:

tplyr_table(adsl, TRT01P, where= SAFFL =="Y") %>% 
  add_layer(
    group_count(SEX, by = vars("Sex", "n (%)") )
  ) %>% 
  add_layer(
    group_desc(AGE, by = "Age (years)") 
  ) %>% 
  get_numeric_data()

Reason being, count layers have no row label assigned by default for the layer like descriptive statistic layers do. That's unique to the way set_summaries() builds it because count layers only output one row per categorical value. So basically you can bypass it by providing an extra row label to the by parameter. I can't assume this value because if you change the format string, then that would change the row label. So this is the easiest way to make it work by default.

Let me know if you have any more comments!

mstackhouse avatar Jun 28 '22 02:06 mstackhouse

So this used to work for me, but now doesn't

ae <- tplyr_table(adae, TRT01A) %>%
  add_total_group() %>% 
  add_layer(
    group_count(vars(AEBODSYS, AETERM)) %>% 
      set_summaries("n (%)"= vars(n, pct)) %>% 
      set_distinct_by(USUBJID) 
  )  %>% 
  get_numeric_data()

This is the error I get:

Error in group_by():
! Must group by variables found in .data.
✖ Column AEBODSYS is not found.
Run rlang::last_error() to see where the error occurred.

rlang::last_error()
<error/rlang_error>
Error in group_by():
! Must group by variables found in .data.
✖ Column AEBODSYS is not found.
Backtrace:
  1. ... %>% get_numeric_data()
  2. dplyr:::group_by.data.frame(., !!target_var[[1]])
Run rlang::last_trace() to see the full context.

If I keep everything else the same, just remove AETERM it works, just makes the wrong table

statasaurus avatar Jun 28 '22 09:06 statasaurus

So this isn't super urgent for me but just so you know, now it can't build, only get_numeric_data

statasaurus avatar Jun 29 '22 08:06 statasaurus

@statasaurus pushed fixes for the build

mstackhouse avatar Jun 30 '22 01:06 mstackhouse

When using set_pop_data() the totals are off

statasaurus avatar Jul 14 '22 19:07 statasaurus

@statasaurus fix should now be available for distinct counts with pop data:

load(test_path('adae.Rdata'))

tplyr_table(adae, TRTA) %>%
  set_pop_data(adsl) %>%
  set_pop_treat_var(TRT01A) %>%
  add_layer(
    group_count("Any Body System") %>%
      set_distinct_by(USUBJID) 
  ) %>%
  get_numeric_data()

# A tibble: 6 × 4
  row_label1      col1                 param         value
  <chr>           <chr>                <chr>         <dbl>
1 Any Body System Placebo              distinct_n   32    
2 Any Body System Placebo              distinct_pct  0.372
3 Any Body System Xanomeline High Dose distinct_n   43    
4 Any Body System Xanomeline High Dose distinct_pct  0.512
5 Any Body System Xanomeline Low Dose  distinct_n   50    
6 Any Body System Xanomeline Low Dose  distinct_pct  0.595

tplyr_table(adae, TRTA) %>%
  set_pop_data(adsl) %>%
  set_pop_treat_var(TRT01A) %>%
  add_layer(
    group_count("Any Body System") %>%
      set_distinct_by(USUBJID) 
  ) %>%
  build() %>% 
  select(var1_Placebo)

# A tibble: 1 × 1
  var1_Placebo
  <chr>       
1 32 ( 37.2%) 
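To show how the two outputs line up, here is a minimal base R sketch that takes the Placebo rows from the numeric data above and rebuilds the display string. The sprintf() widths are an assumption picked to reproduce the string shown by build(); this is not how Tplyr or {tlang} formats values internally.

```r
# Placebo rows copied from the get_numeric_data() output above
numeric_data <- data.frame(
  col1  = "Placebo",
  param = c("distinct_n", "distinct_pct"),
  value = c(32, 0.372),
  stringsAsFactors = FALSE
)

n   <- numeric_data$value[numeric_data$param == "distinct_n"]
pct <- numeric_data$value[numeric_data$param == "distinct_pct"]

# Assumed format: 2-wide integer, then percent in a 5-wide field with 1 decimal
display <- sprintf("%2d (%5.1f%%)", as.integer(n), pct * 100)
print(display)
```

The numeric data keeps the percentage as a proportion (0.372), so any presentation layer has to scale it by 100 before formatting.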

mstackhouse avatar Jul 19 '22 14:07 mstackhouse