REDCapTidieR icon indicating copy to clipboard operation
REDCapTidieR copied to clipboard

[FEATURE] Function to join data tibbles

Open ezraporter opened this issue 6 months ago • 6 comments

Feature Request Description

Function to join together data tibbles that picks smart default keys to join on based on the events each tibble is mapped to.

Proposed Solution

Some possible APIs:

supertbl |>
  join_data_tibble(
    x = "tbl1",
    y = "tbl2",
    by = c("record_id", "redcap_event"), # optional! Inferred by default
    type = "left"
   )

# Return a tibble

supertbl |>
  join_data_tibble(
    x = "tbl1",
    y = "tbl2",
    by = c("record_id", "redcap_event"), # optional! Inferred by default
    type = "left",
    name = "my_new_tibble"
   )

# Return a supertibble with a new tibble added

Additional Context

Data tibble types

API issues aside, I think the key here is that we'd need a way to determine for any two data tibbles which keys to use in a join by default.

Each instrument-event pair can have one of 3 types: Non Repeating, Repeating Separately, or Repeating Together. When I talk about the "type" of an instrument, I mean the aggregation of all the types of instrument-event pairs the instrument is part of. There are 7 options which I mocked up in an example REDCap.

Screenshot 2024-08-28 at 2 03 54 PM Screenshot 2024-08-28 at 2 02 40 PM
library(REDCapTidieR)
library(tidyverse)

data <- read_redcap(Sys.getenv("REDCAP_URI"), "77C2C3A96E00246436A9E49974BD5674", allow_mixed_structure = TRUE)

extract_keys <- function(data) {
  data |>
    colnames() |>
    intersect(c("record_id", "redcap_event", "redcap_form_instance", "redcap_event_instance")) |>
    paste(collapse = ", ")
}

data |>
  mutate(pks = map_chr(redcap_data, extract_keys)) |>
  select(redcap_form_label, pks, structure)

#> A REDCapTidieR Supertibble with 7 instruments
#>  redcap_form_label pks                                            structure   
#>  <chr>             <chr>                                          <chr>       
#>1 NR                record_id, redcap_event                        nonrepeating
#>2 RS                record_id, redcap_event, redcap_form_instance  repeating   
#>3 RT                record_id, redcap_event, redcap_event_instance nonrepeating
#>4 NR, RS            record_id, redcap_event, redcap_form_instance  mixed       
#>5 RS, RT            record_id, redcap_event, redcap_form_instance  mixed       
#>6 NR, RT            record_id, redcap_event, redcap_event_instance nonrepeating
#>7 NR, RS, RT        record_id, redcap_event, redcap_form_instance  mixed 

"Type" here is slightly more specific than "structure". I think for joining any two non-mixed instruments the default should be to join on all common keys. The one complication I see is the NA redcap_event_instance in the "NR, RT" case.

data |>
  extract_tibble("nr_rt")

#># A tibble: 2 × 5
#>  record_id redcap_event       redcap_event_instance nr_rt form_status_complete
#>      <dbl> <chr>                              <dbl> <dbl> <fct>               
#>1         1 non_repeating                         NA     1 Incomplete          
#>2         1 repeating_together                     1     1 Incomplete          

Presumably if you join "NR, RT" to "NR, RT" you want the NAs to join in the non repeating event which is a little non-standard.

Things get a little more complicated when considering the mixed structure instruments. For example, suppose you're joining "RS, RT" to "RS, RT". Presumably you'd want to join on all the keys for records in the RT event but on only record_id and redcap_event for the RS records.

If we wanted to cover all our bases we could try to enumerate the keys we'd expect to join on for all 28 possible combinations.

Missing keys

The above assumes that all instrument-event pairs have data. I noticed that when this isn't true the keys that come back for each data tibble aren't consistent. For example, in the "NR, RS" instrument redcap_form_instance is only included if data is filled out for that instrument in the RS event. Not sure how much this matters for the joins but it might be an additional consideration.

Checklist

  • [x] The issue is atomic
  • [x] The issue description is documented
  • [ ] The issue title describes the problem succinctly
  • [ ] Developers are assigned to the issue
  • [x] Labels are assigned to the issue

ezraporter avatar Aug 28 '24 19:08 ezraporter