tidyr icon indicating copy to clipboard operation
tidyr copied to clipboard

Can this joint-column `pivot_longer()` situation be simplified?

Open DavisVaughan opened this issue 3 years ago • 6 comments
trafficstars

Real life situation that I ran into and had some trouble pivoting.

Treatments A and B were the "core" ones of interest, so they were recorded in their own columns. But a patient might be on another treatment too, so it was recorded in an other column. Each treatment has a treatment date that needs to be carried along with the treatment type.

This was particularly hard to pivot because A/A_date, B/B_date, and other/other_date are paired together, and I feel like we dont have a good way to express that?

The other way to do this (which is what we went with) is to split this into 3 data frames, id/A/A_date, id/B/B_date, and id/other/other_date then rename the columns of each to be id/treatment/date, drop missing rows from each, and bind-rows.

library(tidyverse)

treatments <- tibble(
  id = 1:5,
  A = c("A", NA, "A", NA, NA),
  A_date = c(1, NA, 2, NA, NA),
  B = c(NA, "B", "B", NA, NA),
  B_date = c(NA, 3, 2, NA, NA),
  other = c(NA, NA, NA, "C", "D"),
  other_date = c(NA, NA, NA, 1, 5)
)
treatments
#> # A tibble: 5 × 7
#>      id A     A_date B     B_date other other_date
#>   <int> <chr>  <dbl> <chr>  <dbl> <chr>      <dbl>
#> 1     1 A          1 <NA>      NA <NA>          NA
#> 2     2 <NA>      NA B          3 <NA>          NA
#> 3     3 A          2 B          2 <NA>          NA
#> 4     4 <NA>      NA <NA>      NA C              1
#> 5     5 <NA>      NA <NA>      NA D              5

expected <- tibble(
  id = c(1, 2, 3, 3, 4, 5),
  treatment = c("A", "B", "A", "B", "C", "D"),
  date = c(1, 3, 2, 2, 1, 5)
)
expected
#> # A tibble: 6 × 3
#>      id treatment  date
#>   <dbl> <chr>     <dbl>
#> 1     1 A             1
#> 2     2 B             3
#> 3     3 A             2
#> 4     3 B             2
#> 5     4 C             1
#> 6     5 D             5

# Pack into A/B/other df-cols with consistent inner column names
treatments2 <- treatments %>%
  pack(
    A = c(A, A_date),
    B = c(B, B_date),
    other = c(other, other_date)
  ) %>%
  mutate(
    A = rename(A, treatment = A, date = A_date),
    B = rename(B, treatment = B, date = B_date),
    other = rename(other, treatment = other, date = other_date)
  )

treatments2
#> # A tibble: 5 × 4
#>      id A$treatment $date B$treatment $date other$treatment $date
#>   <int> <chr>       <dbl> <chr>       <dbl> <chr>           <dbl>
#> 1     1 A               1 <NA>           NA <NA>               NA
#> 2     2 <NA>           NA B               3 <NA>               NA
#> 3     3 A               2 B               2 <NA>               NA
#> 4     4 <NA>           NA <NA>           NA C                   1
#> 5     5 <NA>           NA <NA>           NA D                   5

# Pivot with no `name` column, and drop entirely missing rows
pivoted <- pivot_longer(
  treatments2,
  cols = c(A, B, other),
  names_to = NULL,
  values_drop_na = TRUE
)

pivoted
#> # A tibble: 6 × 2
#>      id value$treatment $date
#>   <int> <chr>           <dbl>
#> 1     1 A                   1
#> 2     2 B                   3
#> 3     3 A                   2
#> 4     3 B                   2
#> 5     4 C                   1
#> 6     5 D                   5

# Unpack
pivoted %>%
  unpack(value)
#> # A tibble: 6 × 3
#>      id treatment  date
#>   <int> <chr>     <dbl>
#> 1     1 A             1
#> 2     2 B             3
#> 3     3 A             2
#> 4     3 B             2
#> 5     4 C             1
#> 6     5 D             5

Created on 2022-07-11 by the reprex package (v2.0.1)

DavisVaughan avatar Jul 11 '22 13:07 DavisVaughan

Side bar: Should pack() allow renaming in the tidyselection? That would avoid the rename() calls above. Not sure how that would interact with names_sep though - I guess names_sep would apply the stripping on the new names (i.e. on the treatment and date names).

treatments %>%
  pack(
    A = c(treatment = A, date = A_date),
    B = c(treatment = B, date = B_date),
    other = c(treatment = other, date = other_date)
  )

DavisVaughan avatar Jul 11 '22 13:07 DavisVaughan

This is pretty nice:

treatments_dt <- data.table::as.data.table(treatments)

data.table::melt(
  treatments_dt,
  measure.vars = list(
    treatment = c("A", "B", "other"),
    date = c("A_date", "B_date", "other_date")
  ),
  na.rm = TRUE
)
#>    id variable treatment date
#> 1:  1        1         A    1
#> 2:  3        1         A    2
#> 3:  2        2         B    3
#> 4:  3        2         B    2
#> 5:  4        3         C    1
#> 6:  5        3         D    5

DavisVaughan avatar Jul 11 '22 14:07 DavisVaughan

I guess this is the closest we can get with pivot_longer() directly. Requiring names that match names_sep or names_pattern seems somewhat restrictive

treatments %>%
  rename(A_treatment = A, B_treatment = B, other_treatment = other)  %>%
  pivot_longer(
    cols = c(starts_with("A"), starts_with("B"), starts_with("other")),
    names_to = c(NA, ".value"),
    names_sep = "_",
    values_drop_na = TRUE
  )
#> # A tibble: 6 × 3
#>      id treatment  date
#>   <int> <chr>     <dbl>
#> 1     1 A             1
#> 2     2 B             3
#> 3     3 A             2
#> 4     3 B             2
#> 5     4 C             1
#> 6     5 D             5

Using a spec is much cleaner, and this spec doesn't look that complex.

spec <- tibble(
  .name = c(
    "A", "A_date", 
    "B", "B_date", 
    "other", "other_date"
  ),
  .value = rep(c("treatment", "date"), times = 3)
)
spec
#> # A tibble: 6 × 2
#>   .name      .value   
#>   <chr>      <chr>    
#> 1 A          treatment
#> 2 A_date     date     
#> 3 B          treatment
#> 4 B_date     date     
#> 5 other      treatment
#> 6 other_date date

pivot_longer_spec(treatments, spec, values_drop_na = TRUE)
#> # A tibble: 6 × 3
#>      id treatment  date
#>   <int> <chr>     <dbl>
#> 1     1 A             1
#> 2     2 B             3
#> 3     3 A             2
#> 4     3 B             2
#> 5     4 C             1
#> 6     5 D             5

I wonder if we could have pivot_longer() generate that spec somehow, like with:

# Maybe something like this?
treatments %>%
  pivot_longer(
    cols = c(starts_with("A"), starts_with("B"), starts_with("other")),
    names_to = list(
      treatment = c("A", "B", "other"),
      date = c("A_date", "B_date", "other_date")
    )
  )

Using a named list for names_to would mean you can't use names_sep, names_pattern, or values_to.

DavisVaughan avatar Jul 11 '22 14:07 DavisVaughan

I was just about to post something similar to the above when I saw this open issue. I second @DavisVaughan, because I am really struggling with teaching my students at Texas A&M University how to pivot_longer when the wide variables have multiple patterns. It would make life so much easier for beginners if, for example, values_to could easily handle multiple patterns. Having to teach students who are learning R to use regular expressions with names_pattern in class 2 or 3 is so restrictive. Here is another example to support @DavisVaughan. Let's say I have the following wide data frame:

df_wide = data.frame(country = c("France", "USA"),
                     poverty_rate_2005 = c(13.1,10),
                     poverty_rate_2010 = c(14.0,15.1),
                     inequality_2005 = c(29.8,41),
                     inequality_2010 = c(33.7,40.0)
                     )
print(df_wide)
  country poverty_rate_2005 poverty_rate_2010 inequality_2005 inequality_2010
1  France              13.1              14.0            29.8            33.7
2     USA              10.0              15.1            41.0            40.0

Of course, I could reshape the data frame long using pivot_longer as follows with a regular expression in names_pattern:

df_wide %>% 
  pivot_longer(cols = !country,
               names_to = c(".value", "year"),
               names_pattern = "^(.*)_([0-9]{4})$",
               names_transform = list(year = as.integer))

# A tibble: 4 × 4
  country  year poverty_rate inequality
  <chr>   <int>        <dbl>      <dbl>
1 France   2005         13.1       29.8
2 France   2010         14         33.7
3 USA      2005         10         41  
4 USA      2010         15.1       40 

However, it would be so intuitive for my students if they could just run something like:

df_wide %>%
  pivot_longer(
    cols = starts_with("poverty_rate_") | starts_with("inequality_"),
    names_to = "year",
    values_to = c("poverty_rate", "inequality")
  )

Hopefully the above is helpful!

mikedenly avatar Sep 07 '23 06:09 mikedenly

@DavisVaughan, have you heard anything from others?

mikedenly avatar Sep 16 '23 20:09 mikedenly

No, but ill probably take an in depth look at this once we do the next minor release of tidyr, probably in a few months

DavisVaughan avatar Sep 17 '23 13:09 DavisVaughan