tidyr
tidyr copied to clipboard
Can this joint-column `pivot_longer()` situation be simplified?
Real life situation that I ran into and had some trouble pivoting.
Treatments A and B were the "core" ones of interest, so they were recorded in their own columns. But a patient might be on another treatment too, so it was recorded in an other column. Each treatment has a treatment date that needs to be carried along with the treatment type.
This was particularly hard to pivot because A/A_date, B/B_date, and other/other_date are paired together, and I feel like we dont have a good way to express that?
The other way to do this (which is what we went with) is to split this into 3 data frames, id/A/A_date, id/B/B_date, and id/other/other_date then rename the columns of each to be id/treatment/date, drop missing rows from each, and bind-rows.
library(tidyverse)
treatments <- tibble(
id = 1:5,
A = c("A", NA, "A", NA, NA),
A_date = c(1, NA, 2, NA, NA),
B = c(NA, "B", "B", NA, NA),
B_date = c(NA, 3, 2, NA, NA),
other = c(NA, NA, NA, "C", "D"),
other_date = c(NA, NA, NA, 1, 5)
)
treatments
#> # A tibble: 5 × 7
#> id A A_date B B_date other other_date
#> <int> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 1 A 1 <NA> NA <NA> NA
#> 2 2 <NA> NA B 3 <NA> NA
#> 3 3 A 2 B 2 <NA> NA
#> 4 4 <NA> NA <NA> NA C 1
#> 5 5 <NA> NA <NA> NA D 5
expected <- tibble(
id = c(1, 2, 3, 3, 4, 5),
treatment = c("A", "B", "A", "B", "C", "D"),
date = c(1, 3, 2, 2, 1, 5)
)
expected
#> # A tibble: 6 × 3
#> id treatment date
#> <dbl> <chr> <dbl>
#> 1 1 A 1
#> 2 2 B 3
#> 3 3 A 2
#> 4 3 B 2
#> 5 4 C 1
#> 6 5 D 5
# Pack into A/B/other df-cols with consistent inner column names
treatments2 <- treatments %>%
pack(
A = c(A, A_date),
B = c(B, B_date),
other = c(other, other_date)
) %>%
mutate(
A = rename(A, treatment = A, date = A_date),
B = rename(B, treatment = B, date = B_date),
other = rename(other, treatment = other, date = other_date)
)
treatments2
#> # A tibble: 5 × 4
#> id A$treatment $date B$treatment $date other$treatment $date
#> <int> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 1 A 1 <NA> NA <NA> NA
#> 2 2 <NA> NA B 3 <NA> NA
#> 3 3 A 2 B 2 <NA> NA
#> 4 4 <NA> NA <NA> NA C 1
#> 5 5 <NA> NA <NA> NA D 5
# Pivot with no `name` column, and drop entirely missing rows
pivoted <- pivot_longer(
treatments2,
cols = c(A, B, other),
names_to = NULL,
values_drop_na = TRUE
)
pivoted
#> # A tibble: 6 × 2
#> id value$treatment $date
#> <int> <chr> <dbl>
#> 1 1 A 1
#> 2 2 B 3
#> 3 3 A 2
#> 4 3 B 2
#> 5 4 C 1
#> 6 5 D 5
# Unpack
pivoted %>%
unpack(value)
#> # A tibble: 6 × 3
#> id treatment date
#> <int> <chr> <dbl>
#> 1 1 A 1
#> 2 2 B 3
#> 3 3 A 2
#> 4 3 B 2
#> 5 4 C 1
#> 6 5 D 5
Created on 2022-07-11 by the reprex package (v2.0.1)
Side bar: Should pack() allow renaming in the tidyselection? That would avoid the rename() calls above. Not sure how that would interact with names_sep though - I guess names_sep would apply the stripping on the new names (i.e. on the treatment and date names).
treatments %>%
pack(
A = c(treatment = A, date = A_date),
B = c(treatment = B, date = B_date),
other = c(treatment = other, date = other_date)
)
This is pretty nice:
treatments_dt <- data.table::as.data.table(treatments)
data.table::melt(
treatments_dt,
measure.vars = list(
treatment = c("A", "B", "other"),
date = c("A_date", "B_date", "other_date")
),
na.rm = TRUE
)
#> id variable treatment date
#> 1: 1 1 A 1
#> 2: 3 1 A 2
#> 3: 2 2 B 3
#> 4: 3 2 B 2
#> 5: 4 3 C 1
#> 6: 5 3 D 5
I guess this is the closest we can get with pivot_longer() directly. Requiring names that match names_sep or names_pattern seems somewhat restrictive
treatments %>%
rename(A_treatment = A, B_treatment = B, other_treatment = other) %>%
pivot_longer(
cols = c(starts_with("A"), starts_with("B"), starts_with("other")),
names_to = c(NA, ".value"),
names_sep = "_",
values_drop_na = TRUE
)
#> # A tibble: 6 × 3
#> id treatment date
#> <int> <chr> <dbl>
#> 1 1 A 1
#> 2 2 B 3
#> 3 3 A 2
#> 4 3 B 2
#> 5 4 C 1
#> 6 5 D 5
Using a spec is much cleaner, and this spec doesn't look that complex.
spec <- tibble(
.name = c(
"A", "A_date",
"B", "B_date",
"other", "other_date"
),
.value = rep(c("treatment", "date"), times = 3)
)
spec
#> # A tibble: 6 × 2
#> .name .value
#> <chr> <chr>
#> 1 A treatment
#> 2 A_date date
#> 3 B treatment
#> 4 B_date date
#> 5 other treatment
#> 6 other_date date
pivot_longer_spec(treatments, spec, values_drop_na = TRUE)
#> # A tibble: 6 × 3
#> id treatment date
#> <int> <chr> <dbl>
#> 1 1 A 1
#> 2 2 B 3
#> 3 3 A 2
#> 4 3 B 2
#> 5 4 C 1
#> 6 5 D 5
I wonder if we could have pivot_longer() generate that spec somehow, like with:
# Maybe something like this?
treatments %>%
pivot_longer(
cols = c(starts_with("A"), starts_with("B"), starts_with("other")),
names_to = list(
treatment = c("A", "B", "other"),
date = c("A_date", "B_date", "other_date")
)
)
Using a named list for names_to would mean you can't use names_sep, names_pattern, or values_to.
I was just about to post something similar to the above when I saw this open issue. I second @DavisVaughan, because I am really struggling with teaching my students at Texas A&M University how to pivot_longer when the wide variables have multiple patterns. It would make life so much easier for beginners if, for example, values_to could easily handle multiple patterns. Having to teach students who are learning R to use regular expressions with names_pattern in class 2 or 3 is so restrictive. Here is another example to support @DavisVaughan. Let's say I have the following wide data frame:
df_wide = data.frame(country = c("France", "USA"),
poverty_rate_2005 = c(13.1,10),
poverty_rate_2010 = c(14.0,15.1),
inequality_2005 = c(29.8,41),
inequality_2010 = c(33.7,40.0)
)
print(df_wide)
country poverty_rate_2005 poverty_rate_2010 inequality_2005 inequality_2010
1 France 13.1 14.0 29.8 33.7
2 USA 10.0 15.1 41.0 40.0
Of course, I could reshape the data frame long using pivot_longer as follows with a regular expression in names_pattern:
df_wide %>%
pivot_longer(cols = !country,
names_to = c(".value", "year"),
names_pattern = "^(.*)_([0-9]{4})$",
names_transform = list(year = as.integer))
# A tibble: 4 × 4
country year poverty_rate inequality
<chr> <int> <dbl> <dbl>
1 France 2005 13.1 29.8
2 France 2010 14 33.7
3 USA 2005 10 41
4 USA 2010 15.1 40
However, it would be so intuitive for my students if they could just run something like:
df_wide %>%
pivot_longer(
cols = starts_with("poverty_rate_") | starts_with("inequality_"),
names_to = "year",
values_to = c("poverty_rate", "inequality")
)
Hopefully the above is helpful!
@DavisVaughan, have you heard anything from others?
No, but ill probably take an in depth look at this once we do the next minor release of tidyr, probably in a few months