FR: Option to load missing reasons / codes as separate columns

Open khusmann opened this issue 2 years ago • 1 comments

In many datasets, missing values are interlaced with data as codes or strings. read_delim() et al., presently have an option to replace these values with NA, but do not have an easy way to do anything else with these values. For example:

na_strings_csv <- "
person_id,age,favorite_color
1,20,BLUE
1,_DECLINED_ANSWER_,BLUE
2,21,RED
3,30,_TECHNICAL_ERROR_
4,31,_DECLINED_ANSWER_
5,41,RED
6,50,_TECHNICAL_ERROR_
"

To load these data we'd run read_csv(na_strings_csv, na=c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"))

But if we want to load the missing reasons as its own dataframe, or load the missing reasons in separate columns with a suffix, it requires loading the entire dataframe as text, filtering for the missing reasons and joining to the original dataframe. Given the extensive number of datasets that interlace values and missing reasons, it would be really nice to have an extra option on read_* to make this common task more ergonomic. I propose having a channels arg having the following functionality:

channels="values" gives the default behavior, but channels="missing" loads the missing reasons.

read_csv(
  na_strings_csv,
  na=c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"),
  channels="missing"
)

## # A tibble: 7 × 3
##   person_id age               favorite_color   
##   <chr>     <chr>             <chr>            
## 1 <NA>      <NA>              <NA>             
## 2 <NA>      _DECLINED_ANSWER_ <NA>             
## 3 <NA>      <NA>              <NA>             
## 4 <NA>      <NA>              _TECHNICAL_ERROR_
## 5 <NA>      <NA>              _DECLINED_ANSWER_
## 6 <NA>      <NA>              <NA>             
## 7 <NA>      <NA>              _TECHNICAL_ERROR_

Both values and missingness can be loaded simultaneously as separate columns by passing channels=c("values", "missing")

read_csv(
  na_strings_csv,
  na=c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"),
  channels=c("values", "missing"),
)

## # A tibble: 7 × 6
##   person_id_values age_values favorite_color_values person_id_missing
##              <int>      <dbl> <chr>                 <chr>            
## 1                1         20 BLUE                  <NA>             
## 2                1         NA BLUE                  <NA>             
## 3                2         21 RED                   <NA>             
## 4                3         30 <NA>                  <NA>             
## 5                4         31 <NA>                  <NA>             
## 6                5         41 RED                   <NA>             
## 7                6         50 <NA>                  <NA>             
## # ℹ 2 more variables: age_missing <chr>, favorite_color_missing <chr>

Using a named vector for channels controls the suffixes of the columns

read_csv(
  na_strings_csv,
  na=c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"),
  channels=c(values="", missing="_missing"),
)
## # A tibble: 7 × 6
##   person_id   age favorite_color person_id_missing age_missing      
##       <int> <dbl> <chr>          <chr>             <chr>            
## 1         1    20 BLUE           <NA>              <NA>             
## 2         1    NA BLUE           <NA>              _DECLINED_ANSWER_
## 3         2    21 RED            <NA>              <NA>             
## 4         3    30 <NA>           <NA>              <NA>             
## 5         4    31 <NA>           <NA>              <NA>             
## 6         5    41 RED            <NA>              <NA>             
## 7         6    50 <NA>           <NA>              <NA>             
## # ℹ 1 more variable: favorite_color_missing <chr>

Loading values and missing reasons in separate columns like this greatly facilitates manipulation with tidyverse aggregation & filtering functions. For example, to find the average age of the of individuals that had technical errors reporting their favorite color:

read_csv(
  na_strings_csv,
  na=c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"),
  channels=c(values="", missing="_missing"),
) %>%
  filter(favorite_color_missing == "_TECHNICAL_ERROR_") %>%
  summarize(
    age = mean(age, na.rm=T)
  )

I provide some more examples here here (I also include an example / naive implementation of the above api).

Please let me know if this idea is of interest, and I'd be happy to work on a PR. Cheers!

Nov 27 '23 19:11 khusmann

I've spun this idea into its own package here: https://kylehusmann.com/interlacer/

The main limitation of the package is around column-level missing values. It forces me to load files as character vectors and then use type_convert() which doesn't have complete parity with the vroom read_* functions.

So even though this package works as a sort of stop-gap, I still think this kind of functionality would still be better built-in to readr / vroom.

Mar 25 '24 16:03 khusmann