Feature Request: single_value()

Open billdenney opened this issue 3 years ago • 5 comments

This function would treat some values as missing and, for all non-missing values, ensure that they are the same.

I often work with datasets where I need to combine information for subjects in clinical trials. For that, I need to ensure that I have the same information from each of the different sources. For example, I may have multiple sources for the age of a subject when they start the study.

When I combine those datasets, the age must be the same across all sources. A paradigm I often use is below. Would that be of interest?

```r
library(tidyverse)
library(bsd.report)

my_data_good <-
  tibble(
    Subject=rep(1:2, each=2),
    Age=c(1, NA, 2, NA)
  ) %>%
  group_by(Subject) %>%
  mutate(
    Age=single_value(Age)
  )
my_data_good
#> # A tibble: 4 x 2
#> # Groups:   Subject [2]
#>   Subject   Age
#>     <int> <dbl>
#> 1       1     1
#> 2       1     1
#> 3       2     2
#> 4       2     2

my_data_bad <-
  tibble(
    Subject=rep(1:2, each=2),
    Age=c(1, NA, 2, 3)
  ) %>%
  group_by(Subject) %>%
  mutate(
    Age=single_value(Age)
  )
#> Error: Problem with `mutate()` input `Age`.
#> x More than one (2) value found (2, 3)
#> i Input `Age` is `single_value(Age)`.
#> i The error occurred in group 2: Subject = 2.
```
Created on 2021-02-04 by the reprex package (v1.0.0)
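
For readers without bsd.report installed, the behavior shown in the reprex can be approximated in base R. This is only a rough sketch under assumptions: `single_value_sketch` is a hypothetical name, and the code is not the bsd.report implementation; it just reproduces the two outcomes above (one shared value returned, or an error listing the conflicting values).

```r
# Hypothetical sketch; NOT the bsd.report implementation.
# Returns the one non-missing value in x, or errors if there are several.
single_value_sketch <- function(x) {
  found <- unique(x[!is.na(x)])
  if (length(found) > 1) {
    stop(
      "More than one (", length(found), ") value found (",
      paste(found, collapse = ", "), ")"
    )
  }
  if (length(found) == 0) {
    # Nothing observed: return a single NA of the same type as x
    return(x[NA_integer_])
  }
  found
}

# Base R stand-in for group_by() + mutate(): ave() recycles the
# length-one result across each Subject group.
subject <- rep(1:2, each = 2)
age_filled <- ave(c(1, NA, 2, NA), subject, FUN = single_value_sketch)
age_filled
```

Because the function returns a length-one value, it relies on the caller (here `ave()`, or `mutate()` in the reprex) to recycle it across the group.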

billdenney • Feb 05 '21 03:02

@sfirke, if I make a PR for this, do you think it would be of interest? (And no worries if you think it's out of scope.)

billdenney • Apr 15 '22 15:04

I like this! It's in scope IMO. It's related to https://github.com/sfirke/janitor/issues/18, where I wanted a function for finding records like the one in my_data_bad above. I think we can address the situation more broadly. In my issue I wanted a diagnostic function; in your example the function works like tidyr::fill, except that it also checks for more than one distinct value, which is kind of diagnostic.

Do you have thoughts on whether it's doable, and what the most elegant way would be, to offer both the diagnostic functionality and the convenience wrapper for fill? One idea, not sure if it's the best: have the function succeed with the fill if there are no invalid combinations, and if there are invalid combos, fail with an error and (??) return the bad records in a data.frame. It feels kinda clunky to squish that into one function, but maybe there's a way to both error and return the bad records? Or have the user specify?

Or maybe it should be two functions and you run the diagnostic one first, then the one you have above. That's probably more tidy-API style.

It would be nice if the diagnostic function could easily be used in an assertr call so that folks can throw a check in a pipeline to be sure there aren't multiple values lurking.
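
To make the two-function idea concrete, here is a minimal base R sketch of what the diagnostic half might look like. Everything in it is an assumption for illustration: `find_conflicting_values` is a hypothetical name and signature, not an existing janitor function. It returns the rows of every group that holds more than one distinct non-missing value, so a zero-row result means the data are clean.

```r
# Hypothetical diagnostic helper; not part of janitor.
# Returns all rows from groups where `value_col` has more than one
# distinct non-missing value.
find_conflicting_values <- function(data, group_col, value_col) {
  n_distinct_values <- tapply(
    data[[value_col]],
    data[[group_col]],
    function(v) length(unique(v[!is.na(v)]))
  )
  bad_groups <- names(n_distinct_values)[n_distinct_values > 1]
  data[as.character(data[[group_col]]) %in% bad_groups, , drop = FALSE]
}

my_data_good <- data.frame(Subject = rep(1:2, each = 2), Age = c(1, NA, 2, NA))
my_data_bad  <- data.frame(Subject = rep(1:2, each = 2), Age = c(1, NA, 2, 3))

# The bad data yields the two rows for Subject 2
conflicts <- find_conflicting_values(my_data_bad, "Subject", "Age")
conflicts

# The good data yields zero rows, so this check passes silently
stopifnot(nrow(find_conflicting_values(my_data_good, "Subject", "Age")) == 0)
```

A check of the form "the diagnostic returns zero rows" is the kind of condition that could be dropped into a pipeline assertion, while the returned data frame remains available for inspecting the offending records.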

sfirke • Apr 15 '22 15:04

Hmm. I don't tend to use fill() because I don't often need LOCF-style (last observation carried forward) imputation. But maybe the right solution is to suggest a new value of "single" for the .direction argument of fill(). Let's hold off here since that seems like a better solution overall. If they don't like it for tidyr, then let's revisit it here.
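
For context, a short sketch of how the existing fill() behaves on the conflicting data from the first post. The `.direction` values tidyr currently documents are "down", "up", "downup", and "updown"; "single" is only the proposal here and is not shown because it does not exist. The point of the example is that fill() has no check for conflicting values.

```r
library(dplyr)
library(tidyr)

# Same values as my_data_bad in the original reprex
filled <-
  tibble(
    Subject = rep(1:2, each = 2),
    Age = c(1, NA, 2, 3)
  ) %>%
  group_by(Subject) %>%
  fill(Age, .direction = "downup") %>%
  ungroup()

# Subject 1's NA is filled with 1; Subject 2 keeps both 2 and 3,
# and no error or warning flags the mismatch.
filled$Age
#> [1] 1 1 2 3
```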

billdenney • Apr 15 '22 17:04

If included here in janitor, I think that the diagnostic and fill functions would be separate. I wouldn't want code to accidentally expect the fill result and get the diagnostic result. For my work, the diagnosis is the error in the example above.

billdenney • Apr 15 '22 18:04

We got an answer back that it's not a good fit for tidyr. I'll work up a PR.

billdenney • Jun 10 '22 15:06

I agree that the error thrown by single_value serves the diagnostic function. It sort of has assertr-type functionality built in: you can call single_value without expecting it to do anything but serve as a check against mismatches.

sfirke • Jan 03 '23 20:01