dplyr Implement `enforce()` and friends

Open DavisVaughan opened this issue 2 years ago • 5 comments

library(dplyr)

# Errors if there are any failures, otherwise returns `mtcars` invisibly
# (so it can be used in a pipeline)
mtcars %>%
  enforce(
    "MPG meets minimum guidelines" = mpg > 12, 
    "Cylinders are within range [4, 8]" = between(cyl, 4, 8),
    hp < 240
  )
#> Error: Enforcement failed. The following requirements were not met:
#> • 2 rows failed: MPG meets minimum guidelines.
#> • 4 rows failed: hp < 240.
#> Locate failures by calling `enforce_last()`.

enforce_last()
#> # A tibble: 6 × 3
#>   requirement                    row data$mpg  $cyl $disp   $hp $drat   $wt $qsec   $vs   $am $gear $carb
#>   <chr>                        <int>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 MPG meets minimum guidelines    15     10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#> 2 MPG meets minimum guidelines    16     10.4     8   460   215  3     5.42  17.8     0     0     3     4
#> 3 hp < 240                         7     14.3     8   360   245  3.21  3.57  15.8     0     0     3     4
#> 4 hp < 240                        24     13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
#> 5 hp < 240                        29     15.8     8   351   264  4.22  3.17  14.5     0     1     5     4
#> 6 hp < 240                        31     15       8   301   335  3.54  3.57  14.6     0     1     5     8

# Or go straight to the failure tibble to more easily compute on it
mtcars %>%
  enforce_show(
    "MPG meets minimum guidelines" = mpg > 12, 
    "Cylinders are within range [4, 8]" = between(cyl, 4, 8),
    hp < 240
  )
#> # A tibble: 6 × 3
#>   requirement                    row data$mpg  $cyl $disp   $hp $drat   $wt $qsec   $vs   $am $gear $carb
#>   <chr>                        <int>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 MPG meets minimum guidelines    15     10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#> 2 MPG meets minimum guidelines    16     10.4     8   460   215  3     5.42  17.8     0     0     3     4
#> 3 hp < 240                         7     14.3     8   360   245  3.21  3.57  15.8     0     0     3     4
#> 4 hp < 240                        24     13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
#> 5 hp < 240                        29     15.8     8   351   264  4.22  3.17  14.5     0     1     5     4
#> 6 hp < 240                        31     15       8   301   335  3.54  3.57  14.6     0     1     5     8
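
To make the proposal concrete, here is a minimal sketch of how the two entry points could behave. It is not dplyr's implementation: `enforce_show_sketch()` and `enforce_sketch()` are hypothetical names, the expressions are evaluated directly with `rlang::eval_tidy()` rather than through `transmute()`, grouping is ignored, and `NA` results are not treated as failures.

```r
library(dplyr)
library(rlang)
library(tibble)

# Hypothetical helper: returns one row per (requirement, failing row) pair,
# with the offending rows stored in a packed `data` column.
enforce_show_sketch <- function(data, ...) {
  # Unnamed requirements are labelled with their expression text.
  quos <- enquos(..., .named = TRUE)
  labels <- names(quos)

  failures <- lapply(seq_along(quos), function(i) {
    ok <- eval_tidy(quos[[i]], data = data)
    rows <- which(!ok)  # `which()` drops NA, so NA is not a failure here
    tibble(
      requirement = rep(labels[[i]], length(rows)),
      row = rows,
      data = as_tibble(data[rows, , drop = FALSE])
    )
  })

  bind_rows(failures)
}

# Hypothetical helper: errors if anything failed, otherwise returns `data`
# invisibly so it can sit in the middle of a pipeline.
enforce_sketch <- function(data, ...) {
  failures <- enforce_show_sketch(data, ...)

  if (nrow(failures) > 0) {
    counts <- count(failures, requirement)
    msgs <- sprintf("%d rows failed: %s.", counts$n, counts$requirement)
    abort(c(
      "Enforcement failed. The following requirements were not met:",
      set_names(msgs, rep("x", length(msgs)))
    ))
  }

  invisible(data)
}

failures <- enforce_show_sketch(
  mtcars,
  "MPG meets minimum guidelines" = mpg > 12,
  "Cylinders are within range [4, 8]" = between(cyl, 4, 8),
  hp < 240
)
failures$row
```

The sketch leaves out `enforce_last()`, which would need some package-level state to remember the most recent failure tibble.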

Aug 24 '21 20:08 DavisVaughan

If we implement this, we'll need to be crystal clear about the scope; it's not a competitor to pointblank or similar, and it's unlikely to gain any more features. It's a narrowly scoped data checking framework that is very easy to use from dplyr.

While a packed column feels like the "right" data structure here, I don't think it's viable in such a user facing function. I wonder if perhaps we need to split enforce_show() into two functions? enforce_failures() would just filter for the failing rows, and enforce_reasons() would include the reasons? Still not sure how you'd join them together in the absence of a unique key. Maybe both would get a .row column that would reference the row in the original data frame?
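
One way to picture the two-function split is sketched below. The function names follow the suggestion above, with a `_sketch` suffix to mark them as hypothetical, and the `.row` column is assumed to hold the row position in the original data frame. Expressions are evaluated directly with `rlang::eval_tidy()`, which is an assumption of this sketch rather than how dplyr would do it.

```r
library(dplyr)
library(rlang)
library(tibble)

# Hypothetical: one row per (failing row, reason) pair.
enforce_reasons_sketch <- function(data, ...) {
  quos <- enquos(..., .named = TRUE)

  reasons <- lapply(seq_along(quos), function(i) {
    rows <- which(!eval_tidy(quos[[i]], data = data))
    tibble(.row = rows, reason = rep(names(quos)[[i]], length(rows)))
  })

  bind_rows(reasons)
}

# Hypothetical: each failing row once, tagged with its original position.
enforce_failures_sketch <- function(data, ...) {
  rows <- sort(unique(enforce_reasons_sketch(data, ...)$.row))
  tibble(.row = rows, as_tibble(data[rows, , drop = FALSE]))
}

reasons <- enforce_reasons_sketch(mtcars, mpg > 12, hp < 240)
failures <- enforce_failures_sketch(mtcars, mpg > 12, hp < 240)

# `.row` acts as the key that ties the two results back together.
joined <- inner_join(reasons, failures, by = ".row")
```

With `.row` in both outputs, the missing unique key is no longer a problem: the join is an ordinary `inner_join()`, and a row that fails several requirements simply appears once per reason.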

Sep 10 '21 15:09 hadley

While you are thinking about this function, one thing I'm not sure how to handle is that you can currently use expressions that return a scalar. They "work" because transmute() recycles the scalar for us, so we never get a chance to validate that the expression produced something the same length as the original input:

# works because `transmute()` recycled the single `TRUE` before we had
# access to it
dplyr:::enforce(mtcars, is.double(mpg))

This is not how enforce() should be used, and it also gives awful advice when the scalar condition fails, but I'm not yet sure how to catch this:

dplyr:::enforce(mtcars, !is.double(mpg))
#> Error: Enforcement failed. The following requirements were not met:
#> x 32 rows failed: !is.double(mpg).
#> ℹ Locate failures by calling `enforce_last()`.

dplyr:::enforce_last()
#> # A tibble: 32 × 3
#>    requirement      row data$mpg  $cyl $disp   $hp $drat   $wt $qsec   $vs   $am
#>    <chr>          <int>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 !is.double(mp…     1     21       6  160    110  3.9   2.62  16.5     0     1
#>  2 !is.double(mp…     2     21       6  160    110  3.9   2.88  17.0     0     1
#>  3 !is.double(mp…     3     22.8     4  108     93  3.85  2.32  18.6     1     1
#>  4 !is.double(mp…     4     21.4     6  258    110  3.08  3.22  19.4     1     0
#>  5 !is.double(mp…     5     18.7     8  360    175  3.15  3.44  17.0     0     0
#>  6 !is.double(mp…     6     18.1     6  225    105  2.76  3.46  20.2     1     0
#>  7 !is.double(mp…     7     14.3     8  360    245  3.21  3.57  15.8     0     0
#>  8 !is.double(mp…     8     24.4     4  147.    62  3.69  3.19  20       1     0
#>  9 !is.double(mp…     9     22.8     4  141.    95  3.92  3.15  22.9     1     0
#> 10 !is.double(mp…    10     19.2     6  168.   123  3.92  3.44  18.3     1     0
#> # … with 22 more rows
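
As a rough illustration of one way the scalar case could be caught, the sketch below evaluates each requirement directly against the data with `rlang::eval_tidy()` instead of going through `transmute()`, so the result is inspected before any recycling happens. `check_requirement_size()` is a hypothetical helper, not part of dplyr, and this approach ignores grouping.

```r
library(rlang)

# Hypothetical helper: rejects requirements that do not produce one logical
# value per row of `data`.
check_requirement_size <- function(data, ...) {
  quos <- enquos(..., .named = TRUE)

  for (i in seq_along(quos)) {
    # Evaluated outside of `transmute()`, so nothing has been recycled yet.
    result <- eval_tidy(quos[[i]], data = data)

    if (!is.logical(result) || length(result) != nrow(data)) {
      abort(sprintf(
        "Requirement `%s` must return a logical vector of length %d, not length %d.",
        names(quos)[[i]], nrow(data), length(result)
      ))
    }
  }

  invisible(data)
}

# Passes: one logical value per row.
check_requirement_size(mtcars, mpg > 12)

# Fails: `is.double(mpg)` is a single `TRUE`.
scalar_result <- tryCatch(
  check_requirement_size(mtcars, is.double(mpg)),
  error = function(e) conditionMessage(e)
)
```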

Sep 10 '21 16:09 DavisVaughan

Would it be so bad to just use tibble::add_column() to add .requirement and .row columns to the front of the filtered data set without any name repair?

That seems like the simplest approach that still maintains a useful data structure. Since this is a somewhat ephemeral data frame, I'm not too worried about name collisions. Even if they happen, you can still print the data frame and take a look at the failure locations and ideally that's all you care about doing.
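
A minimal sketch of that idea for a single requirement, assuming the failing row positions have already been computed. The column names `.requirement` and `.row` come from the comment above. `.name_repair = "minimal"` is how tibble is asked to skip name repair.

```r
library(tibble)

data <- as_tibble(mtcars)

# Row positions that fail the requirement `hp < 240`.
failed <- which(!(data$hp < 240))

# Put the bookkeeping columns in front of the filtered rows, without repairing
# names. The single requirement label is recycled to the number of rows.
out <- add_column(
  data[failed, ],
  .requirement = "hp < 240",
  .row = failed,
  .before = 1,
  .name_repair = "minimal"
)

out
```

The result is a flat tibble rather than one with a packed `data` column, so it prints and subsets like any other data frame.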

Sep 10 '21 16:09 DavisVaughan

For a first pass, I think it's fine to not worry about the scalar problem and to just use add_column().

Sep 10 '21 17:09 hadley

Maybe think about removing some of the user-friendly bits like enforce_last(), and encourage people to use something like pointblank if they need more complete handling of these ideas.

Apr 20 '22 13:04 DavisVaughan

We have decided not to pursue this further for now. It feels like we either create an API that goes too far for what dplyr should do, or one so simple that it isn't useful. If people need something like this, they should use a fully baked package like pointblank for it. If they need a helper that checks if a join key is valid, they can use dm::check_key().

Aug 10 '22 18:08 DavisVaughan
