dplyr
Implement `enforce()` and friends
library(dplyr)
# Errors if there are any failures, otherwise returns `mtcars` invisibly
# (so it can be used in a pipeline)
mtcars %>%
  enforce(
    "MPG meets minimum guidelines" = mpg > 12,
    "Cylinders are within range [4, 8]" = between(cyl, 4, 8),
    hp < 240
  )
#> Error: Enforcement failed. The following requirements were not met:
#> • 2 rows failed: MPG meets minimum guidelines.
#> • 4 rows failed: hp < 240.
#> Locate failures by calling `enforce_last()`.
enforce_last()
#> # A tibble: 6 × 3
#> requirement row data$mpg $cyl $disp $hp $drat $wt $qsec $vs $am $gear $carb
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 MPG meets minimum guidelines 15 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
#> 2 MPG meets minimum guidelines 16 10.4 8 460 215 3 5.42 17.8 0 0 3 4
#> 3 hp < 240 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 4 hp < 240 24 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
#> 5 hp < 240 29 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
#> 6 hp < 240 31 15 8 301 335 3.54 3.57 14.6 0 1 5 8
# Or go straight to the failure tibble to more easily compute on it
mtcars %>%
  enforce_show(
    "MPG meets minimum guidelines" = mpg > 12,
    "Cylinders are within range [4, 8]" = between(cyl, 4, 8),
    hp < 240
  )
#> # A tibble: 6 × 3
#> requirement row data$mpg $cyl $disp $hp $drat $wt $qsec $vs $am $gear $carb
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 MPG meets minimum guidelines 15 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
#> 2 MPG meets minimum guidelines 16 10.4 8 460 215 3 5.42 17.8 0 0 3 4
#> 3 hp < 240 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 4 hp < 240 24 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
#> 5 hp < 240 29 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
#> 6 hp < 240 31 15 8 301 335 3.54 3.57 14.6 0 1 5 8
If we implement this, we'll need to be crystal clear about the scope; it's not a competitor to pointblank or similar, and it's unlikely to gain any more features. It's a narrowly scoped data checking framework that is very easy to use from dplyr.
While a packed column feels like the "right" data structure here, I don't think it's viable in such a user-facing function. I wonder if perhaps we need to split `enforce_show()` into two functions? `enforce_failures()` would just filter for the failing rows, and `enforce_reasons()` would include the reasons? Still not sure how you'd join them together in the absence of a unique key. Maybe both would get a `.row` column that would reference the row in the original data frame?
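To make the two-function idea concrete, here is a rough sketch of how a shared `.row` column could act as the join key. The names `enforce_reasons_sketch()` and `enforce_failures_sketch()` are hypothetical stand-ins, not anything that exists in dplyr, and the sketch ignores grouped data and any pre-existing `.row` column.

```r
library(dplyr)

# Hypothetical helper: one row per (requirement, failing row) pair.
enforce_reasons_sketch <- function(data, ...) {
  # transmute() evaluates each requirement; unnamed ones are named
  # after their expression text (e.g. "hp < 240").
  checks <- transmute(ungroup(as_tibble(data)), ...)
  pieces <- lapply(names(checks), function(nm) {
    ok <- checks[[nm]]
    bad <- which(!ok | is.na(ok)) # treat NA as a failure
    tibble(.row = bad, .requirement = rep(nm, length(bad)))
  })
  bind_rows(pieces)
}

# Hypothetical helper: the failing rows of `data`, each tagged with the
# position it had in the original data frame.
enforce_failures_sketch <- function(data, ...) {
  reasons <- enforce_reasons_sketch(data, ...)
  rows <- sort(unique(reasons$.row))
  out <- as_tibble(data)[rows, ]
  mutate(out, .row = rows, .before = 1)
}

reasons <- enforce_reasons_sketch(
  mtcars,
  "MPG meets minimum guidelines" = mpg > 12,
  "Cylinders are within range [4, 8]" = between(cyl, 4, 8),
  hp < 240
)
failures <- enforce_failures_sketch(
  mtcars,
  "MPG meets minimum guidelines" = mpg > 12,
  "Cylinders are within range [4, 8]" = between(cyl, 4, 8),
  hp < 240
)

# `.row` is the key that ties a reason back to the failing data
joined <- left_join(reasons, failures, by = ".row")
```

With the `mtcars` requirements from the top of the thread, `joined` should carry the same six (requirement, row) pairs as the `enforce_last()` output, just as flat columns rather than a packed `data` column.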
While you are thinking about this function, one thing I'm not sure how to handle is that you can currently use expressions that return a scalar. They "work" because `transmute()` recycles the result for us, so we never get a chance to validate that the expression generated something the same length as the original input.
# works because `transmute()` recycled the single `TRUE` before we had
# access to it
dplyr:::enforce(mtcars, is.double(mpg))
This is not the way you should use `enforce()`, and it also gives awful advice if the scalar condition fails, but I'm not yet sure how to catch this:
dplyr:::enforce(mtcars, !is.double(mpg))
#> Error: Enforcement failed. The following requirements were not met:
#> x 32 rows failed: !is.double(mpg).
#> ℹ Locate failures by calling `enforce_last()`.
dplyr:::enforce_last()
#> # A tibble: 32 × 3
#> requirement row data$mpg $cyl $disp $hp $drat $wt $qsec $vs $am
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 !is.double(mp… 1 21 6 160 110 3.9 2.62 16.5 0 1
#> 2 !is.double(mp… 2 21 6 160 110 3.9 2.88 17.0 0 1
#> 3 !is.double(mp… 3 22.8 4 108 93 3.85 2.32 18.6 1 1
#> 4 !is.double(mp… 4 21.4 6 258 110 3.08 3.22 19.4 1 0
#> 5 !is.double(mp… 5 18.7 8 360 175 3.15 3.44 17.0 0 0
#> 6 !is.double(mp… 6 18.1 6 225 105 2.76 3.46 20.2 1 0
#> 7 !is.double(mp… 7 14.3 8 360 245 3.21 3.57 15.8 0 0
#> 8 !is.double(mp… 8 24.4 4 147. 62 3.69 3.19 20 1 0
#> 9 !is.double(mp… 9 22.8 4 141. 95 3.92 3.15 22.9 1 0
#> 10 !is.double(mp… 10 19.2 6 168. 123 3.92 3.44 18.3 1 0
#> # … with 22 more rows
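One possible way to catch the scalar case, sketched here under the assumption that we are willing to evaluate each requirement ourselves before handing anything to `transmute()`: capture the expressions as quosures, evaluate them against the data, and compare the result length to `nrow(data)`. The function name `check_requirement_sizes()` is hypothetical, and this ignores grouped data frames (and a one-row input, where a scalar is legitimately the right size).

```r
library(rlang)

# Hypothetical pre-check: error if any requirement does not produce
# exactly one value per row of `data`.
check_requirement_sizes <- function(data, ...) {
  quos <- enquos(..., .named = TRUE) # unnamed requirements get their expression text as name
  n <- nrow(data)
  for (nm in names(quos)) {
    # Evaluate with the columns of `data` in scope, before any recycling happens
    result <- eval_tidy(quos[[nm]], data = data)
    if (length(result) != n) {
      abort(sprintf(
        "Requirement `%s` returned %d value(s), but one per row (%d) was expected.",
        nm, length(result), n
      ))
    }
  }
  invisible(data)
}

# Passes: `mpg > 12` yields one logical per row
check_requirement_sizes(mtcars, mpg > 12)

# Errors: `is.double(mpg)` yields a single TRUE
try(check_requirement_sizes(mtcars, is.double(mpg)), silent = TRUE)
```

The cost is that every requirement gets evaluated outside of dplyr's own machinery, so this is only an illustration of where the length check could live, not a claim about how `enforce()` would have to do it.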
Would it be so bad to just use `tibble::add_column()` to add `.requirement` and `.row` columns to the front of the filtered data set without any name repair?
That seems like the simplest approach that still maintains a useful data structure. Since this is a somewhat ephemeral data frame, I'm not too worried about name collisions. Even if they happen, you can still print the data frame and take a look at the failure locations and ideally that's all you care about doing.
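A minimal sketch of that shape, assuming "without any name repair" maps to `.name_repair = "minimal"` and using hand-picked failing rows (15 and 16, the two MPG failures from the output above) rather than anything `enforce()` actually computes:

```r
library(tibble)

# Rows 15 and 16 are the MPG failures shown earlier; hard-coded here
# purely for illustration.
failing_rows <- c(15L, 16L)
failures <- as_tibble(mtcars)[failing_rows, ]

out <- add_column(
  failures,
  .requirement = "MPG meets minimum guidelines", # length 1, recycled to every row
  .row = failing_rows,
  .before = 1, # put the new columns at the front
  .name_repair = "minimal" # skip name checking and repair
)

names(out)
```

The result is a flat tibble with `.requirement` and `.row` followed by the original `mtcars` columns, which keeps it easy to `filter()` or `count()` on afterwards.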
For a first pass I think it's fine to not worry about the scalar problem and to just use `add_column()`. Maybe think about removing some of the user-friendly bits like `enforce_last()`, and encourage people to use things like pointblank if they need more complete handling of these ideas.
We have decided not to pursue this further for now. It feels like we would either create an API that goes too far for what dplyr should do, or one so simple that it isn't useful. If people need something like this, they should use a fully baked package like pointblank for it. If they need a helper that checks if a join key is valid, they can use `dm::check_key()`.