arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-14071: [R] Try to arrow_eval user-defined functions

Open dragosmg opened this issue 2 years ago • 2 comments

dragosmg avatar Aug 03 '22 15:08 dragosmg

https://issues.apache.org/jira/browse/ARROW-14071

github-actions[bot] avatar Aug 03 '22 16:08 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Aug 03 '22 16:08 github-actions[bot]

This is very cool! It's the most important type of user-defined function because it's 100% translatable using Arrow kernels so it runs in parallel...a lot of applications will benefit from this!

Have you considered adding a registration step? If you do, you may be able to simplify some of this. The dream, of course, is to not require pre-registration at all, which will require an approach much like the one you've sketched out here, (i.e., preprocessing the expression).

library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

register_user_binding <- function(name, f, env = rlang::caller_env()) {
  # copy the bindings environment because we don't want to set the parent
  # of the one-and-only official bindings environment
  bindings_env <- as.environment(as.list(arrow:::nse_funcs))
  parent.env(bindings_env) <- env
  environment(f) <- bindings_env
  
  # register for use in Arrow (non-agg)
  arrow:::register_binding(name, f, update_cache = TRUE)
  
  # in case this is a recursive function
  arrow:::register_binding(name, f, bindings_env)
  
  # so that the user can call this function, too (most Arrow bindings accept
  # regular input, too)
  invisible(f)
}

nchar2 <- register_user_binding("nchar2", function(x) {
  1 + nchar(x)
})

record_batch(my_string = "1234") %>%
  mutate(
    var1 = nchar(my_string),
    var2 = nchar2(my_string)) %>%
  collect()
#> # A tibble: 1 × 3
#>   my_string  var1  var2
#>   <chr>     <int> <dbl>
#> 1 1234          4     5

Created on 2022-08-15 by the reprex package (v2.0.1)

paleolimbot avatar Aug 15 '22 13:08 paleolimbot

Have you considered adding a registration step?

A registration step would definitely simplify this. There was had a brief conversation (some of which is captured in the capstone doc and the Jira) and I agree with @jonkeane's point of view. There are valid arguments for both, but I am leaning towards users not being required to pre-register a binding / function.

Happy to hear more thoughts, either here or on the design doc.

dragosmg avatar Aug 15 '22 15:08 dragosmg

I like the idea of explicit registration if it means less complexity to maintain and makes it more likely that a user will understand an error that occurs. It also nicely parallels the user-defined function usage (which also requires an explicit registration step) and forms a natural progression (try using a user-defined binding and if that doesn't work, just change it to a user-defined scalar function).

paleolimbot avatar Aug 15 '22 18:08 paleolimbot

I'm not wholly against requiring registration if it is significantly simpler or is the only way. But I'm not sure that requiring registration will necessarily help make errors easier to understand (we'll need to have some good, clear error messages when anything goes wrong here regardless of the path we take).

And I might even argue that having only one kind of registration (for the user-defined scalar function) sets that apart in a positive way (crafting one of those will take care + do things that are surprising compared to all of our other bindings, after all). IMO: people shouldn't (need to) think too much about how the bindings work, just use the code you're used to, and if it works, great, if it gives you a (clear) error that some function isn't native to arrow yet, you can respond to that as is. If we have two different types of registration we'll need to find a way to explain how they are different (and even with great docs explaining that difference, there will be a contingent of people who will wonder what's the difference between the hypothetical register_binding(...) and register_scalar_function(...).

But, again, if autoregistering is a sticking point (though looking through the code that's here it doesn't look like it is...), then sure we can go the registration route first. (or if all the autoregistration stuff is overly burdensome from a maintenance perspective too, of course).

jonkeane avatar Aug 16 '22 09:08 jonkeane

Thanks for this! One comment, so far. It would be nice to have an idea of which parts of this are working, and which parts still need work before they are ready (I see a handful of todos in the code, which is great, but also in a comment of "I've done these bits: ..., still working on these: ...")

All the examples listed in the PR description work. So far this works for:

  • mutate() with:

    • [x] simple expression (nchar2(my_string)) - contains a simple call to a user-defined function
    • [x] a slightly more complicated expression (1 + nchar2(my_string))
    • [x] multiple unknown calls in the same expression (to test the iteration) - nchar2(my_string) + nchar3(my_string)
    • [x] user function defined using namespacing 1 + nchar4(my_string)
  • still left to implement:

    • [x] nested definitions (travelling one level down in the call stack)

dragosmg avatar Aug 16 '22 09:08 dragosmg

@paleolimbot I like your suggestion in https://github.com/apache/arrow/pull/13789#issuecomment-1214992856. Even if we don't require the user to self-register, that function can really simplify my current logic (as we can use it instead of having to walk the AST).

I would still be in favour of registering automatically (that feels a bit more natural to me).

dragosmg avatar Aug 17 '22 15:08 dragosmg

That approach sounds good to me!

paleolimbot avatar Aug 17 '22 16:08 paleolimbot

Update:

  • user-defined functions work with mutate() and filter()
  • we can translate nested function definitions

Give both of these work and the approach is rather complex, by walking the AST, I will now go more in the direction suggested by @paleolimbot (in https://github.com/apache/arrow/pull/13789#issuecomment-1214992856) and refactor / simplify the implementation.

dragosmg avatar Aug 22 '22 12:08 dragosmg

Give both of these work and the approach is rather complex, by walking the AST, I will now go more in the direction suggested by @paleolimbot (in https://github.com/apache/arrow/pull/13789#issuecomment-1214992856) and refactor / simplify the implementation.

Could you say more about what you're thinking about doing here, there were a few suggestions in that comment | thread. Are you going to now require registration? Or something else?

jonkeane avatar Aug 22 '22 12:08 jonkeane

Give both of these work and the approach is rather complex, by walking the AST, I will now go more in the direction suggested by @paleolimbot (in #13789 (comment)) and refactor / simplify the implementation.

Could you say more about what you're thinking about doing here, there were a few suggestions in that comment | thread. Are you going to now require registration? Or something else?

I'm still aiming to not require registration. I was thinking about switching from translation + registration to registration only.

Currently my approach is to take:

nchar2 <- function(x) 1 + nchar(x)

walk its abstract syntax tree, translate it to

nchar2 <- function(x) call_binding("+", 1, call_binding("char", x))

and then register it + update .cache$functions.

However, the translation might not be necessary and complicates the implementation, as we only need to register the "nchar2" binding without actually translating its body for this to work. The reason it doesn't work in main is because the data mask has no knowledge of "nchar2". So updating the context of the evaluation (to be aware of the translation, but without actually translating the body) should work.

So my proposal is to pare back my current approach and see what is the least amount of complexity we can get away with and still have this work the way we want it.

Translating the body also highlighted the overlap between nse_funcs and .cache$functions. We could potentially reduce that complexity too - ARROW-17495, but I'm happy to leave that for another PR.

dragosmg avatar Aug 22 '22 13:08 dragosmg

Ah cool, that sounds good, thanks!

jonkeane avatar Aug 22 '22 13:08 jonkeane

The CI failure (Windows & R 3.6) seems to be related to https://github.com/apache/arrow/pull/13661.

dragosmg avatar Aug 22 '22 17:08 dragosmg

I'm closing this PR for now as it was handed over to me to take a look at, but I don't think I'm going to end up with time to work on this in the near future, and we're just having a bit of a tidy of PRs ahead of the upcoming release.

thisisnic avatar Oct 13 '22 14:10 thisisnic