pins-r icon indicating copy to clipboard operation
pins-r copied to clipboard

could `pin_write()` support `type = "arrow"` uncompressed?

Open ijlyttle opened this issue 3 years ago • 7 comments

This may also impact pins for Python, but I thought I would start here.

The motivation for this question is that Arrow for JavaScript does not support compression. I know there's a workaround using pin_upload() and pin_download(), but it seems to add friction for R/Python users.

To be clear, I'm not suggesting that any default behavior change; just to ask about the possibility for something like a compression = FALSE argument somewhere.

Here's a larger writeup describing my motivation, with a first hack on how a BoardUrl could work using JavaScript.

Thanks!

ijlyttle avatar Aug 09 '22 16:08 ijlyttle

Ah, that's too bad about compression in js :/ . Since pins stores metadata for everything, it seems like having a strategy for specifying save/load options there could be useful..

Maybe related in long-term to https://github.com/rstudio/pins-python/issues/18

Right now arguments aren't passed to savers when someone does pin_write, so a quick fix would be adding a new type (e.g. "uncompressed_arrow"?)

machow avatar Aug 09 '22 17:08 machow

The way we handle the dots now from the user-facing function pin_write() passes them through to pin_store(), which means we can do things as outlined in #628 but we can't use the dots for this which happens in object_write().

I wonder if a new "uncompressed_arrow" type is perhaps the best way to go? Or should we try to support the different compress options that RDS has as well when considering this? We could add a new compression or compress argument to pin_write() after the dots without causing problems.

juliasilge avatar Aug 09 '22 20:08 juliasilge

Could something like write_args, a list with arguments to pass to the writing-function, work here? (I'm certain there's a better name)

object_write <- function(x, path, type = "rds", write_args = NULL) {
  type <- arg_match0(type, setdiff(object_types, "file"))

  switch(type,
    arrow = rlang::exec(arrow::write_feather, x, path, !!!write_args) 
    # you get the idea
  )

  path
}

This might present a problem where write_args wants to override an already-named argument, e.g.:

 json = jsonlite::write_json(x, path, auto_unbox = TRUE)

ijlyttle avatar Aug 10 '22 03:08 ijlyttle

That could be a great option, I think. This would still have to go after the dots to avoid a breaking change; the function signature would end up looking like:

pin_write(
  board,
  x,
  name = NULL,
  type = NULL,
  title = NULL,
  description = NULL,
  metadata = NULL,
  versioned = NULL,
  ...
  write_args = list()
)

juliasilge avatar Aug 10 '22 20:08 juliasilge

I am on board!

At the risk of being pedantic, I wonder if the argument should be named .write_args, as it comes after the dots - I think this is to help avoid collisions with arg names that could be sent using the dots.

Happy to start on a PR, if you like.

ijlyttle avatar Aug 11 '22 14:08 ijlyttle

I think adding the ability for people to customize how pins are saved will be really useful--but would prefer we separate out the immediate value of supporting uncompressed arrow, from the more high risk / high reward task of how to provide general extension for save / load.

Looking at tools like vetiver, it's clear that how things are saved and how they're loaded are coupled to each other. My concern is that addressing save on its own may work for self-describing formats (e.g. Rds, arrow), but I also expect customizing write to create new surprising behaviors.

If we could empower devs to steward the save/load process (e.g. create a new type that can do w/e it wants with something like write_args), it can help identify who is responsible when extension goes awry (e.g. error hit while using writing "some_extension_type", so we know it is not a pins responsibility to fix).

machow avatar Aug 11 '22 15:08 machow

Holding for now :)

ijlyttle avatar Aug 11 '22 15:08 ijlyttle

@ijlyttle Would you be open to doing a PR where we add a new type, perhaps called "custom" (what could be a better name?) here: https://github.com/rstudio/pins-r/blob/debc94529b8b73cda41ab50419d6276cd6c56d24/R/pin-read-write.R#L163

We would expect folks to write code like pin_write(board, x, "my-name", .f = arrow::write_feather(compression = "uncompressed")). The argument .f would be new, after the dots.

And then the entry in switch() for the new custom type: https://github.com/rstudio/pins-r/blob/debc94529b8b73cda41ab50419d6276cd6c56d24/R/pin-read-write.R#L126-L133 would be something like eval(rlang::call_modify(.f, x, path)). That seems to work pretty well:

library(rlang)

call <- quote(arrow::write_feather(compression = "uncompressed"))
path <- tempfile()
x <- mtcars
new_call <- call_modify(call, x, path)
eval(new_call)
head(arrow::read_feather(path))
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Created on 2022-08-18 with reprex v2.0.2

We maybe should do a bit more careful checking of the call and use something like call_match(new_call, new_call[[1]])? And put a nice error message for users who pass in something that won't work.

What do you think?

juliasilge avatar Aug 19 '22 00:08 juliasilge

I like it! And yes, happy to propose a PR!

Just so I don't forget, I'd want to be able to use pin_read() and have it read it like an Arrow file. This may be a bit ambitious, but I figured I'd get the thought out.

ijlyttle avatar Aug 19 '22 11:08 ijlyttle

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions[bot] avatar Oct 27 '22 00:10 github-actions[bot]