dplyr icon indicating copy to clipboard operation
dplyr copied to clipboard

`dplyr_col_modify()` and recycling

Open lionel- opened this issue 4 years ago • 7 comments

I think this call should be in the generic before dispatch, rather than in the data.frame method:

  cols <- vec_recycle_common(!!!cols, .size = nrow(data))

Too late for 1.0 but worth considering for 1.1.

lionel- avatar May 26 '20 07:05 lionel-

It might be easier for subclasses to call NextMethod() in their dplyr_col_modify() and dplyr_row_slice() methods if these were not calling dplyr_reconstruct() at the end.

Maybe the reconstructing should be done in the default method instead:

#' @export
dplyr_col_modify.default <- function(data, cols) {
  out <- dplyr_col_modify.data.frame(data, cols)
  dplyr_reconstruct(out, data)
}

Edit: Hmm that won't work, the default method will never be called of course. Not sure how to set that up. Maybe by inspecting .Class to figure out whether we're called from NextMethod().

Edit2: Related, nest_join() calls dplyr_reconstruct() after dplyr_col_modify(). It is thus called twice with data frames. Edit3: Ah no, col-modify is called with a tibble in this function, so it's still needs to be reconstructed. May need a comment to make this clearer.

lionel- avatar May 26 '20 07:05 lionel-

Is this still needed @lionel- ?

romainfrancois avatar May 19 '21 07:05 romainfrancois

The first problem is whether we want to enforce some properties for all classes, such as recycling. This would have the benefit of consistency and of requiring less work from methods. To do this, we could move recycling in the generic.

The other issue is that currently the data frame method is doing a lot of important work but IIUC there is no easy way to call it from another method because it's doing all the reconstructing. This prevents this pattern:

  1. Pre work/checks
  2. NextMethod (eventually hits the data frame method)
  3. Post work

Now maybe the third step isn't needed in practice, i.e. the first and second steps are sufficient to create a data frame that is ready to be reconstructed. In this case reconstructing in the data frame method is fine.

These issues make sense to me but I have never implemented tibble subclasses so I might be overthinking it. WDYT @DavisVaughan?

In any case feel free to close this for now if this issue is in the way, and maybe link to it from a meta-issue about genericity if there are other similar items?

lionel- avatar May 19 '21 08:05 lionel-

I have never actually needed dplyr_col_modify() (dplyr_reconstruct() was always enough for me), but tsibble does. They call the data frame method by converting the tsibble input to a tibble and re calling dplyr_col_modify(). Then they patch up the result after the fact with their own requirements. https://github.com/tidyverts/tsibble/blob/ba2053e32d32f9ef23f6bf95df7da9d2106be6de/R/dplyr-verbs.R#L219-L220

DavisVaughan avatar May 19 '21 12:05 DavisVaughan

As for setting it up, like mentioned above, could we do something like this? Like we do in dplyr_reconstruct() and in vctrs?

dplyr_col_modify <- function(data, cols) {
  out <- dplyr_col_modify_dispatch(data, cols)
  return(dplyr_reconstruct(out, data))
  UseMethod("dplyr_col_modify") # appease R CMD check
}

dplyr_col_modify_dispatch <- function(data, cols) {
  UseMethod("dplyr_col_modify")
}

dplyr_col_modify.data.frame <- function(data, cols) {
  # ...
  # would no longer call `dplyr_reconstruct()` at the end
}

I do see how it would be useful to separate the guts of dplyr_col_modify.data.frame() from the dplyr_reconstruct() call at the end of it, as that would make it easier for packages like tsibble to call NextMethod() because they could expect the result to be a bare data frame that they can continue to work on, rather than something that has been attempted to be reconstructed to their class already. I do think it is important to automatically call dplyr_reconstruct() once at the end of the generic though, because I have a feeling that is easy to forget.

DavisVaughan avatar May 19 '21 12:05 DavisVaughan

yup I like that setup. It also gives flexibility to apply some operations, such as recycling, in the generic rather than in the data frame method. With this setup though, there is less reason to worry about it because most methods would call NextMethod() and get the recycling anyway.

lionel- avatar May 19 '21 13:05 lionel-

If we make this change, are we going to need to get subclasses to update their methods? If so, I think that will push this over the line of too much work for not enough gain.

hadley avatar Aug 01 '22 13:08 hadley

I've skimmed current dplyr_col_modify() implementations and the usual approach is to wrap around NextMethod() which is perfectly fine.

I think it'd help some implementations to recycle in the generic. On the other hand it seems that some methods act as proxies and in this case recycling in the generic won't work as expected. I'll just close this since it's not worth any breaking change.

lionel- avatar Sep 27 '22 13:09 lionel-