plyranges icon indicating copy to clipboard operation
plyranges copied to clipboard

group mutate revamp

Open ppaxisa opened this issue 1 year ago • 15 comments
trafficstars

When operating on an ungrouped object, mutate will use methods from {plyranges} but when operating on a grouped object, mutate will rely on vanilla tidyverse by converting the object to a tibble. The result is then coerced back to DataFrame to update the metadata of the object. Of note, core columns can be used in this framework to create new metadata columns.

ppaxisa avatar Apr 06 '24 13:04 ppaxisa

thanks @ppaxisa!

I'm just doing a little package cleanup today, I plan to merge in your PR on Wednesday (i'm out of office Mon and Tue)

mikelove avatar Apr 07 '24 12:04 mikelove

Is this a breaking change when a user tries to mutate with a column that is an S4 class (one of the original reasons I didn't use base tidyverse for grouped mutate in plyranges, and why it was slow)? I would prefer an implementation that does allow for S4 columns but understand that maybe this is a rare use case and we are willing to make tradeoffs here. @lawremi do you have thoughts on this?

sa-lee avatar Apr 07 '24 23:04 sa-lee

I have not formally tested that but yes, I believe this would be a breaking change if attempting to use S4 columns on grouped object (S4 columns would still work on ungrouped object).

ppaxisa avatar Apr 08 '24 07:04 ppaxisa

I’ll can do testing on Wednesday (and add unit tests for this case). Maybe there is a solution so we deliver speed when we can but not break on S4. ~Ranges derived from TxDb and EnsDb are common and have S4 columns~

mikelove avatar Apr 08 '24 10:04 mikelove

@sa-lee Pierre-Paul and I are considering:

Upon group_by + mutate:

  • If metadata cols contains any S4 variable, do the original behavior but show a message that a faster operation is possible without S4 variables (either through select or downgrading)
  • Provide a convenience function for downgrading S4 columns
  • Provide the faster dplyr-based mutate only when S4 columns are not present

Thoughts?

mikelove avatar Apr 11 '24 22:04 mikelove

That sounds like a good approach to me. There are vectorised methods for a lot of S4 List classes that group_by() + summarize() takes advantage of, so maybe also worth looking into that code as well to see if anything can be adapted over.

sa-lee avatar Apr 11 '24 23:04 sa-lee

Had some things pop up so didn’t get a chance to implement anything but plan on it later this week or next.

Thinking this will go to next Bioc cycle which is fine — important to get this working well, given the broad impact.

mikelove avatar Apr 16 '24 19:04 mikelove

If it helps: can you give me some pointers to try and write the convenience function that checks if there are S4 columns in the DataFrame? Not sure where to start but I can spend some time to move forward on this.

ppaxisa avatar Apr 18 '24 11:04 ppaxisa

To put a concrete case to the current discussion:

https://gist.github.com/mikelove/d788831af3cf76de642ba03af7a0124b?permalink_comment_id=5042008#gistcomment-5042008

mikelove avatar Apr 30 '24 18:04 mikelove

I have time again to work on this (really!). I'll report back by end of week.

mikelove avatar Apr 30 '24 18:04 mikelove

So I think just a check here that there are no S4 columns is all we need. I can build out the rest.

I was trying to follow the logic of this function and got a little lost. Should it just be an if / else block? Because any(core) and any(!core) aren't complements.

https://github.com/ImmuneAxisa/plyranges/blob/mutate_grp_revamp/R/dplyr-mutate.R#L88-L125

Another ask: because the code went through many iterations, could you make a new branch with your final changes, and can you mark the regions with comments, e.g. # PPA grouped mutate speedup, 2024

mikelove avatar May 03 '24 19:05 mikelove

> d <- DataFrame(x=1:10, y=11:20, z=IntegerList(as.list(1:10)))
> lapply(d, isS4)
$x
[1] FALSE

$y
[1] FALSE

$z
[1] TRUE

> any(sapply(d, isS4))
[1] TRUE
> 

mikelove avatar May 03 '24 19:05 mikelove

Ok I'll review the decision tree for which functions to use, comment that, and create a new branch. This might take a little while, I have limited bandwidth the next 3 weeks. I guess I'll have to close this pull request and submit a new one?

ppaxisa avatar May 07 '24 13:05 ppaxisa

That would be awesome. No time pressure and thanks again for your effort on this.

You can just leave this PR open, and make a new one once you get to it?

mikelove avatar May 07 '24 18:05 mikelove

Just chiming into say I'm happy to review an PRs moving forward.

sa-lee avatar May 08 '24 23:05 sa-lee

It's been a while but I rewrote the groupe mutate to match previous remarks, notably avoid breaking changes when S4 columns are present in metadata. I've written a new version that uses plyranges methods when S4 columns are present and prints a message to inform user of potential speed up if coercion to S3:

mutate.GroupedGenomicRanges <- function(.data, ...) {
  
  dots <- set_dots_named(...)
  check_colnames(names(dots))
  core_cols <- names(dots) %in% c("start", "end", "width", "seqnames", "strand")
  
  # if any S4 columns in mcols use plyranges group mutate
  if(any(sapply(mcols(.data), isS4))) {
    message("metadata contains S4 columns, using plyranges group operations.\nYou can speed up group operations by coercing metadata columns to S3")
    return(mutate_grp(.data, dots))
  }
  # PPA grouped mutate speedup, 2025
  # if mutating core columns, call plyranges group for those
  if(any(core_cols)) {
    .data <- mutate_grp(.data, dots[core_cols])
  }
  # if only core columns in call, return object
  if(all(core_cols)) {
    return(.data)
  } else {
    # use vanilla dplyr group operations on metadata
    mutate_mcols_grp(.data, dots[!core_cols])
  }
}

For more details, the changes are done in a new branch here

I also added some tests to check that S4 columns are preserved if present. This version passes checks:

==> devtools::check(document = FALSE)

══ Building ════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Setting env vars:
• CFLAGS    : -Wall -pedantic -fdiagnostics-color=always
• CXXFLAGS  : -Wall -pedantic -fdiagnostics-color=always
• CXX11FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX14FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX17FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX20FLAGS: -Wall -pedantic -fdiagnostics-color=always
── R CMD build ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/Volumes/workSD1/Pierre-Paul/CRCT/PPA_work/project/plyranges/DESCRIPTION’ ...
─  preparing ‘plyranges’: (421ms)
✔  checking DESCRIPTION meta-information ...
─  installing the package to build vignettes
✔  creating vignettes (56.4s)
─  checking for LF line-endings in source and make files and shell scripts (886ms)
─  checking for empty or unneeded directories
   Removed empty directory ‘plyranges/pkgdown’
─  building ‘plyranges_1.23.2.tar.gz’
   
══ Checking ════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Setting env vars:
• _R_CHECK_CRAN_INCOMING_REMOTE_               : FALSE
• _R_CHECK_CRAN_INCOMING_                      : FALSE
• _R_CHECK_FORCE_SUGGESTS_                     : FALSE
• _R_CHECK_PACKAGES_USED_IGNORE_UNUSED_IMPORTS_: FALSE
• NOT_CRAN                                     : true
── R CMD check ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
─  using log directory ‘/Volumes/workSD1/Pierre-Paul/CRCT/PPA_work/project/plyranges.Rcheck’ (360ms)
─  using R version 4.4.1 (2024-06-14)
─  using platform: x86_64-apple-darwin20
─  R was compiled by
       Apple clang version 14.0.0 (clang-1400.0.29.202)
       GNU Fortran (GCC) 12.2.0
─  running under: macOS Ventura 13.6.9
─  using session charset: UTF-8
─  using options ‘--no-manual --as-cran’ (621ms)
✔  checking for file ‘plyranges/DESCRIPTION’
─  checking extension type ... Package
─  this is package ‘plyranges’ version ‘1.23.2’
─  package encoding: UTF-8
✔  checking package namespace information ...
✔  checking package dependencies (2.2s)
✔  checking if this is a source package
✔  checking if there is a namespace
✔  checking for executable files (2.4s)
✔  checking for hidden files and directories
✔  checking for portable file names ...
✔  checking for sufficient/correct file permissions
─  checking whether package ‘plyranges’ can be installed ... [36s/37s] OK (36.6s)
✔  checking installed package size ...
✔  checking package directory ...
N  checking for future file timestamps ...
   unable to verify current time
✔  checking ‘build’ directory
✔  checking DESCRIPTION meta-information ...
✔  checking top-level files
✔  checking for left-over files ...
✔  checking index information (360ms)
✔  checking package subdirectories (1.2s)
✔  checking code files for non-ASCII characters ...
✔  checking R files for syntax errors ...
✔  checking whether the package can be loaded (10s)
✔  checking whether the package can be loaded with stated dependencies (10s)
✔  checking whether the package can be unloaded cleanly (10s)
✔  checking whether the namespace can be loaded with stated dependencies (9.7s)
✔  checking whether the namespace can be unloaded cleanly (9.8s)
─  checking loading without being on the library search path ... [11s/11s] OK (11.3s)
✔  checking dependencies in R code (19.5s)
✔  checking S3 generic/method consistency (10s)
✔  checking replacement functions (10.4s)
✔  checking foreign function calls (10.6s)
─  checking R code for possible problems ... [48s/49s] OK (48.8s)
N  checking Rd files (606ms)
   checkRd: (-1) group_by-ranges.Rd:57: Lost braces in \itemize; meant \describe ?
   checkRd: (-1) group_by-ranges.Rd:58: Lost braces in \itemize; meant \describe ?
   checkRd: (-1) ranges-anchor.Rd:58: Lost braces in \itemize; meant \describe ?
   checkRd: (-1) ranges-anchor.Rd:59: Lost braces in \itemize; meant \describe ?
   checkRd: (-1) ranges-anchor.Rd:60: Lost braces in \itemize; meant \describe ?
   checkRd: (-1) ranges-anchor.Rd:61-62: Lost braces in \itemize; meant \describe ?
   checkRd: (-1) ranges-anchor.Rd:63-64: Lost braces in \itemize; meant \describe ?
✔  checking Rd metadata ...
✔  checking Rd line widths ...
✔  checking Rd cross-references (392ms)
✔  checking for missing documentation entries (10.1s)
✔  checking for code/documentation mismatches (28.8s)
✔  checking Rd \usage sections (9.8s)
✔  checking Rd contents ...
✔  checking for unstated dependencies in examples (391ms)
✔  checking installed files from ‘inst/doc’ (368ms)
✔  checking files in ‘vignettes’ (370ms)
─  checking examples ... [47s/48s] OK (48.2s)
✔  checking for unstated dependencies in ‘tests’ ...
─  checking tests (508ms)
    [42s/42s] OKthat.R’
   * checking for unstated dependencies in vignettes ... OK
   * checking package vignettes ... OK
   * checking re-building of vignette outputs ... [18s/19s] OK
   * checking for non-standard things in the check directory ... OK
   * checking for detritus in the temp directory ... OK
   * DONE
   
   Status: 2 NOTEs
   See
     ‘/Volumes/workSD1/Pierre-Paul/CRCT/PPA_work/project/plyranges.Rcheck/00check.log’
   for details.
   
── R CMD check results ─────────────────────────────────────────────────────────────────────────────── plyranges 1.23.2 ────
Duration: 6m 7.5s

❯ checking for future file timestamps ... NOTE
  unable to verify current time

❯ checking Rd files ... NOTE
  checkRd: (-1) group_by-ranges.Rd:57: Lost braces in \itemize; meant \describe ?
  checkRd: (-1) group_by-ranges.Rd:58: Lost braces in \itemize; meant \describe ?
  checkRd: (-1) ranges-anchor.Rd:58: Lost braces in \itemize; meant \describe ?
  checkRd: (-1) ranges-anchor.Rd:59: Lost braces in \itemize; meant \describe ?
  checkRd: (-1) ranges-anchor.Rd:60: Lost braces in \itemize; meant \describe ?
  checkRd: (-1) ranges-anchor.Rd:61-62: Lost braces in \itemize; meant \describe ?
  checkRd: (-1) ranges-anchor.Rd:63-64: Lost braces in \itemize; meant \describe ?

0 errors ✔ | 0 warnings ✔ | 2 notes ✖

R CMD check succeeded

I post here but maybe I should open a new pull request with this new branch? Let me know what you think!

ppaxisa avatar Mar 13 '25 12:03 ppaxisa

I kind of like the idea of opening a new PR with your new code, just to keep the git history simpler. You can mention #109 in a comment to link the two.

Thanks so much Pierre-Paul, I'll check it out ASAP. Let me also see about fixing GH Actions for plyranges, seems to be broken bc of something else.

mikelove avatar Mar 13 '25 13:03 mikelove