S4Vectors Native support for groupInfo slot

I propose enabling native support for group information and submit a draft implementation.

One of the advantages of dplyr and tibble is support for groupwise operations. dplyr stores a tbl_df of group information; grouping columns with the non-empty combinations of levels (named by their respective column name in the data) and .rows which is a list of rownames to which the grouping results. e.g. in dplyr:

group_data(group_by(mtcars, am, gear))
# A tibble: 4 x 3
     am  gear .rows     
  <dbl> <dbl> <list>    
1     0     3 <int [15]>
2     0     4 <int [4]> 
3     1     4 <int [8]> 
4     1     5 <int [5]>

This approach has the drawback (in my opinion) that this listing of .rows must be updated with every operation which alters the number of rows (including [, filter, slice, rbind, etc... ). An alternative (and the approach used here) is to store only the names of the columns which should be involved in further groupwise operations.

str(DataFrame(mtcars[1:4,1:4], groupInfo = c("am", "gear")))
Formal class 'DFrame' [package "S4Vectors"] with 7 slots
  ..@ rownames       : chr [1:4] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
  ..@ nrows          : int 4
  ..@ groupInfo      : chr [1:2] "am" "gear"
  ..@ listData       :List of 4
  .. ..$ mpg : num [1:4] 21 21 22.8 21.4
  .. ..$ cyl : num [1:4] 6 6 4 6
  .. ..$ disp: num [1:4] 160 160 108 258
  .. ..$ hp  : num [1:4] 110 110 93 110
  ..@ elementType    : chr "ANY"
  ..@ elementMetadata: NULL
  ..@ metadata       : list()

I provide accessors and replacement functions for this slot, but leave any processing of the data to whichever code can make use of the information. I have added dplyr consistency only in the creation of a DataFrame from an already grouped tbl_df or conversion back to one

groupInfo(as(group_by(mtcars, am, gear), "DataFrame"))
[1] "am"   "gear"

This way, methods which act on individual columns of a DataFrame can delegate how to perform the groupwise action. Where this may be of significant benefit is e.g. plyranges (ping: @sa-lee) for which a GRanges column may be better suited to perform the groupwise operation internally rather than sliced into rows.

I have tested the implementation in this PR with a branch of DFplyr which aims to extend dplyr support to DataFrame: https://github.com/jonocarroll/DFplyr/tree/native_groupInfo and as far as I can tell this is successful.

This PR is intended as a draft and I welcome feedback. My implementation may not be ideal but hopefully serves as a starting point.

Feb 28 '20 14:02 jonocarroll

Thanks for keeping the ball rolling on this.

The reason they store the subscripts, of course, is that the grouping only needs to be computed once. In terms of downsides, besides the updating overhead that you mentioned, the subscripts also take up memory. If we did store the subscripts, we would want a compressed list, conceptually equivalent to the return value of base::grouping().

Feb 28 '20 21:02 lawremi

Oh my! I hope you realize that changing the internal representation of DFrame objects will break all the serialized instances that exist around so will probably break hundreds of Bioconductor packages and a few dozens CRAN packages. Plus of course many standalone scripts that people keep around just in case they need to re-run their analyses. So this cannot be made lightly ;-)

If you want to pursue that route you would need to check at least a dozen of other Bioconductor infrastructure packages that sit on top of S4Vectors (IRanges, GenomicRanges, Biostrings, SummarizedExperiment, etc... see S4Vectors/inst/doc/HTS_core_package_stack.txt) and provide PRs that fix the broken ones.

Note that a much less disruptive route would be to use a metadata column to keep track of the grouping columns:

DF <- DataFrame(mtcars[1:4, ])
mcols(DF)$.grouping <- colnames(DF) %in% c("am", "gear")

DF
# DataFrame with 4 rows and 11 columns
#                      mpg       cyl      disp        hp      drat        wt
#                <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
# Mazda RX4           21.0         6       160       110      3.90     2.620
# Mazda RX4 Wag       21.0         6       160       110      3.90     2.875
# Datsun 710          22.8         4       108        93      3.85     2.320
# Hornet 4 Drive      21.4         6       258       110      3.08     3.215
#                     qsec        vs        am      gear      carb
#                <numeric> <numeric> <numeric> <numeric> <numeric>
# Mazda RX4          16.46         0         1         4         4
# Mazda RX4 Wag      17.02         0         1         4         4
# Datsun 710         18.61         1         1         4         1
# Hornet 4 Drive     19.44         1         0         3         1

mcols(DF)
# DataFrame with 11 rows and 1 column
#      .grouping
#      <logical>
# mpg      FALSE
# cyl      FALSE
# disp     FALSE
# hp       FALSE
# drat     FALSE
# wt       FALSE
# qsec     FALSE
# vs       FALSE
# am        TRUE
# gear      TRUE
# carb     FALSE

Another benefit of this approach is that you don't need to modify extractCOLS() to make it aware of the new_groupInfo slot because extractCOLS() will do the right thing out-of-the box.

I don't think it's worth changing the interface of the DataFrame() constructor either. Having accessors to set/get the grouping columns should be enough.

About naming groupInfo is too vague. I would suggest being explicit about the fact that you are recording the grouping columns. This is important to avoid any confusion with the existing grouping mechanism used for CompressedList derivatives where grouping of the rows is recorded via a Partitioning object (the CompressedList class is implemented in the IRanges package).

About coding style

Please touch only the parts you need to touch so when we look at your PR we don't get distracted by your attempts at improving/reorganizing existing things that are not related to your proposal. So for example, instead of:

-         representation(
-                        rownames = "character_OR_NULL",
-                        nrows = "integer"
-                        ),
-         prototype(rownames = NULL,
-                   nrows = 0L,
-                   listData = structure(list(), names = character())),
+         slots = c(rownames = "character_OR_NULL",
+                   nrows = "integer",
+                   groupInfo = "character_OR_NULL"),

your change should just be:

         representation(
                        rownames = "character_OR_NULL",
-                       nrows = "integer"
+                      nrows = "integer",
+                      groupInfo = "character_OR_NULL"
                        ),

Really?

if (!is.null(groupInfo)) {
  groups <- groupInfo
} else {
  groups <- NULL
}

The way I generally solve this problem is by doing

groups <- groupInfo

Feb 28 '20 21:02 hpages

Thanks for the speedy feedback, and apologies for my lack of expertise in S4 constructs.

I was under the (evidently incorrect) belief from conversations with @lawremi (which I have clearly misunderstood) that a new slot could be added for the group information with a prototype of NULL and not break existing structures. Breakage of anything is not my intention at all; this should be an additional piece of stored information where it can be added but default to NULL for all other instances. Testing with an object serialized from the master branch does not work in this PR so clearly changes are required.

My (excessive) changes to the setClass were from trying to add a slots argument which does not play nice with the representation argument (deprecated from R3.0.0)

Argument "representation" cannot be used if argument "slots" is supplied

but going in this direction was purely due to my lack of understanding of how to extend the class.

I'm embarrassed that in my many refactorings of my attempt I overlooked a trivial simplification. I can only hope that you'll believe me if I say it didn't start out like that.

On to the meat of it - I'm fine with using mcols for this purpose. It seems to solve all the immediate issues. Part of the reason for looking to make a change in this repo was that I was hoping to find a canonical place to store the grouping information - my current master branch of DFplyr stores this in an attribute of listData but the two biggest issues with that were:

show does not know about this. Having a canonical place to store this information means show can be modified in this repo to check for it and display grouping information (as per this PR)
There was no enforced contract for anyone else to use this space to store grouping information, so it was unlikely to become a standard. This means everyone needs to write their own getters and setters.

I used the name groupInfo because originally I was going to follow dplyr's methodology of storing the unique values of the columns and the corresponding rows. With a reduction to just the columns I agree the name should change. I didn't use grouping as I intended the getter to be named the same and this would conflict with base::grouping.

I'll clean this PR up to a) use the simpler change to setClass and b) move "grouping" to mcols(x) then we can hopefully discuss further. Thank you.

Feb 28 '20 23:02 jonocarroll

Serialization should not be a problem, since we're adding an "optional" slot, but using the mcols()would be a good way to prototype these ideas before making changes that would be difficult to reverse.

Feb 28 '20 23:02 lawremi

@jonocarroll Sounds good. If you move "grouping" to mcols(x) you don't need to touch setClass at all.

@lawremi After adding a slot all the serialized instances become invalid. So it's serious business, even if the slot is "optional". In any case, an updateObject() method would be a must.

This is a good opportunity for me to stress again the importance of adding a mechanism in R that will allow us to automatically fix out-of-sync serialized S4 instances at load time. It's very easy to do.

Feb 29 '20 00:02 hpages

Implementation up to date in DFplyr: https://github.com/jonocarroll/DFplyr/tree/native_groupInfo

I've removed any changes to the object or constructors. I've added a bit of safety checking in how groupCols() updates. I'm not too fussed about the naming, so if these names aren't ideal we can change them.

Feb 29 '20 04:02 jonocarroll

@hpages How is the patch coming along for object updating? Good to hear that it's easy, because it was giving Gabe major headaches.

Feb 29 '20 15:02 lawremi

Good to hear that it's easy, because it was giving Gabe major headaches.

Well the approach I'm proposing is simple and should be straightforward to implement. Still waiting for the green light to start working on a patch. Sounds like maybe I should just do it and submit to the Bugzilla tracker.

Feb 29 '20 20:02 hpages

Hi @jonocarroll,

Looks like in the end you implemented the grouping feature and other dplyr-like features in DFplyr, which sounds like the way to go. I'm a big fan of this kind of compartmentalized approach where a dedicated package takes full responsibility for implementing/supporting a well-defined and well-focused set of features. I particularly appreciate the fact that the DataFrame implementation in S4Vectors can remain completely agnostic about it :wink:

One thing I noted is that in DFplyr you chose to pre-calculate the grouping and to store it in the metadata() part of the DataFrame, instead of just keeping track of the columns that participate to the grouping in the mcols() part of the DataFrame. Not sure what was the motivation for doing so but note that a major problem with this approach is that the pre-calculated grouping becomes stale as soon as one subsets the DataFrame or reorders its rows:

library(DFplyr)
library(S4Vectors)
m <- mtcars[, c("cyl", "hp", "am", "gear", "disp")]
d <- as(m, "DataFrame")
d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:32, width = 10))
d$grY <- GenomicRanges::GRanges("chrY", IRanges::IRanges(1:32, width = 10))
d$nl <- IRanges::NumericList(lapply(d$gear, function(n) round(rnorm(n), 2)))

dd <- group_by(d, cyl, am)
metadata(dd)
# $groups
#   cyl am        .rows
# 1   4  0     9, 21, 8
# 2   4  1 3, 20, 1....
# 3   6  0 11, 4, 6, 10
# 4   6  1     1, 2, 30
# 5   8  0 5, 13, 1....
# 6   8  1       29, 31

dd2 <- dd[20:11, ]
metadata(dd2)  # stale groups
# $groups
#   cyl am        .rows
# 1   4  0     9, 21, 8
# 2   4  1 3, 20, 1....
# 3   6  0 11, 4, 6, 10
# 4   6  1     1, 2, 30
# 5   8  0 5, 13, 1....
# 6   8  1       29, 31

This means that dd2 is basically a broken object. Any "group-aware" operation will fail on it:

tally(dd2, gear)
# Error: subscript contains out-of-bounds indices

Things get even worse if all the rows are kept but are reordered:

dd3 <- dd[32:1, ]
tally(dd3, gear)
# DataFrame with 6 rows and 3 columns
#         cyl        am         n
#   <numeric> <numeric> <numeric>
# 1         4         0         9
# 2         4         1        27
# 3         6         0        16
# 4         6         1        13
# 5         8         0        46
# 6         8         1         7

In this case, things seem to work but the result is wrong!

As previously discussed here, using a metadata column to mark the columns that participate to the grouping avoids this problem.

Anyways, is it ok to close this PR?

Thanks, H.

May 07 '25 06:05 hpages

Yes, this PR can be closed, but I'll link a new issue in DFplyr. Many thanks for the feedback.

May 07 '25 07:05 jonocarroll

S4Vectors S4Vectors copied to clipboard

Native support for groupInfo slot

S4Vectors
S4Vectors copied to clipboard