SummarizedExperiment icon indicating copy to clipboard operation
SummarizedExperiment copied to clipboard

derived RSE to SE coercion discards names

Open LTLA opened this issue 6 years ago • 15 comments

Coercing a subclass of an RSE to an SE discards the row names:

library(SummarizedExperiment)
example(SummarizedExperiment, echo=FALSE)

setClass("MyExperiment", contains="RangedSummarizedExperiment")
me <- new("MyExperiment", se)

rownames(me) <- paste0("GENE_", seq_len(nrow(me)))
se2 <- as(me, 'SummarizedExperiment')
rownames(se2)
## NULL

This doesn't seem to happen with just the RSE to SE coercion, oddly enough.

rownames(se) <- paste0("GENE_", seq_len(nrow(se)))
se3 <- as(se, 'SummarizedExperiment')
rownames(se3)
## Stuff comes out
Session information
R version 3.6.0 Patched (2019-05-10 r76483)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Users/luna/Software/R/R-3-6-branch-dev/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-3-6-branch-dev/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] SummarizedExperiment_1.15.5 DelayedArray_0.11.4
 [3] BiocParallel_1.19.0         matrixStats_0.54.0
 [5] Biobase_2.45.0              GenomicRanges_1.37.14
 [7] GenomeInfoDb_1.21.1         IRanges_2.19.10
 [9] S4Vectors_0.23.17           BiocGenerics_0.31.5

loaded via a namespace (and not attached):
 [1] lattice_0.20-38        bitops_1.0-6           grid_3.6.0
 [4] zlibbioc_1.31.0        XVector_0.25.0         Matrix_1.2-17
 [7] tools_3.6.0            RCurl_1.95-4.12        compiler_3.6.0
[10] GenomeInfoDbData_1.2.1

LTLA avatar Jul 23 '19 21:07 LTLA

I can confirm that I hit this issue with my objects that extend RSE and also with SCE, when coercing back to SE. Here's what I'm currently using as a workaround, but this would be great to have fixed in the SummarizedExperiment package in Bioc 3.10:

https://github.com/acidgenomics/transformer/blob/master/R/coerce-SummarizedExperiment-S3methods.R

mjsteinbaugh avatar Jul 26 '19 12:07 mjsteinbaugh

@LTLA See also issue #13

mjsteinbaugh avatar Jul 26 '19 12:07 mjsteinbaugh

Curious about the comment in that file:

We can't set S4 coercion methods here because RangedSummarizedExperiment and SummarizedExperiment aren't local, modifiable.

Classes shouldn't need to be local nor modifiable to set methods on them. Perhaps I'm misunderstanding?

lawremi avatar Jul 26 '19 16:07 lawremi

@lawremi I hit that error attempting to update the S4 coerce (for as()) methods, which I hadn't seen before, so I'm using the S3 as.SummarizedExperiment() as a workaround until we can update the SummarizedExperiment package. I'll see if I can reproduce it and I'll post the error here in the thread.

mjsteinbaugh avatar Jul 26 '19 16:07 mjsteinbaugh

Ok. Redefining the coerce() methods is probably not a great idea, but I was just surprised that you would get an error about the classes.

lawremi avatar Jul 26 '19 17:07 lawremi

Some detective work indicates that this is caused by an auto-generated coerce method:

> selectMethod("coerce", c("MyExperiment", "SummarizedExperiment"))
Method Definition:

function (from, to = "SummarizedExperiment", strict = TRUE)
if (strict) {
    from <- {
        class(from) <- "RangedSummarizedExperiment"
        from
    }
    {
        value <- new("SummarizedExperiment")
        for (what in c("colData", "assays", "NAMES", "elementMetadata",
        "metadata")) slot(value, what) <- slot(from, what)
        value
    }
} else from
<bytecode: 0x7fe66c5f6c38>
<environment: namespace:SummarizedExperiment>

Signatures:
        from           to
target  "MyExperiment" "SummarizedExperiment"
defined "MyExperiment" "SummarizedExperiment"

At least, I assume it's auto-generated, this doesn't show up anywhere in the SE codebase.

The problem is presumably that elementMetadata is empty when the rowData for an RSE subclass is stored in the mcols of the rowRanges. This is handled by the RSE -> SE coerce but not by the auto-generated methods that coerce RSE subclasses towards an SE.

It seems that much suffering could be alleviated by having the rowData for an RSE live in the elementMetadata, and then slapping it onto the rowRanges on request. If users assign a value to rowRanges, just overwrite the elementMetadata with mcols(value). All of the class conversions would then work as expected and users would be none the wiser.

Right now, all RSE subclass authors are obliged to write setAs methods to handle the coercion back to an SE. That seems like something we should be getting for free.

LTLA avatar Apr 15 '20 01:04 LTLA

This was considered but IIRC was discarded because of performance considerations. This was at least 7-8 years ago so I don't remember the details but I think that having the rowRanges() getter slap the elementMetadata onto the rowRanges each time it's called turned out to be somewhat costly at the time, especially in the particular case of VCF objects when used in typical workflows. I wish I had kept track of what guided the current design (and I generally try to keep these things in the form of comments in my code) but I don't seem to be able to find anything in that case, sorry.

Anyway, we can always revisit this and re-evaluate the impact on performance. But it's not a light change.

In the mean time, yes, RangedSummarizedExperiment subclasses need to explicitly define a coercion method to SummarizedExperiment to avoid the broken automatic coercion method to get in the way. Should just be:

setAs("MyExperiment", "SummarizedExperiment",
    function(from) as(as(from, "RangedSummarizedExperiment"), "SummarizedExperiment")
)

This should probably be added to the Extending the SummarizedExperiment class vignette.

Maybe one way to alleviate the suffering with automatic coercion methods in general would be to revisit the mechanics behind them. For example since there is an explicit coercion method to go from RangedSummarizedExperiment to SummarizedExperiment, the automatic coercion method from MyExperiment to SummarizedExperiment should acknowledge that and use it internally instead of blindly copying slots.

hpages avatar Apr 15 '20 05:04 hpages

Just spent the last hour trying that. It's even worse than I expected, I can't override the setAs even if I wanted. Consider the following example (same as suggested above):

setAs("SingleCellExperiment", "SummarizedExperiment", function(from) {
    as(as(from, 'RangedSummarizedExperiment'), 'SummarizedExperiment')
})

This seems to break how S4 dispatches to methods of base classes. For example, the show method now outputs this nonsense:

> X <- SingleCellExperiment::SingleCellExperiment()
> X
An object of class "SingleCellExperiment"
Slot "int_elementMetadata":
DataFrame with 0 rows and 0 columns

Slot "int_colData":
DataFrame with 0 rows and 2 columns

Slot "int_metadata":
$version
[1] ‘1.9.4’

$spike_names
character(0)

$size_factor_names
character(0)


Slot "rowRanges":
GRangesList object of length 0:
<0 elements>

Slot "colData":
DataFrame with 0 rows and 0 columns

Slot "assays":
NULL

Slot "NAMES":
NULL

Slot "elementMetadata":
DataFrame with 0 rows and 0 columns

Slot "metadata":
list()

reducedDimNames(0):
altExpNames(0):

And the fun never ends:

> library(SingleCellExperiment)
> Y <- SingleCellExperiment(matrix(0, 10, 10))
> class(Y)
[1] "SingleCellExperiment"
attr(,"package")
[1] "SingleCellExperiment"
> Y[1:10,]
class: SummarizedExperiment 
dim: 10 10 
metadata(0):
assays(1): ''
rownames: NULL
rowData names(0):
colnames: NULL
colData names(0):

Twilight zone stuff. Running on R-devel, r78035, if that makes any difference.

LTLA avatar Apr 15 '20 06:04 LTLA

Ok I see. It looks like defining this coercion changes the class hierarchy. Before defining the coercion:

> showClass("SingleCellExperiment")
Class "SingleCellExperiment" [package "SingleCellExperiment"]

Slots:
                                                                
Name:           int_elementMetadata                  int_colData
Class:                    DataFrame                    DataFrame
                                                                
Name:                  int_metadata                    rowRanges
Class:                         list GenomicRanges_OR_GRangesList
                                                                
Name:                       colData                       assays
Class:                    DataFrame               Assays_OR_NULL
                                                                
Name:                         NAMES              elementMetadata
Class:            character_OR_NULL                    DataFrame
                                   
Name:                      metadata
Class:                         list

Extends: 
Class "RangedSummarizedExperiment", directly
Class "SummarizedExperiment", by class "RangedSummarizedExperiment", distance 2
Class "Vector", by class "RangedSummarizedExperiment", distance 3
Class "Annotated", by class "RangedSummarizedExperiment", distance 4
Class "vector_OR_Vector", by class "RangedSummarizedExperiment", distance 4

After defining the coercion:

> showClass("SingleCellExperiment")
Class "SingleCellExperiment" [package "SingleCellExperiment"]

Slots:
                                                                
Name:           int_elementMetadata                  int_colData
Class:                    DataFrame                    DataFrame
                                                                
Name:                  int_metadata                    rowRanges
Class:                         list GenomicRanges_OR_GRangesList
                                                                
Name:                       colData                       assays
Class:                    DataFrame               Assays_OR_NULL
                                                                
Name:                         NAMES              elementMetadata
Class:            character_OR_NULL                    DataFrame
                                   
Name:                      metadata
Class:                         list

Extends: 
Class "RangedSummarizedExperiment", directly
Class "SummarizedExperiment", directly, with explicit coerce
Class "Vector", by class "SummarizedExperiment", distance 2, with explicit coerce
Class "Annotated", by class "SummarizedExperiment", distance 3, with explicit coerce
Class "vector_OR_Vector", by class "SummarizedExperiment", distance 3, with explicit coerce

So it looks like the distance between SingleCellExperiment and SummarizedExperiment is no longer 2 but that now the former extends the latter directly (and "with explicit coerce"). Not too surprisingly such a change to the inheritance tree breaks things. In particular it seems that it breaks callNextMethod() which is why your show() method no longer works as expected (callNextMethod() is no longer able to find the next method so defaults to the method for S4 objects). Welcome to the dark side of the S4 class system!

Maybe @lawremi can provide some guidance?

hpages avatar Apr 15 '20 07:04 hpages

Will get to it after I fix a more critical bug.

lawremi avatar Apr 15 '20 23:04 lawremi

While that happens...

This was at least 7-8 years ago so I don't remember the details but I think that having the rowRanges() getter slap the elementMetadata onto the rowRanges each time it's called turned out to be somewhat costly at the time, especially in the particular case of VCF objects when used in typical workflows.

I don't know how it was back then, but it seems like now it's just a direct slot replacement for the elementMetadata. One could even disable the validity checks for extra performance, given that correct length is guaranteed by the coexistence of the DF and GR(L) in an SE in the first place.

I'd find it hard to believe that a slot replacement would be more expensive than the dimnames copying that assays() is doing routinely now, given that changing the dimnames for an ordinary matrix seems to create a new copy of the entire thing.

LTLA avatar Apr 16 '20 05:04 LTLA

Yes maybe. Best way to know is to try. Things have changed a lot in 8 years.

A little bit of history: 8 years ago we didn't have SingleCellExperiment. VCF objects were the big SummarizedExperiment derivatives of that time so they drove some design decisions when we refactored the SummarizedExperiment container. Of course we didn't want the refactoring to slow them down.

The original SummarizedExperiment container was a RangedSummarizedExperiment i.e. it had to have a GRanges or GRangesList object on it. The refactoring consisted in renaming it -> RangedSummarizedExperiment, introducing a lighter flavor of it (what is now SummarizedExperiment) i.e. a flavor that doesn't need to have a GRanges or GRangesList on it, and introducing the rowData() accessor. And the fun wouldn't have been complete if we didn't also have to move all the implementation from the GenomicRanges package to a new dedicated SummarizedExperiment package and fix hundreds of serialized SummarizedExperiment instances! It was so intense that 8 years later my brain still blackouts on some important details of what happened exactly ;-)

hpages avatar Apr 16 '20 09:04 hpages

Yes, I remember it well, there was all that SummarizedExperiment0 business.

Gee, 8 years. I was but a callow grad student. Brings back memories.

LTLA avatar Apr 17 '20 20:04 LTLA

Seems like just yesterday to me. But anyway, the methods package is behaving as designed in this situation. Calling setAs() when from derives from to results in a direct relationship defined by the coercion. This isn't documented anywhere AFAICT but it's clear the code was meant to work that way. setIs() is the weirdest part of S4, and that's saying something.

I like Aaron's solution of keeping the metadata in @elementMetadata instead of @rowRanges@elementMetadata. It seems way simpler and performance should not be an issue ever since the shallow copy patch.

lawremi avatar Apr 17 '20 22:04 lawremi

Thanks Michael. I kind of had a premonition that this was actually going to be considered a "feature".

It still doesn't really explain why this breaks callNextMethod() though. Mimicking the SummarizedExperiment/RangedSummarizedExperiment/SingleCellExperiment hierarchy with classes A, B, and C:

setClass("A", slots=c(stuff="ANY"))
setMethod("show", "A", function(object) cat("Hi, I'm the show#A method\n"))
setClass("B", contains="A")
setClass("C", contains="B")
setAs("C", "A", function(from) {class(from) <- "A"; from})
setMethod("show", "C", function(object) {
    callNextMethod()
    cat("Hi, I'm the show#C method\n")
})

Then I get:

new("C")
# An object of class "C"
# Slot "stuff":
# NULL
# 
# Hi, I'm the show#C method

which is what happens with SingleCellExperiment objects after defining the explicit coercion from SingleCellExperiment to SummarizedExperiment.

HOWEVER, the same inheritance tree can also be achieved by having C explicitly contain A from the very beginning:

setClass("A", slots=c(stuff="ANY"))
setMethod("show", "A", function(object) cat("Hi, I'm the show#A method\n"))
setClass("B", contains="A")
setClass("C", contains=c("B", "A"))
setMethod("show", "C", function(object) {
    callNextMethod()
    cat("Hi, I'm the show#C method\n")
})

But in this case show() works as expected i.e. callNexMethod() seems to be able to find the show#A method:

new("C")
# Hi, I'm the show#A method
# Hi, I'm the show#C method

hpages avatar Apr 18 '20 00:04 hpages