tidySingleCellExperiment Add conditional print method for objects that contain alternative experiments

Hi Stefano,

I've now made the printing conditional and have amended the show() unit test to assure that the default print method works. For the unit test I've added a random normcounts matrix to the existing pbmc_small sce object and named it ADT When printed to the console the header now looks like this:

   > pbmc_small
    `# A SingleCellExperiment-tibble abstraction: 80 × 17
    `# Features=230 | Cells=80 | Assays=counts, logcounts, ADT-normcounts

The print method takes any number of alternative experiments and appends the name to the assayNames contained therein, separated by a hyphen. I think this looks the tidiest. I hope you agree. If no alternative experiments exist, the print output is as before.

I'll start working on how to pull features from the alternative experiment slots into the tibble with join_features.

Aug 29 '23 06:08 Biomiha

Hi Stefano,

I've now made the printing conditional and have amended the show() unit test to assure that the default print method works. For the unit test I've added a random normcounts matrix to the existing pbmc_small sce object and named it ADT When printed to the console the header now looks like this:
   > pbmc_small
    `# A SingleCellExperiment-tibble abstraction: 80 × 17
    `# Features=230 | Cells=80 | Assays=counts, logcounts, ADT-normcounts
The print method takes any number of alternative experiments and appends the name to the assayNames contained therein, separated by a hyphen. I think this looks the tidiest. I hope you agree. If no alternative experiments exist, the print output is as before.

I'll start working on how to pull features from the alternative experiment slots into the tibble with join_features.

Amazing!

So if I can imagine a complex SCE with 3 expriments

base: counts, logcounts
ADT: counts, logcounts
Hashtag demultiplex: counts, logcounts

we would have

`# A SingleCellExperiment-tibble abstraction: 80 × 17
`# Features=230 | Cells=80 | Assays=counts, logcounts, ADT-counts, ADT-logcounts, Hashtag demultiplex-counts, Hashtag demultiplex-logcounts

Am I interpreting correctly?

Aug 29 '23 06:08 stemangiola

Yes, exactly. When I tested on a more complicated example I noticed that it creates a new line if the maximum line width is exceeded, but it still prints everything.

Aug 29 '23 07:08 Biomiha

Yes, exactly.

Your solution looks great, and very intuitive, especially if experiment names are short.

Just to explore all possibilities, could you think about a solution that avoids the repetition of experiment names, so as to increase the scalability of our printout to many long experiment names?

Aug 29 '23 07:08 stemangiola

Printing the name of the experiment instead, followed by the assayNames and separated by a semi-colon is easy enough. That would look something like:

     `# A SingleCellExperiment-tibble abstraction: 7,865 × 3
     `# Features=33538 | Cells=7865 | Assays=base: counts, logcounts; Antibody Capture: counts, logcounts; Hashtag demultiplex: counts, logcounts

However, like I said in my (edited) comment above, the native tibble print method will break lines if the character count gets too long (I can actually see a difference in how it's printed on my laptop vs. my wide-screen display). I think it's a matter of personal preference really. I find the colons and semi-colons a bit much and prefer the wider print but we can ask others and find a consensus. I also think we would need to find another name for base because it implies the base R vs. tidy R semantics, whereas here it's actually the main experiment and alternative experiments (and I find even this nomenclature pretty bad to be honest).

Aug 29 '23 07:08 Biomiha

I find the colons and semi-colons a bit much and prefer the wider print but we can ask others and find a consensus.

I like the idea. Let's present both an "easy" and "hard" scenario to the community.

I also think we would need to find another name for base because it implies the base R vs. tidy R semantics

Agree.

I think slack Bioc tidy is the best place where to ask our community. But feel free to use any tool.

Aug 29 '23 07:08 stemangiola

@Biomiha Thinking about it. You can leave it as is; maybe edit the example in the unit test to be the worst-case scenario. If it looks good in the worst-case scenario, there is no reason to change for the time being.

And instead, you can focus on the features such as join_feature, aggregate_cells, etc...

Aug 29 '23 12:08 stemangiola

Okay, I've now mocked up the expression matrices for the additional alternative experiments and added that to the unit tests. In my hands it works as expected. Will move on to joining features next.

Aug 30 '23 08:08 Biomiha

So, I think the join_features method is good to go. The way I envisaged it is that you can pick any feature for the long format and it automatically pulls the abundance for all the assays in that particular experiment. So if the features are from the ADT altExp it will just silently pull the values from all assays in the ADT altExp. The one thing I chose not to allow is the mixing of features from different experiments, which I think should be avoided for other reasons. For the wide shape, you obviously need to pick which assay to pull from and that makes things pretty straightforward. To avoid pulling assays with same name from different experiments I have chosen to use the same nomenclature that I used in the print method, e.g. logcounts, ADT-logcounts or HTO-logcounts. I have noticed however that there is a name.repair argument somewhere because at least for me features that have an illegal character get it changed to a "." in the wide format, for example my mock Ab features Ab-1 and Ab-2, etc... get changed to Ab.1, Ab.2, ... I think that could prove problematic because there are quite a few gene names that contain hyphens (like all mithchondrial genes).

Sep 02 '23 16:09 Biomiha

Well done!

The one thing I chose not to allow is the mixing of features from different experiments, which I think should be avoided for other reasons.

I would think we can allow this. Imagine plotting the abundance of a transcript and the relative protein. Or is a gene has the same number for the transcript and the protein, we should pull both.

I have noticed however that there is a name.repair argument somewhere because at least for me features that have an illegal character get it changed to a "." in the wide format, for example my mock Ab features Ab-1 and Ab-2, etc... get changed to Ab.1, Ab.2, ... I think that could prove problematic because there are quite a few gene names that contain hyphens (like all mithchondrial genes).

Can you please make an example? And why this would be problematic?

Please also have a look to the conflicts.

Sep 09 '23 14:09 stemangiola

I would think we can allow this. Imagine plotting the abundance of a transcript and the relative protein. Or is a gene has the same number for the transcript and the protein, we should pull both.

I am still in two minds about it to be honest. I have now added the aggregate_cells function and have added the same check there as well. In general the one thing that needs to be assumed when pulling data not only from different experiments but also from different assays is that it will be normalised differently. In my experience users tend to just use whatever is in front of them so that could lead to misinterpretation. If you feel very strongly that you'd want to allow it then I can tweak the code to enable that.

Can you please make an example? And why this would be problematic?

Example below:

library(tidySingleCellExperiment)
library(DropletTestFiles)
library(DropletUtils)
path <- getTestFile("tenx-3.0.0-pbmc_10k_protein_v3/1.0.0/filtered.tar.gz")
dir <- tempfile()
untar(path, exdir=dir)

sce <- read10xCounts(file.path(dir, "filtered_feature_bc_matrix"))
sce
sce <- splitAltExps(sce, rowData(sce)$Type)
rownames(sce) <- rowData(sce)$Symbol
sce |> 
  join_features(features = "MT-CO1", shape = "long")

sce |> 
  join_features(features = "MT-CO1", shape = "wide")

For me at least in the wide format I get:

.cell Sample Barcode MT.CO1

Notice the replacement of the - with the ..

Please also have a look to the conflicts.

Will do. Now that I am done with the backbone I will refactor the code to include the commits that were made in the interim.

Thanks

Sep 09 '23 16:09 Biomiha

(

can you please provide this info for the authorship, if you haven't done already?

Email	Institution	First Name	Middle Name(s)/Initial(s)	Last Name	Suffix	Corresponding Author	Home Page URL	Collaborative Group/Consortium	ORCiD

)

Sep 11 '23 00:09 stemangiola

I am still in two minds about it to be honest. I have now added the aggregate_cells function and have added the same check there as well. In general the one thing that needs to be assumed when pulling data not only from different experiments but also from different assays is that it will be normalised differently. In my experience users tend to just use whatever is in front of them so that could lead to misinterpretation. If you feel very strongly that you'd want to allow it then I can tweak the code to enable that.

A good approach here is to see what tidyseurat does when we have multiomics, for example, with RNA and ADT. Would you be able to get an example dataset with ADT (e.g. azimuth PBMC I believe) and test your use cases on that? For example getting a transcript and a protein with the same query, or getting a feature ID that is the same for transcript and protein.

Notice the replacement of the - with the ..

I believe is due to the Matrix of SCE that does not allow -. As we load the feature info there, I am not sure we can do much about that. Maybe print a warning message? So the user does not waste 30 minutes trying to debug this behaviour?

Sep 11 '23 00:09 stemangiola

A good approach here is to see what tidyseurat does when we have multiomics, for example, with RNA and ADT. Would you be able to get an example dataset with ADT (e.g. azimuth PBMC I believe) and test your use cases on that? For example getting a transcript and a protein with the same query, or getting a feature ID that is the same for transcript and protein.

That's a good shout actually. Looking at it (please correct me if I am wrong), it seems that tidyseurat adds 2 columns called .abundance_RNA and .abundance_ADT and that seems to work fine (albeit with NAs everywhere) even when mixing features from different experiment types (or assays), whereas it seems to drop features from the ADT "assay" silently. If I generate the cbmc object from https://satijalab.org/seurat/articles/multimodal_vignette, running the following:

cbmc |> 
  join_features(features = c("A1BG-AS1", "CD3"), shape = "wide")

will only show A1BG-AS1.

Also in the tidyseurat print method the dash (-) is preserved whereas the output of

as.SingleCellExperiment(cbmc) |> 
  join_features(features = c("A1BG-AS1"), shape = "wide")

it is changed to a dot (.) . I don't actually think it is to do with the matrix colnames. When I stepped into the get_abundance_sc_wide function the output contained the dash (-) but when part of the join_features function it is changed to a dot.

Sep 12 '23 07:09 Biomiha

will only show A1BG-AS1.

That's definitely a bug :)

I don't actually think it is to do with the matrix colnames. When I stepped into the get_abundance_sc_wide function the output contained the dash (-) but when part of the join_features function it is changed to a dot.

Can you find which command is responsible?

I think for shape="long" we can copy what Seurat does, and as you proposed. NAs are fine. When shape = "wide" I believe we need to specify the assay, and we can encode (experiment + assay) the same way we print in the tibble.

Is anything else pending for discussion for the implementation?

Sep 12 '23 08:09 stemangiola

Hi Stefano,

So I think I've managed to get the functions to work with any selected assay. I noticed that the aggregate_cells method had been refactored (yay!!) so I have used the new method and added the functionality to call assays from the altExps slots. Now the only thing that I think is outstanding is that when selecting all=TRUE in join_features it returns NULL. Not sure why.

To answer your question

Can you find which command is responsible?

It's the left_join here:

 suppressMessages({
      .data <-
        .data %>%
        left_join(                            # <= this one here
          by=c_(.data)$name,
          get_abundance_sc_long(
            .data=.data,
            features=features,
            all=all,
            exclude_zeros=exclude_zeros, 
            ...)) %>%
        select(!!c_(.data)$symbol, .feature,
               contains(".abundance"), everything())
    })

Oct 01 '23 15:10 Biomiha

I have also made a stab at refactoring the aggregate_cells method. Was there a particular reason to turn the output into a SummarizedExperiment object? I can see that the equivalent method in the tidyseurat outputs a tibble instead.

Oct 23 '23 20:10 Biomiha

I have also made a stab at refactoring the aggregate_cells method. Was there a particular reason to turn the output into a SummarizedExperiment object? I can see that the equivalent method in the tidyseurat outputs a tibble instead.

I think it is elegant to switch from a single cell to a bulk container in Bioconductor, with the tidy interface bridging the two interfaces.

For tidyseurat, I don't think it is good to add SummarizedExperiment dependency, so it will not rely on bioconductor. However a seurat container could be used to store pseudobulk.

Oct 23 '23 20:10 stemangiola

For me personally it feels slightly counterintuitive to end up with a non-tidy output but I understand the concept and have amended the aggregate_cells function to output a SummarizedExperiment object. How would you like to handle the naming of the groups in cases where there are multiple selection columns (e.g. groups and ident)? For now the separator is "___" by default. Once we've finalised that I will update the unit test and let you know.

Oct 25 '23 11:10 Biomiha

For me personally it feels slightly counterintuitive to end up with a non-tidy output but I understand the concept and have amended the aggregate_cells function to output a SummarizedExperiment object. How would you like to handle the naming of the groups in cases where there are multiple selection columns (e.g. groups and ident)? For now the separator is "___" by default. Once we've finalised that I will update the unit test and let you know.

there are adapters for singleCellExperiment and summarizedExperiment, so we keep tidy all the way.

Oct 25 '23 13:10 stemangiola

Do I need to use any other coercion method, other than SummarizedExperiment? Currently my output looks like this:

> df |> 
    aggregate_cells(.sample = c(groups, ident), assays = c("ADT-logcounts", "logcounts"))
$logcounts
class: SummarizedExperiment 
dim: 230 6 
metadata(0):
assays(1): ''
rownames(230): MS4A1 CD79B ... SPON2 S100B
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(0):

$`ADT-logcounts`
class: SummarizedExperiment 
dim: 25 6 
metadata(0):
assays(1): ''
rownames(25): Ab-1 Ab-2 ... Ab-24 Ab-25
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(0):

Oct 25 '23 14:10 Biomiha

Can a SummarizedExperiment have alternative experiment slot? If not ..

To keep consistent and always output a SummarizedExperiment rather than a list, I would use this strategy in case we have alternative experiments included

... |>
# Reshape to make RNA and ADT both features
        pivot_longer(
          cols = c(RNA, ADT), # dynamic assays
          names_to = "data_source",
          values_to = "count"
        ) |>
        filter(!count |> is.na()) |>
        
        # Preserve the original gene name for interpretation/dowanstream aalyses
       mutate(!!as.symbol(glue("{f_()$name}_original"))) :=  feature) |>

       # Make unique for SE container
       unite( !!f_()$symbol, c(feature, data_source), remove = FALSE) |>
        
        # Covert
        tidybulk::as_SummarizedExperiment(
          .sample = .sample, # these should be replaced with the dynamic gaming established in utilities.R
          .transcript = .feature,
          .abundance = count
        )

With a big message tidySingleCellExperiment says: ....

A user I will able to filter based on data_source == "RNA" for example

Oct 26 '23 03:10 stemangiola

Do you think I'd need to change the tidybulk::as_SummarizedExperiment function to allow extra columns as well? Not necessarily something I want to do but I've noticed that it behaves strangely if multiple data sources are used. The function itself only takes 3 arguments, .sample, .transcript and .abundance. If I add the data_source it sometimes pushes it to the rowData slot but not every time and for some reason it only takes one type of data_source, whereas I can see in the tibble that I use as input into the tidybulk::as_SummarizedExperiment function that they are all there :confused: . The current commit should contain what you suggested but I am not 100% happy with its behaviour.

Oct 26 '23 16:10 Biomiha

Just to elaborate, it seems that only when RNA and ADT are combined the output looks different (and not correct; the features are truncated to RNA features only) - last example below:

df |> 
  aggregate_cells(.sample = groups, assays = c("logcounts"))

class: SummarizedExperiment 
dim: 230 2 
metadata(0):
assays(1): logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(2): g1 g2
colData names(2): groups data_source
  
df |> 
  aggregate_cells(.sample = c(groups, ident), assays = "logcounts")

class: SummarizedExperiment 
dim: 230 6 
metadata(0):
assays(1): logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source
      
df |> 
  aggregate_cells(.sample = c(groups, ident), assays = c("counts", "logcounts"))

class: SummarizedExperiment 
dim: 230 6 
metadata(0):
assays(2): counts logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source

df |> 
  aggregate_cells(.sample = c(groups, ident), assays = c("ADT-counts", "ADT-logcounts"))

class: SummarizedExperiment 
dim: 25 6 
metadata(0):
assays(2): ADT-counts ADT-logcounts
rownames(25): Ab-1 Ab-10 ... Ab-8 Ab-9
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source

df |> 
  aggregate_cells(.sample = c(groups, ident), assays = c("ADT-logcounts", "logcounts"))

class: SummarizedExperiment 
dim: 255 6 
metadata(0):
assays(2): logcounts ADT-logcounts
rownames(255): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(1): data_source
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(2): groups ident

Do you happen to know what could be causing this strange behaviour?

Oct 27 '23 10:10 Biomiha

I am not sure I see the problem could you point that to me? 230 + 25 = 255

Oct 29 '23 03:10 stemangiola

You're quite right, apologies, it doesn't truncate the features, it orders them somehow (?), hence why the ADT ones were hidden from the output (as well as in the head and tail, when I checked those). The one thing that still does not look consistent is that the last example in my list is the only one where the data_source is in the rowData slot, whereas in the rest it's in the colData slot. I would almost favour everything being in the rowData slot instead. What do you think?

Oct 29 '23 08:10 Biomiha

You're quite right, apologies, it doesn't truncate the features, it orders them somehow (?), hence why the ADT ones were hidden from the output (as well as in the head and tail, when I checked those). The one thing that still does not look consistent is that the last example in my list is the only one where the data_source is in the rowData slot, whereas in the rest it's in the colData slot. I would almost favour everything being in the rowData slot instead. What do you think?

colData includes info that are about samples, row data about genes. The dimensions should match with samples and genes, so its not about us to decide.

not sure about adt feature desapearing :(

Oct 29 '23 22:10 stemangiola

Well I suppose just the fact that it sticks the data_source column into either the colData or the rowData based on the dimensions to me would suggest it's not very robust :). I have now pushed all the changes (including the unit tests) to the altExp_methods branch. If you are happy with how it works I think it can be pulled into the master branch.

Oct 31 '23 08:10 Biomiha

Well I suppose just the fact that it sticks the data_source column into either the colData or the rowData based on the dimensions to me would suggest it's not very robust :).

I would say, since you are adding the column internally, please add the column manually to rowData.
A small detail. Let's be very verbose in the messages (please tell me if I am misunderstanding)

"tidySingleCellExperiment says: The selected assays have overlapping feature names. The feature names have been combined with the selected assay names (stored in the column assay), to keep the rownames of the SingleCellExperiment unique. The original feature names are stored in the column XXX"

Could you also confirm that this message is printed only when multiple Experiments & with overlapping gene names?
Sorry, I'm thinking, would it be better to call the data_source column assay?

Oct 31 '23 11:10 stemangiola

I would say, since you are adding the column internally, please add the column manually to rowData.

Done.

A small detail. Let's be very verbose in the messages

Done. The warning now reads: tidySingleCellExperiment says: The selected assays have overlapping feature names. The feature names have been combined with the selected assay_type, to keep the rownames of the SingleCellExperiment unique. You can find the original feature names in the orig.feature.names column of the rowData slot of your object.

Could you also confirm that this message is printed only when multiple Experiments & with overlapping gene names?

It is now :smiley: (I misunderstood what you meant before).

Sorry, I'm thinking, would it be better to call the data_source column assay?

I have changed it to assay_type. I think the name is self-explanatory and still distinct from the word assay, which in SingleCellExperiment world refers to counts, logcounts, etc... i.e. even data transformations.

Oct 31 '23 22:10 Biomiha

Amazing, we are getting close.

I submitted a review with comments. After this, only two thing will remain

I will review the code more generally for robustness and style
You will have to pass the unit tests (I believe they are failing at the moment). Are they gaining locally for you as well?

Oct 31 '23 22:10 stemangiola

tidySingleCellExperiment tidySingleCellExperiment copied to clipboard

Add conditional print method for objects that contain alternative experiments

tidySingleCellExperiment
tidySingleCellExperiment copied to clipboard