tidySingleCellExperiment
tidySingleCellExperiment copied to clipboard
Add conditional print method for objects that contain alternative experiments
Hi Stefano,
I've now made the printing conditional and have amended the show() unit test to assure that the default print method works. For the unit test I've added a random normcounts matrix to the existing pbmc_small sce object and named it ADT
When printed to the console the header now looks like this:
> pbmc_small
`# A SingleCellExperiment-tibble abstraction: 80 × 17
`# Features=230 | Cells=80 | Assays=counts, logcounts, ADT-normcounts
The print method takes any number of alternative experiments and appends the name to the assayNames contained therein, separated by a hyphen. I think this looks the tidiest. I hope you agree. If no alternative experiments exist, the print output is as before.
I'll start working on how to pull features from the alternative experiment slots into the tibble with join_features.
Hi Stefano,
I've now made the printing conditional and have amended the
show()unit test to assure that the default print method works. For the unit test I've added a randomnormcountsmatrix to the existingpbmc_smallsce object and named itADTWhen printed to the console the header now looks like this:> pbmc_small `# A SingleCellExperiment-tibble abstraction: 80 × 17 `# Features=230 | Cells=80 | Assays=counts, logcounts, ADT-normcountsThe print method takes any number of alternative experiments and appends the name to the assayNames contained therein, separated by a hyphen. I think this looks the tidiest. I hope you agree. If no alternative experiments exist, the print output is as before.
I'll start working on how to pull features from the alternative experiment slots into the tibble with
join_features.
Amazing!
So if I can imagine a complex SCE with 3 expriments
- base: counts, logcounts
- ADT: counts, logcounts
- Hashtag demultiplex: counts, logcounts
we would have
`# A SingleCellExperiment-tibble abstraction: 80 × 17
`# Features=230 | Cells=80 | Assays=counts, logcounts, ADT-counts, ADT-logcounts, Hashtag demultiplex-counts, Hashtag demultiplex-logcounts
Am I interpreting correctly?
Yes, exactly. When I tested on a more complicated example I noticed that it creates a new line if the maximum line width is exceeded, but it still prints everything.
Yes, exactly.
Your solution looks great, and very intuitive, especially if experiment names are short.
Just to explore all possibilities, could you think about a solution that avoids the repetition of experiment names, so as to increase the scalability of our printout to many long experiment names?
Printing the name of the experiment instead, followed by the assayNames and separated by a semi-colon is easy enough. That would look something like:
`# A SingleCellExperiment-tibble abstraction: 7,865 × 3
`# Features=33538 | Cells=7865 | Assays=base: counts, logcounts; Antibody Capture: counts, logcounts; Hashtag demultiplex: counts, logcounts
However, like I said in my (edited) comment above, the native tibble print method will break lines if the character count gets too long (I can actually see a difference in how it's printed on my laptop vs. my wide-screen display).
I think it's a matter of personal preference really. I find the colons and semi-colons a bit much and prefer the wider print but we can ask others and find a consensus. I also think we would need to find another name for base because it implies the base R vs. tidy R semantics, whereas here it's actually the main experiment and alternative experiments (and I find even this nomenclature pretty bad to be honest).
I find the colons and semi-colons a bit much and prefer the wider print but we can ask others and find a consensus.
I like the idea. Let's present both an "easy" and "hard" scenario to the community.
I also think we would need to find another name for
basebecause it implies the base R vs. tidy R semantics
Agree.
I think slack Bioc tidy is the best place where to ask our community. But feel free to use any tool.
@Biomiha Thinking about it. You can leave it as is; maybe edit the example in the unit test to be the worst-case scenario. If it looks good in the worst-case scenario, there is no reason to change for the time being.
And instead, you can focus on the features such as join_feature, aggregate_cells, etc...
Okay, I've now mocked up the expression matrices for the additional alternative experiments and added that to the unit tests. In my hands it works as expected. Will move on to joining features next.
So, I think the join_features method is good to go. The way I envisaged it is that you can pick any feature for the long format and it automatically pulls the abundance for all the assays in that particular experiment. So if the features are from the ADT altExp it will just silently pull the values from all assays in the ADT altExp. The one thing I chose not to allow is the mixing of features from different experiments, which I think should be avoided for other reasons.
For the wide shape, you obviously need to pick which assay to pull from and that makes things pretty straightforward. To avoid pulling assays with same name from different experiments I have chosen to use the same nomenclature that I used in the print method, e.g. logcounts, ADT-logcounts or HTO-logcounts.
I have noticed however that there is a name.repair argument somewhere because at least for me features that have an illegal character get it changed to a "." in the wide format, for example my mock Ab features Ab-1 and Ab-2, etc... get changed to Ab.1, Ab.2, ... I think that could prove problematic because there are quite a few gene names that contain hyphens (like all mithchondrial genes).
Well done!
The one thing I chose not to allow is the mixing of features from different experiments, which I think should be avoided for other reasons.
I would think we can allow this. Imagine plotting the abundance of a transcript and the relative protein. Or is a gene has the same number for the transcript and the protein, we should pull both.
I have noticed however that there is a name.repair argument somewhere because at least for me features that have an illegal character get it changed to a "." in the wide format, for example my mock Ab features Ab-1 and Ab-2, etc... get changed to Ab.1, Ab.2, ... I think that could prove problematic because there are quite a few gene names that contain hyphens (like all mithchondrial genes).
Can you please make an example? And why this would be problematic?
Please also have a look to the conflicts.
I would think we can allow this. Imagine plotting the abundance of a transcript and the relative protein. Or is a gene has the same number for the transcript and the protein, we should pull both.
I am still in two minds about it to be honest. I have now added the aggregate_cells function and have added the same check there as well. In general the one thing that needs to be assumed when pulling data not only from different experiments but also from different assays is that it will be normalised differently. In my experience users tend to just use whatever is in front of them so that could lead to misinterpretation. If you feel very strongly that you'd want to allow it then I can tweak the code to enable that.
Can you please make an example? And why this would be problematic?
Example below:
library(tidySingleCellExperiment)
library(DropletTestFiles)
library(DropletUtils)
path <- getTestFile("tenx-3.0.0-pbmc_10k_protein_v3/1.0.0/filtered.tar.gz")
dir <- tempfile()
untar(path, exdir=dir)
sce <- read10xCounts(file.path(dir, "filtered_feature_bc_matrix"))
sce
sce <- splitAltExps(sce, rowData(sce)$Type)
rownames(sce) <- rowData(sce)$Symbol
sce |>
join_features(features = "MT-CO1", shape = "long")
sce |>
join_features(features = "MT-CO1", shape = "wide")
For me at least in the wide format I get:
.cell Sample Barcode MT.CO1
Notice the replacement of the - with the ..
Please also have a look to the conflicts.
Will do. Now that I am done with the backbone I will refactor the code to include the commits that were made in the interim.
Thanks
(
can you please provide this info for the authorship, if you haven't done already?
| Institution | First Name | Middle Name(s)/Initial(s) | Last Name | Suffix | Corresponding Author | Home Page URL | Collaborative Group/Consortium | ORCiD |
|---|
)
I am still in two minds about it to be honest. I have now added the aggregate_cells function and have added the same check there as well. In general the one thing that needs to be assumed when pulling data not only from different experiments but also from different assays is that it will be normalised differently. In my experience users tend to just use whatever is in front of them so that could lead to misinterpretation. If you feel very strongly that you'd want to allow it then I can tweak the code to enable that.
A good approach here is to see what tidyseurat does when we have multiomics, for example, with RNA and ADT. Would you be able to get an example dataset with ADT (e.g. azimuth PBMC I believe) and test your use cases on that? For example getting a transcript and a protein with the same query, or getting a feature ID that is the same for transcript and protein.
Notice the replacement of the
-with the..
I believe is due to the Matrix of SCE that does not allow -. As we load the feature info there, I am not sure we can do much about that. Maybe print a warning message? So the user does not waste 30 minutes trying to debug this behaviour?
A good approach here is to see what tidyseurat does when we have multiomics, for example, with RNA and ADT. Would you be able to get an example dataset with ADT (e.g. azimuth PBMC I believe) and test your use cases on that? For example getting a transcript and a protein with the same query, or getting a feature ID that is the same for transcript and protein.
That's a good shout actually. Looking at it (please correct me if I am wrong), it seems that tidyseurat adds 2 columns called .abundance_RNA and .abundance_ADT and that seems to work fine (albeit with NAs everywhere) even when mixing features from different experiment types (or assays), whereas it seems to drop features from the ADT "assay" silently.
If I generate the cbmc object from https://satijalab.org/seurat/articles/multimodal_vignette, running the following:
cbmc |>
join_features(features = c("A1BG-AS1", "CD3"), shape = "wide")
will only show A1BG-AS1.
Also in the tidyseurat print method the dash (-) is preserved whereas the output of
as.SingleCellExperiment(cbmc) |>
join_features(features = c("A1BG-AS1"), shape = "wide")
it is changed to a dot (.) .
I don't actually think it is to do with the matrix colnames. When I stepped into the get_abundance_sc_wide function the output contained the dash (-) but when part of the join_features function it is changed to a dot.
will only show
A1BG-AS1.
That's definitely a bug :)
I don't actually think it is to do with the matrix colnames. When I stepped into the get_abundance_sc_wide function the output contained the dash (-) but when part of the join_features function it is changed to a dot.
Can you find which command is responsible?
I think for shape="long" we can copy what Seurat does, and as you proposed. NAs are fine. When shape = "wide" I believe we need to specify the assay, and we can encode (experiment + assay) the same way we print in the tibble.
Is anything else pending for discussion for the implementation?
Hi Stefano,
So I think I've managed to get the functions to work with any selected assay. I noticed that the aggregate_cells method had been refactored (yay!!) so I have used the new method and added the functionality to call assays from the altExps slots.
Now the only thing that I think is outstanding is that when selecting all=TRUE in join_features it returns NULL. Not sure why.
To answer your question
Can you find which command is responsible?
It's the left_join here:
suppressMessages({
.data <-
.data %>%
left_join( # <= this one here
by=c_(.data)$name,
get_abundance_sc_long(
.data=.data,
features=features,
all=all,
exclude_zeros=exclude_zeros,
...)) %>%
select(!!c_(.data)$symbol, .feature,
contains(".abundance"), everything())
})
I have also made a stab at refactoring the aggregate_cells method. Was there a particular reason to turn the output into a SummarizedExperiment object? I can see that the equivalent method in the tidyseurat outputs a tibble instead.
I have also made a stab at refactoring the
aggregate_cellsmethod. Was there a particular reason to turn the output into aSummarizedExperimentobject? I can see that the equivalent method in the tidyseurat outputs a tibble instead.
I think it is elegant to switch from a single cell to a bulk container in Bioconductor, with the tidy interface bridging the two interfaces.
For tidyseurat, I don't think it is good to add SummarizedExperiment dependency, so it will not rely on bioconductor. However a seurat container could be used to store pseudobulk.
For me personally it feels slightly counterintuitive to end up with a non-tidy output but I understand the concept and have amended the aggregate_cells function to output a SummarizedExperiment object.
How would you like to handle the naming of the groups in cases where there are multiple selection columns (e.g. groups and ident)? For now the separator is "___" by default. Once we've finalised that I will update the unit test and let you know.
For me personally it feels slightly counterintuitive to end up with a non-tidy output but I understand the concept and have amended the
aggregate_cellsfunction to output aSummarizedExperimentobject. How would you like to handle the naming of the groups in cases where there are multiple selection columns (e.g. groups and ident)? For now the separator is "___" by default. Once we've finalised that I will update the unit test and let you know.
there are adapters for singleCellExperiment and summarizedExperiment, so we keep tidy all the way.
Do I need to use any other coercion method, other than SummarizedExperiment? Currently my output looks like this:
> df |>
aggregate_cells(.sample = c(groups, ident), assays = c("ADT-logcounts", "logcounts"))
$logcounts
class: SummarizedExperiment
dim: 230 6
metadata(0):
assays(1): ''
rownames(230): MS4A1 CD79B ... SPON2 S100B
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(0):
$`ADT-logcounts`
class: SummarizedExperiment
dim: 25 6
metadata(0):
assays(1): ''
rownames(25): Ab-1 Ab-2 ... Ab-24 Ab-25
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(0):
Can a SummarizedExperiment have alternative experiment slot? If not ..
To keep consistent and always output a SummarizedExperiment rather than a list, I would use this strategy in case we have alternative experiments included
... |>
# Reshape to make RNA and ADT both features
pivot_longer(
cols = c(RNA, ADT), # dynamic assays
names_to = "data_source",
values_to = "count"
) |>
filter(!count |> is.na()) |>
# Preserve the original gene name for interpretation/dowanstream aalyses
mutate(!!as.symbol(glue("{f_()$name}_original"))) := feature) |>
# Make unique for SE container
unite( !!f_()$symbol, c(feature, data_source), remove = FALSE) |>
# Covert
tidybulk::as_SummarizedExperiment(
.sample = .sample, # these should be replaced with the dynamic gaming established in utilities.R
.transcript = .feature,
.abundance = count
)
With a big message tidySingleCellExperiment says: ....
A user I will able to filter based on data_source == "RNA" for example
Do you think I'd need to change the tidybulk::as_SummarizedExperiment function to allow extra columns as well? Not necessarily something I want to do but I've noticed that it behaves strangely if multiple data sources are used. The function itself only takes 3 arguments, .sample, .transcript and .abundance. If I add the data_source it sometimes pushes it to the rowData slot but not every time and for some reason it only takes one type of data_source, whereas I can see in the tibble that I use as input into the tidybulk::as_SummarizedExperiment function that they are all there :confused: .
The current commit should contain what you suggested but I am not 100% happy with its behaviour.
Just to elaborate, it seems that only when RNA and ADT are combined the output looks different (and not correct; the features are truncated to RNA features only) - last example below:
df |>
aggregate_cells(.sample = groups, assays = c("logcounts"))
class: SummarizedExperiment
dim: 230 2
metadata(0):
assays(1): logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(2): g1 g2
colData names(2): groups data_source
df |>
aggregate_cells(.sample = c(groups, ident), assays = "logcounts")
class: SummarizedExperiment
dim: 230 6
metadata(0):
assays(1): logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source
df |>
aggregate_cells(.sample = c(groups, ident), assays = c("counts", "logcounts"))
class: SummarizedExperiment
dim: 230 6
metadata(0):
assays(2): counts logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source
df |>
aggregate_cells(.sample = c(groups, ident), assays = c("ADT-counts", "ADT-logcounts"))
class: SummarizedExperiment
dim: 25 6
metadata(0):
assays(2): ADT-counts ADT-logcounts
rownames(25): Ab-1 Ab-10 ... Ab-8 Ab-9
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source
df |>
aggregate_cells(.sample = c(groups, ident), assays = c("ADT-logcounts", "logcounts"))
class: SummarizedExperiment
dim: 255 6
metadata(0):
assays(2): logcounts ADT-logcounts
rownames(255): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(1): data_source
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(2): groups ident
Do you happen to know what could be causing this strange behaviour?
I am not sure I see the problem could you point that to me? 230 + 25 = 255
You're quite right, apologies, it doesn't truncate the features, it orders them somehow (?), hence why the ADT ones were hidden from the output (as well as in the head and tail, when I checked those).
The one thing that still does not look consistent is that the last example in my list is the only one where the data_source is in the rowData slot, whereas in the rest it's in the colData slot. I would almost favour everything being in the rowData slot instead. What do you think?
You're quite right, apologies, it doesn't truncate the features, it orders them somehow (?), hence why the ADT ones were hidden from the output (as well as in the head and tail, when I checked those). The one thing that still does not look consistent is that the last example in my list is the only one where the
data_sourceis in therowDataslot, whereas in the rest it's in thecolDataslot. I would almost favour everything being in therowDataslot instead. What do you think?
colData includes info that are about samples, row data about genes. The dimensions should match with samples and genes, so its not about us to decide.
not sure about adt feature desapearing :(
Well I suppose just the fact that it sticks the data_source column into either the colData or the rowData based on the dimensions to me would suggest it's not very robust :). I have now pushed all the changes (including the unit tests) to the altExp_methods branch. If you are happy with how it works I think it can be pulled into the master branch.
Well I suppose just the fact that it sticks the data_source column into either the colData or the rowData based on the dimensions to me would suggest it's not very robust :).
-
I would say, since you are adding the column internally, please add the column manually to
rowData. -
A small detail. Let's be very verbose in the messages (please tell me if I am misunderstanding)
"tidySingleCellExperiment says: The selected assays have overlapping feature names. The feature names have been combined with the selected assay names (stored in the column assay), to keep the rownames of the SingleCellExperiment unique. The original feature names are stored in the column XXX"
-
Could you also confirm that this message is printed only when multiple Experiments & with overlapping gene names?
-
Sorry, I'm thinking, would it be better to call the
data_sourcecolumnassay?
- I would say, since you are adding the column internally, please add the column manually to rowData.
Done.
- A small detail. Let's be very verbose in the messages
Done. The warning now reads: tidySingleCellExperiment says: The selected assays have overlapping feature names. The feature names have been combined with the selected assay_type, to keep the rownames of the SingleCellExperiment unique. You can find the original feature names in the orig.feature.names column of the rowData slot of your object.
- Could you also confirm that this message is printed only when multiple Experiments & with overlapping gene names?
It is now :smiley: (I misunderstood what you meant before).
- Sorry, I'm thinking, would it be better to call the data_source column assay?
I have changed it to assay_type. I think the name is self-explanatory and still distinct from the word assay, which in SingleCellExperiment world refers to counts, logcounts, etc... i.e. even data transformations.
Amazing, we are getting close.
I submitted a review with comments. After this, only two thing will remain
- I will review the code more generally for robustness and style
- You will have to pass the unit tests (I believe they are failing at the moment). Are they gaining locally for you as well?