scRepertoire
scRepertoire copied to clipboard
screpertoire individual vs multiple samples - loss of data?
Hi Nick,
Highly appreciate all the work you do in the single-cell community - your packages have been very useful on multiple occasions!
I'm having issues with adding 10x VDJ data via scRepertoire in that when using a list of multiple contig files (subsetted from vdjaggr output as don't have access to individual runs at the moment) and applying them to a seurat object containing the matching samples I get much fewer 'hits' with combineExpression than I do if I subset the seurat object for a singular sample and apply only the relevant contig file with combineExpression.
length(unique(singular$CTaa)) [1] 657 length(unique(multiplex$CTaa)) [1] 489
Perhaps I am missing something basic here (and its been a while since I used screpertoire), but I feel like I didn't have this issue previously. I've recently moved workplaces and therefore new PC etc... so all packages should be fairly up to date (scRepertoire 2.0.0).
Best,
Chris
Hey Chris,
Thanks for reaching out - my guess is there is a barcode mismatch between the the singular and multiplex seurat objects vs contig data. I haven't used aggr in awhile, but if I remember correctly, the workflow modifies the suffix number. For example:
- ACTGXXXXXXXXXXX-1 (First Run)
- ACTGXXXXXXXXXXX-2 (Second Run)
You might have to manually edit your contig data.frame for the second run to have barcode matches. If this is not the case, happy to troubleshoot with you more on what is going on - a reproducible example would help immensely for that.
Thanks, Nick
Hey Nick,
Thanks for the quick response. I've pasted the code used below (hope its clear enough) - as you can see I don't think its a barcode issue. Unfortunately I can't share the data (privacy) - do you have a dataset you would prefer I try run the same analysis on that I could then share?
Best,
Chris
#No barcode duplicates in the large seurat object
length(rownames([email protected]))
#> [1] 43275
length(unique(rownames([email protected])))
#> [1] 43275
#creating test objects
multiplex_samples <- c("CRC01_WT", "CRC01_CTRL")
test_object_multiplex <- subset(whole_object, sample %in% multiplex_samples)
test_object_crc01wt <- subset(whole_object, sample == "CRC01_WT")
#VDJ filtering
#whole_vdj has already been filtered for barcodes matching with the seurat object
crc01wt_contig <- subset(whole_vdj, sample == "CRC01_WT")
####multiplex processing
#LoadContigs
multiplex_contig_list <- list(crc01wt_contig, crc01ctrl_contig)
multiplex_contigs <- loadContigs(multiplex_contig_list)
#combine TCR contigs
multiple_combined_tcr <- combineTCR(multiplex_contigs,
samples = c("CRC01_WT", "CRC01_CTRL"),
removeNA = F,
removeMulti = F,
filterMulti = T)
#dims of combined TCR lists and intersect with seurat object barcodes
dim(multiple_combined_tcr$CRC01_WT)
#> [1] 1598 12
length(intersect(multiple_combined_tcr$CRC01_WT$barcode, rownames([email protected])))
#> [1] 1598
#add combined tcr data to object
test_object_multiplex <- combineExpression(multiple_combined_tcr,
test_object_multiplex,
cloneCall = "aa",
group.by = "sample",
proportion = F,
cloneSize = c(Single=1, Small=5, Medium=15, Large=50, Hyperexpanded=5000))
####singular processing
#LoadContigs
crc01wt_contig_list <- list(crc01wt_contig)
crc01_contigs <- loadContigs(crc01wt_contig_list)
#combine TCR contigs
crc01wt_combined_tcr <- combineTCR(crc01_contigs,
samples = c("CRC01_WT"),
removeNA = F,
removeMulti = F,
filterMulti = T)
#dims of combined TCR lists and intersect with seurat object barcodes
dim(crc01wt_combined_tcr$CRC01_WT)
#> [1] 1598 12
length(intersect(crc01wt_combined_tcr$CRC01_WT$barcode, rownames([email protected])))
#> [1] 1598
#add combined tcr data to object
test_object_crc01wt <- combineExpression(crc01wt_combined_tcr,
test_object_crc01wt,
cloneCall = "aa",
group.by = "sample",
proportion = F,
cloneSize = c(Single=1, Small=5, Medium=15, Large=50, Hyperexpanded=5000))
#####comparing CRC01_WT in the multiplex vs singular
dim(subset(test_object_multiplex, sample == "CRC01_WT"))
#> [1] 17925 1764
dim(test_object_crc01wt)
#> [1] 17925 1764
#dims match
length(intersect(rownames(subset(test_object_multiplex, sample == "CRC01_WT")@meta.data),rownames([email protected])))
#> [1] 1764
#barcodes match
length(which(!is.na(subset(test_object_multiplex, sample == "CRC01_WT")$CTaa)))
#> [1] 384
length(which(!is.na(test_object_crc01wt$CTaa)))
#> [1] 1598
#massive difference here
#1598 is the #cells that vdjaggr assigned VDJ data to for this sample
Hi! As nick said, it is quite a bit harder to troubleshoot/debug without any example data/minimal reproducible example. However, to narrow down on the issue:
- make sure you have the latest commit of the master branch because a bug related to
combineTCR
was recently (probably) fixed (#281) - try the exact same script with a previous stable release of scRepertoire, for example the version on the v1 branch, and see if the issue persists.
If the issue is still there then we could probably be more sure that it an issue with the code in the new release.
Hi @Qile0317,
- Can confirm I have the latest commit (I had a similar issue as #279 and #281 when adding TRUST4 data but that was solved with the latest commit) but the issue still persists. In fact, I have the same issue as I originally described when trying to add TRUST4 data.
I completely understand the issue of not having example data/minimal reproducible example, but I am very limited by data sharing regulations. A quick test on the inbuilt scRepertoire dataset did not have the same issue.
- Did not have time to do this but will do so tomorrow and update.
@CAC13 Also understandable about data confidentiality, but it will limit the troubleshooting.
Thanks for sharing the code - have you thoroughly check the barcodes of the combineTCR()
output vs single-cell object?
- Multiplex experiments can have duplicated barcodes - this is generally 1maybe 10-15 cells that have an issue
- Cell Ranger aggr will append a suffix to the barcodes, which will then have a mismatch during
combineExpression()
Nick
Have checked the following so I don't think its a barcode mismatch issue:
- Original seurat object barcodes are all unique (modified using code described in scRepertoire vignette) - no duplicates
-
combineTCR()
output barcodes match the object barcodes perfectly - I've now run the same code/objects with scRepertoire v1.11 and end up with what I would be expecting.
> length(which(!is.na(subset(test_object_multiplex, sample == "CRC01_WT")$CTaa)))
[1] 1598
> length(which(!is.na(test_object_crc01wt$CTaa)))
[1] 1598
So I guess it must be a version specific issue? However, as I mentioned above the scRep inbuilt object had no issues with v2.0.0...
Just in case I also tested with previous versions of Seurat (didn't go past v4.0.1) in case it was an issue with v5 but the issue persisted.
Ok, I might know what the problem is, will be working on a fix. Thanks a lot for finding the root cause! I suspect it might be related to the fact that I had changed an internal function to initialize a matrix of "NA" strings. But yea, as you mentioned, most of the changed functions are currently only tested on the built in data which is subject to change. This might be related to #284 as well but I could also be talking nonsense :P
I also started looking at the integration with my scRNAseq data. What I noticed is that when I use group.by = "sample",
in combineExpression, I get 10k fewer cells in [email protected]
annotated with scRep results. If I don't use it, I get the correct number of annotated cells.
That being said, I was naively expecting that clonalProportion will add up to something round in each of these scenarios, but it doesn't, I get values both above and below 1 per sample when I run this:
[email protected] %>% + group_by(sample) %>% + summarize(n=sum(clonalProportion, na.rm = T))
I renamed barcodes in the combined.TCR from SBM1073_AAACCCAAGTAGGATT to AAACCCAAGTAGGATT-SBM1073.2 to match my sce object, but I haven't looked into the code yet so don't know if that contributes.
UPD. I learnt what clonalProportion represents and why it doesn't add up to 1 in the way I tried.
Hi all,
Appreciate the rapid assistance and help with this - hopefully we have identified the correct issue!
Would you like me to close this or rather wait until the update is pushed and we can confirm?
@ncborcherding @Qile0317 Hi, Is there an update on this issue? I think that line 99 in combineExpression.R should be
datainstead of
dataas there is no reason to ignore the sample column and to get fewer annotated cells when using a grouping by sample, or do I understand this argument wrong?
Struggling to fix the issue without a reproducible example (I have tried a couple of objects beyond the screpertoire test/full objects) - just pushed the commit that @noranekonobokkusu
@ncborcherding So sorry for being gone for so long, I've been working recently to try and reproduce the issue with little luck - and I just wanted to check up - it appears to me that currently combineTCR()
still uses the old C++ code I wrote that caused this issue. And according to @CAC13 this issue isn't present in the old v1.11? Could we actually do a full revert to the R version and then try change up the function components one-by-one? I hope this issue hasn't caused any actual researchers to get incorrect results :P
@Qile0317
I am not convinced this is an issue from scRepertoire side of things. I have tried multiple data sets to reproduce the error and do not find the discrepancy. I believe this is an issue with mismatched barcodes somewhere or possible duplicate barcodes? combineExpression()
will automatically remove duplicate barcodes.
Here is a reproducible example using the full scRep example using a modified version of the above script:
whole_object <- readRDS("/Users/nick/Documents/GitHub/scRepertoire/vignettes/articles/scRep_example_full.rds")
#No barcode duplicates in the large seurat object
length(rownames([email protected]))
#> [1] 30105
length(unique(rownames([email protected])))
#> [1] 30105
#creating test objects
multiplex_samples <- c("P17B", "P17L")
test_object_multiplex <- subset(whole_object, orig.ident %in% multiplex_samples)
test_object_P17B <- subset(whole_object, orig.ident == "P17B")
#VDJ filtering
#whole_vdj has already been filtered for barcodes matching with the seurat object
P17B_contig <- contig_list[1]
####multiplex processing
#LoadContigs
multiplex_contig_list <- contig_list[1:2]
#combine TCR contigs
multiple_combined_tcr <- combineTCR(multiplex_contig_list,
samples = c("P17B", "P17L"),
removeNA = F,
removeMulti = F,
filterMulti = T)
#dims of combined TCR lists and intersect with seurat object barcodes
dim(multiple_combined_tcr$P17B)
#> [1] 2805 12
length(intersect(multiple_combined_tcr$P17B$barcode, rownames([email protected])))
#> [1] 2640
#add combined tcr data to object
test_object_multiplex <- combineExpression(multiple_combined_tcr,
test_object_multiplex,
cloneCall = "aa",
group.by = "sample",
proportion = F,
cloneSize = c(Single=1, Small=5, Medium=15, Large=50, Hyperexpanded=5000))
#combine TCR contigs
P17B_combined_tcr <- combineTCR(P17B_contig,
samples = "P17B",
removeNA = F,
removeMulti = F,
filterMulti = T)
#dims of combined TCR lists and intersect with seurat object barcodes
dim(P17B_combined_tcr$P17B)
#> [1] 2805 12
length(intersect(P17B_combined_tcr$P17B$barcode, rownames([email protected])))
#> [1] 2640
#add combined tcr data to object
test_object_P17B <- combineExpression(P17B_combined_tcr,
test_object_P17B,
cloneCall = "aa",
group.by = "sample",
proportion = F,
cloneSize = c(Single=1, Small=5, Medium=15, Large=50, Hyperexpanded=5000))
#####comparing P17B in the multiplex vs singular
dim(subset(test_object_multiplex, orig.ident == "P17B"))
#> [1] 33538 3020
dim(test_object_P17B)
#> [1] 3538 3020
length(intersect(rownames(subset(test_object_multiplex, orig.ident == "P17B")@meta.data),rownames([email protected])))
#> [1] 3020
length(which(!is.na(subset(test_object_multiplex, orig.ident == "P17B")$CTaa)))
#> [1] 2640
length(which(!is.na(test_object_P17B$CTaa)))
#> [1] 2640
For now I am going to close this until someone can provide a reproducible example of the problem.
Nick