scRepertoire icon indicating copy to clipboard operation
scRepertoire copied to clipboard

screpertoire individual vs multiple samples - loss of data?

Open CAC13 opened this issue 1 year ago • 11 comments

Hi Nick,

Highly appreciate all the work you do in the single-cell community - your packages have been very useful on multiple occasions!

I'm having issues with adding 10x VDJ data via scRepertoire in that when using a list of multiple contig files (subsetted from vdjaggr output as don't have access to individual runs at the moment) and applying them to a seurat object containing the matching samples I get much fewer 'hits' with combineExpression than I do if I subset the seurat object for a singular sample and apply only the relevant contig file with combineExpression.

length(unique(singular$CTaa)) [1] 657 length(unique(multiplex$CTaa)) [1] 489

Perhaps I am missing something basic here (and its been a while since I used screpertoire), but I feel like I didn't have this issue previously. I've recently moved workplaces and therefore new PC etc... so all packages should be fairly up to date (scRepertoire 2.0.0).

Best,

Chris

CAC13 avatar Dec 03 '23 20:12 CAC13

Hey Chris,

Thanks for reaching out - my guess is there is a barcode mismatch between the the singular and multiplex seurat objects vs contig data. I haven't used aggr in awhile, but if I remember correctly, the workflow modifies the suffix number. For example:

  • ACTGXXXXXXXXXXX-1 (First Run)
  • ACTGXXXXXXXXXXX-2 (Second Run)

You might have to manually edit your contig data.frame for the second run to have barcode matches. If this is not the case, happy to troubleshoot with you more on what is going on - a reproducible example would help immensely for that.

Thanks, Nick

ncborcherding avatar Dec 04 '23 13:12 ncborcherding

Hey Nick,

Thanks for the quick response. I've pasted the code used below (hope its clear enough) - as you can see I don't think its a barcode issue. Unfortunately I can't share the data (privacy) - do you have a dataset you would prefer I try run the same analysis on that I could then share?

Best,

Chris

#No barcode duplicates in the large seurat object
length(rownames([email protected]))
#> [1] 43275
length(unique(rownames([email protected])))
#> [1] 43275

#creating test objects
multiplex_samples <- c("CRC01_WT", "CRC01_CTRL")
test_object_multiplex <- subset(whole_object, sample %in% multiplex_samples)
test_object_crc01wt <- subset(whole_object, sample == "CRC01_WT")

#VDJ filtering
#whole_vdj has already been filtered for barcodes matching with the seurat object
crc01wt_contig <- subset(whole_vdj, sample == "CRC01_WT")

####multiplex processing
#LoadContigs
multiplex_contig_list <- list(crc01wt_contig, crc01ctrl_contig)
multiplex_contigs <- loadContigs(multiplex_contig_list)

#combine TCR contigs
multiple_combined_tcr <- combineTCR(multiplex_contigs,
                                    samples = c("CRC01_WT", "CRC01_CTRL"),
                                    removeNA = F,
                                    removeMulti = F,
                                    filterMulti = T)

#dims of combined TCR lists and intersect with seurat object barcodes
dim(multiple_combined_tcr$CRC01_WT)
#> [1] 1598   12
length(intersect(multiple_combined_tcr$CRC01_WT$barcode, rownames([email protected])))
#> [1] 1598

#add combined tcr data to object
test_object_multiplex <- combineExpression(multiple_combined_tcr,
                                           test_object_multiplex,
                                           cloneCall = "aa",
                                           group.by = "sample",
                                           proportion = F,
                                           cloneSize = c(Single=1, Small=5, Medium=15, Large=50, Hyperexpanded=5000))

####singular processing
#LoadContigs
crc01wt_contig_list <- list(crc01wt_contig)
crc01_contigs <- loadContigs(crc01wt_contig_list)

#combine TCR contigs
crc01wt_combined_tcr <- combineTCR(crc01_contigs,
                                   samples = c("CRC01_WT"),
                                   removeNA = F,
                                   removeMulti = F,
                                   filterMulti = T)

#dims of combined TCR lists and intersect with seurat object barcodes
dim(crc01wt_combined_tcr$CRC01_WT)
#> [1] 1598   12
length(intersect(crc01wt_combined_tcr$CRC01_WT$barcode, rownames([email protected])))
#> [1] 1598

#add combined tcr data to object
test_object_crc01wt <- combineExpression(crc01wt_combined_tcr,
                                         test_object_crc01wt,
                                         cloneCall = "aa",
                                         group.by = "sample",
                                         proportion = F,
                                         cloneSize = c(Single=1, Small=5, Medium=15, Large=50, Hyperexpanded=5000))

#####comparing CRC01_WT in the multiplex vs singular
dim(subset(test_object_multiplex, sample == "CRC01_WT"))
#> [1] 17925  1764
dim(test_object_crc01wt)
#> [1] 17925  1764
#dims match

length(intersect(rownames(subset(test_object_multiplex, sample == "CRC01_WT")@meta.data),rownames([email protected])))
#> [1] 1764
#barcodes match

length(which(!is.na(subset(test_object_multiplex, sample == "CRC01_WT")$CTaa)))
#> [1] 384
length(which(!is.na(test_object_crc01wt$CTaa)))
#> [1] 1598
#massive difference here
#1598 is the #cells that vdjaggr assigned VDJ data to for this sample

CAC13 avatar Dec 04 '23 17:12 CAC13

Hi! As nick said, it is quite a bit harder to troubleshoot/debug without any example data/minimal reproducible example. However, to narrow down on the issue:

  1. make sure you have the latest commit of the master branch because a bug related to combineTCR was recently (probably) fixed (#281)
  2. try the exact same script with a previous stable release of scRepertoire, for example the version on the v1 branch, and see if the issue persists.

If the issue is still there then we could probably be more sure that it an issue with the code in the new release.

Qile0317 avatar Dec 04 '23 20:12 Qile0317

Hi @Qile0317,

  1. Can confirm I have the latest commit (I had a similar issue as #279 and #281 when adding TRUST4 data but that was solved with the latest commit) but the issue still persists. In fact, I have the same issue as I originally described when trying to add TRUST4 data.

I completely understand the issue of not having example data/minimal reproducible example, but I am very limited by data sharing regulations. A quick test on the inbuilt scRepertoire dataset did not have the same issue.

  1. Did not have time to do this but will do so tomorrow and update.

CAC13 avatar Dec 04 '23 22:12 CAC13

@CAC13 Also understandable about data confidentiality, but it will limit the troubleshooting.

Thanks for sharing the code - have you thoroughly check the barcodes of the combineTCR() output vs single-cell object?

  • Multiplex experiments can have duplicated barcodes - this is generally 1maybe 10-15 cells that have an issue
  • Cell Ranger aggr will append a suffix to the barcodes, which will then have a mismatch during combineExpression()

Nick

ncborcherding avatar Dec 05 '23 09:12 ncborcherding

Have checked the following so I don't think its a barcode mismatch issue:

  1. Original seurat object barcodes are all unique (modified using code described in scRepertoire vignette) - no duplicates
  2. combineTCR() output barcodes match the object barcodes perfectly
  3. I've now run the same code/objects with scRepertoire v1.11 and end up with what I would be expecting.
> length(which(!is.na(subset(test_object_multiplex, sample == "CRC01_WT")$CTaa)))
[1] 1598
> length(which(!is.na(test_object_crc01wt$CTaa)))
[1] 1598

So I guess it must be a version specific issue? However, as I mentioned above the scRep inbuilt object had no issues with v2.0.0...

Just in case I also tested with previous versions of Seurat (didn't go past v4.0.1) in case it was an issue with v5 but the issue persisted.

CAC13 avatar Dec 05 '23 12:12 CAC13

Ok, I might know what the problem is, will be working on a fix. Thanks a lot for finding the root cause! I suspect it might be related to the fact that I had changed an internal function to initialize a matrix of "NA" strings. But yea, as you mentioned, most of the changed functions are currently only tested on the built in data which is subject to change. This might be related to #284 as well but I could also be talking nonsense :P

Qile0317 avatar Dec 05 '23 20:12 Qile0317

I also started looking at the integration with my scRNAseq data. What I noticed is that when I use group.by = "sample", in combineExpression, I get 10k fewer cells in [email protected] annotated with scRep results. If I don't use it, I get the correct number of annotated cells.

That being said, I was naively expecting that clonalProportion will add up to something round in each of these scenarios, but it doesn't, I get values both above and below 1 per sample when I run this:

[email protected] %>%
+   group_by(sample) %>%
+   summarize(n=sum(clonalProportion, na.rm = T)) 

I renamed barcodes in the combined.TCR from SBM1073_AAACCCAAGTAGGATT to AAACCCAAGTAGGATT-SBM1073.2 to match my sce object, but I haven't looked into the code yet so don't know if that contributes.

UPD. I learnt what clonalProportion represents and why it doesn't add up to 1 in the way I tried.

noranekonobokkusu avatar Dec 05 '23 20:12 noranekonobokkusu

Hi all,

Appreciate the rapid assistance and help with this - hopefully we have identified the correct issue!

Would you like me to close this or rather wait until the update is pushed and we can confirm?

CAC13 avatar Dec 06 '23 15:12 CAC13

@ncborcherding @Qile0317 Hi, Is there an update on this issue? I think that line 99 in combineExpression.R should be

data 

instead of

data 

as there is no reason to ignore the sample column and to get fewer annotated cells when using a grouping by sample, or do I understand this argument wrong?

noranekonobokkusu avatar Dec 15 '23 21:12 noranekonobokkusu

Struggling to fix the issue without a reproducible example (I have tried a couple of objects beyond the screpertoire test/full objects) - just pushed the commit that @noranekonobokkusu

ncborcherding avatar Dec 19 '23 13:12 ncborcherding

@ncborcherding So sorry for being gone for so long, I've been working recently to try and reproduce the issue with little luck - and I just wanted to check up - it appears to me that currently combineTCR() still uses the old C++ code I wrote that caused this issue. And according to @CAC13 this issue isn't present in the old v1.11? Could we actually do a full revert to the R version and then try change up the function components one-by-one? I hope this issue hasn't caused any actual researchers to get incorrect results :P

Qile0317 avatar Jul 25 '24 03:07 Qile0317

@Qile0317

I am not convinced this is an issue from scRepertoire side of things. I have tried multiple data sets to reproduce the error and do not find the discrepancy. I believe this is an issue with mismatched barcodes somewhere or possible duplicate barcodes? combineExpression() will automatically remove duplicate barcodes.

Here is a reproducible example using the full scRep example using a modified version of the above script:

whole_object <- readRDS("/Users/nick/Documents/GitHub/scRepertoire/vignettes/articles/scRep_example_full.rds")
#No barcode duplicates in the large seurat object
length(rownames([email protected]))
#> [1] 30105
length(unique(rownames([email protected])))
#> [1] 30105

#creating test objects
multiplex_samples <- c("P17B", "P17L")
test_object_multiplex <- subset(whole_object, orig.ident %in% multiplex_samples)
test_object_P17B <- subset(whole_object, orig.ident == "P17B")

#VDJ filtering
#whole_vdj has already been filtered for barcodes matching with the seurat object
P17B_contig <- contig_list[1]

####multiplex processing
#LoadContigs
multiplex_contig_list <- contig_list[1:2]

#combine TCR contigs
multiple_combined_tcr <- combineTCR(multiplex_contig_list,
                                    samples = c("P17B", "P17L"),
                                    removeNA = F,
                                    removeMulti = F,
                                    filterMulti = T)

#dims of combined TCR lists and intersect with seurat object barcodes
dim(multiple_combined_tcr$P17B)
#> [1] 2805   12
length(intersect(multiple_combined_tcr$P17B$barcode, rownames([email protected])))
#> [1] 2640

#add combined tcr data to object
test_object_multiplex <- combineExpression(multiple_combined_tcr,
                                           test_object_multiplex,
                                           cloneCall = "aa",
                                           group.by = "sample",
                                           proportion = F,
                                           cloneSize = c(Single=1, Small=5, Medium=15, Large=50, Hyperexpanded=5000))

#combine TCR contigs
P17B_combined_tcr <- combineTCR(P17B_contig,
                                   samples = "P17B",
                                   removeNA = F,
                                   removeMulti = F,
                                   filterMulti = T)

#dims of combined TCR lists and intersect with seurat object barcodes
dim(P17B_combined_tcr$P17B)
#> [1] 2805   12
length(intersect(P17B_combined_tcr$P17B$barcode, rownames([email protected])))
#> [1] 2640

#add combined tcr data to object
test_object_P17B <- combineExpression(P17B_combined_tcr,
                                         test_object_P17B,
                                         cloneCall = "aa",
                                         group.by = "sample",
                                         proportion = F,
                                         cloneSize = c(Single=1, Small=5, Medium=15, Large=50, Hyperexpanded=5000))

#####comparing P17B in the multiplex vs singular
dim(subset(test_object_multiplex, orig.ident == "P17B"))
#> [1] 33538  3020
dim(test_object_P17B)
#> [1] 3538  3020


length(intersect(rownames(subset(test_object_multiplex, orig.ident == "P17B")@meta.data),rownames([email protected])))
#> [1] 3020


length(which(!is.na(subset(test_object_multiplex, orig.ident == "P17B")$CTaa)))
#> [1] 2640
length(which(!is.na(test_object_P17B$CTaa)))
#> [1] 2640

For now I am going to close this until someone can provide a reproducible example of the problem.

Nick

ncborcherding avatar Jul 25 '24 14:07 ncborcherding