CITE-seq-Count

Sequence saturation UMI index & barcode rank features

Open danmoore1987 opened this issue 4 years ago • 15 comments

Hi @Hoohm, thank you again for the easy-to-use package!

Just a suggested enhancement that I think a lot of people might be interested in: it would be cool if the CITE-seq-Count report workflow also generated barcode rank and UMI saturation index plots/CSVs! :)

danmoore1987 avatar Sep 23 '19 03:09 danmoore1987

Hello @danmoore1987, why not, would you have any specific example to link here so that I can take a look?

Hoohm avatar Oct 01 '19 15:10 Hoohm

Hi @Hoohm, Thanks for the reply!

Essentially, some of the Cell Ranger outputs. The first one measures how many UMIs were assigned per barcode: [image: gex-barcode-rank-plot]
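For illustration, the data behind a barcode rank plot can be sketched in a few lines of Python. This is a minimal sketch, not Cell Ranger's or CITE-seq-Count's implementation; the function name `barcode_rank` and the input dictionary of per-barcode UMI totals are hypothetical.

```python
def barcode_rank(umis_per_barcode):
    """Return (rank, barcode, umi_total) tuples, highest UMI total first.

    `umis_per_barcode` is a hypothetical dict mapping cell barcode -> total UMIs.
    Ties are broken alphabetically by barcode so the output is deterministic.
    """
    ordered = sorted(umis_per_barcode.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, bc, total) for rank, (bc, total) in enumerate(ordered, start=1)]


ranks = barcode_rank({"AAA": 5, "CCC": 50, "GGG": 5})
```

Plotting rank against UMI total on log-log axes gives the familiar "knee" curve.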

The second is this. It's so we know if we need to sequence the CITE library deeper: [image: Capture_saturation index]

Cell Ranger uses this formula for it:

Sequencing Saturation = 1 - (n_deduped_reads / n_reads)

where:

n_deduped_reads = Number of unique (valid cell-barcode, valid UMI, gene) combinations among confidently mapped reads.

n_reads = Total number of confidently mapped, valid cell-barcode, valid UMI reads.
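As a worked example of the formula above, here is a minimal Python sketch. The function name is hypothetical and the input numbers are made up for illustration.

```python
def sequencing_saturation(n_deduped_reads: int, n_reads: int) -> float:
    """Sequencing Saturation = 1 - (n_deduped_reads / n_reads)."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return 1.0 - n_deduped_reads / n_reads


# Made-up numbers: 1000 usable reads collapsing to 250 unique
# (cell barcode, UMI, feature) combinations.
saturation = sequencing_saturation(250, 1000)
```

A value near 1 means most additional reads would be duplicates of molecules already seen; a value near 0 suggests deeper sequencing would still recover new molecules.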

Cheers!

danmoore1987 avatar Oct 11 '19 05:10 danmoore1987

Just want to echo: this would be quite useful to know. We're actively trying to sort out whether some of our lesser libraries would benefit from more sequencing depth.

I believe the run_report gives data to calculate saturation, at least globally, right? I think it would be valid to use 'Reads processed' as n_reads (we could adjust by percentage mapped?), and 'UMIs corrected' as n_deduped_reads, correct?

The per-cell plot above is informative. Presumably one could read/merge the 'read_count' folder and 'umi_count' folders to accomplish this, right?
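One way to sketch that merge is shown below, in Python rather than R. It rests on assumptions not confirmed in this thread: that both folders hold a Matrix Market coordinate matrix with integer counts, with features as rows and cell barcodes as columns, and that both matrices list barcodes in the same column order. The function names are hypothetical, and the functions take already-decompressed lines so that file handling stays out of the way.

```python
from collections import defaultdict


def column_sums(mtx_lines):
    """Sum a Matrix Market coordinate matrix per column (1-based column index)."""
    sums = defaultdict(int)
    dims_seen = False
    for line in mtx_lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and the %%MatrixMarket header/comments
        if not dims_seen:
            dims_seen = True  # first data line is "n_rows n_cols n_entries"
            continue
        _row, col, value = line.split()
        sums[int(col)] += int(value)
    return dict(sums)


def per_cell_saturation(read_lines, umi_lines):
    """Per-column saturation: 1 - (UMI total / read total)."""
    reads = column_sums(read_lines)
    umis = column_sums(umi_lines)
    return {col: 1.0 - umis.get(col, 0) / n for col, n in reads.items() if n > 0}


read_mtx = ["%%MatrixMarket matrix coordinate integer general",
            "2 2 3", "1 1 10", "2 1 10", "1 2 4"]
umi_mtx = ["%%MatrixMarket matrix coordinate integer general",
           "2 2 3", "1 1 3", "2 1 2", "1 2 4"]
result = per_cell_saturation(read_mtx, umi_mtx)
```

With real output, the lines would come from something like `gzip.open(path, "rt")`, and column indices would be mapped back to barcodes via the barcodes file.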

bbimber avatar Dec 11 '20 22:12 bbimber

I need to sanity check the data, but this is derived from combining umi_count and read_count folders:

image

the R code is here: https://github.com/BimberLab/cellhashR/blob/2211878b792d7c0c5ff48e4183cdcd7a44dec8b8/R/Preprocessing.R#L278

bbimber avatar Dec 13 '20 14:12 bbimber

This is great @bbimber !

I also checked out the rest of your cellhashR package for post-processing QC of libraries. Can't wait to give it a go! :)

danmoore1987 avatar Dec 24 '20 08:12 danmoore1987

@danmoore1987 yes, I'm still surprised there aren't more tools that do what we're trying to do in cellhashR. We'd welcome any feedback. Part of my goal with cellhashR is to specifically compare across different calling algorithms, since we find some do better or worse with different inputs.

With respect to saturation in particular, it would be great if you could confirm the tool is giving you believable values. I was surprised how non-saturated our libraries often were, but this wasn't something I had been tracking.

bbimber avatar Dec 24 '20 14:12 bbimber

Ok, folks, I'm on holiday!!!

Let me take a look since I'm gonna work on this damn 1.5 release!!!

I'll keep you posted :)

Thanks for the code!

Hoohm avatar Dec 24 '20 14:12 Hoohm

@Hoohm No worries - I actually think we implemented this in cellhashR; however, I'd love to figure out features that make this work synergistically with Cite-Seq-Count.

bbimber avatar Dec 24 '20 14:12 bbimber

Yes! That would be amazing. Can you send me an email so we can have a quick chat these days maybe?

Hoohm avatar Dec 24 '20 15:12 Hoohm

Ok, so 1.5.0 is nearly finished. Running some tests on datasets to see how it matches the older version.

For your specific needs, here is a non-exhaustive list of changes that affect your code:

  • MTX format has changed. First column is now the TAG sequence, second column the feature name. This means that Read10X runs by default on the right column (gene.column=2).
  • UMI MTX counts as well as the dense matrix have dropped the unmapped feature.
  • For technologies such as 10x v3, which uses two different barcodes for each cell when running mRNA and protein data, you can now provide the translation reference: first column, cell barcode in the mRNA data; second column, cell barcode in the protein data. The MTX outputs will have two columns in the barcodes.tsv file: the first (default) will be the mRNA barcode, the second will be the protein barcode.

I think these are the only ones affecting your code, but I might be missing something. Let me know :)

Hoohm avatar Dec 30 '20 09:12 Hoohm

@Hoohm is there a heuristic that code can use to determine what format of input it's getting? For example, if we have a function like processCiteSeqCount(outputFolder), can this code automatically figure out what format it was passed?

bbimber avatar Dec 30 '20 14:12 bbimber

Not sure which format you are referring to.

If you are talking about the translated version, then yes, the barcodes.tsv will hold two columns instead of one.
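A check along those lines could look like the following Python sketch. It assumes the barcodes file is tab-separated with no header row, which is not confirmed in this thread; the function name is hypothetical.

```python
def has_translated_barcodes(barcodes_lines):
    """True if the first non-empty barcodes.tsv line has two or more columns.

    Assumes tab-separated values and no header row (an assumption, not
    something stated by the CITE-seq-Count author).
    """
    for line in barcodes_lines:
        if line.strip():
            return len(line.rstrip("\n").split("\t")) >= 2
    return False  # empty file: nothing to suggest a translated run


translated = has_translated_barcodes(["AAACCTG\tAAACGGA\n", "AAAGATG\tAAAGTCC\n"])
untranslated = has_translated_barcodes(["AAACCTG\n", "AAAGATG\n"])
```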

Hoohm avatar Dec 30 '20 15:12 Hoohm

Maybe I misunderstood, but in your prior post didn't you say the MTX format is changing in version 1.5.0? Ideally, I would like cellhashR::ProcessCountMatrix() to just work with either the output from Cite-Seq-Count 1.5.0 or prior versions. I suppose I could read the matrix into memory with gene.column=1, test for the presence of 'unmapped', and if it's not present, re-read using gene.column=2?
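The heuristic proposed here can be sketched by inspecting the features file directly instead of reading the matrix twice. This is a Python illustration only, not cellhashR's actual code; the function name is hypothetical, the file is assumed to be tab-separated, and the fallback to column 1 when there is no second column is an added assumption.

```python
def guess_feature_column(features_lines):
    """Guess which 1-based features column holds the feature names.

    Heuristic from the thread: older output still contains an 'unmapped'
    feature in column 1; output without it is assumed to be the 1.5.0 layout
    (TAG sequence in column 1, feature name in column 2).
    """
    rows = [line.rstrip("\n").split("\t") for line in features_lines if line.strip()]
    if any(row[0] == "unmapped" for row in rows):
        return 1
    if rows and all(len(row) >= 2 for row in rows):
        return 2
    return 1  # assumption: with no second column, column 1 is all there is


old_style = guess_feature_column(["CD3\n", "CD4\n", "unmapped\n"])
new_style = guess_feature_column(["ACGTACGT\tCD3\n", "TTGACCAA\tCD4\n"])
```

The returned value would then be passed as `gene.column` (or the equivalent) when loading the matrix.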

bbimber avatar Dec 30 '20 17:12 bbimber

It's not changing that much.

I would really love to have a chat on zoom with you, would be interesting to have a back and forth about this since I'm not completely fixed on everything.

Hoohm avatar Dec 30 '20 17:12 Hoohm

Sure, would be happy to. I didn't realize you worked at 10x until I googled your name just now. My email is [email protected]

bbimber avatar Dec 30 '20 17:12 bbimber