
Quantification of scRNA-seq data with UMIs?

Open davismcc opened this issue 7 years ago • 16 comments

Hi Salmon team

Are there plans afoot to support quantification of scRNA-seq data with unique molecular identifiers (UMIs)? UMIs are very commonly used in scRNA-seq data now, and correct quantification requires "de-duplication" of the reads so that each UMI is only counted once for expression quantification.

Doing this is not entirely trivial, as a quick survey of tools available shows (e.g. UMI-tools, umitools, umis, umi).

Nevertheless, it would be very convenient for those Salmon fans amongst us dealing with scRNA-seq data to be able to process UMI-tagged data directly with Salmon. Selfishly, this would be awesome for the Salmon integration with the scater package (now fully implemented).

Not to make a thing of it, but kallisto is now providing some support for UMI quantification (https://pachterlab.github.io/kallisto/singlecell.html) ;P

Best Davis

davismcc avatar Sep 26 '16 09:09 davismcc

I would argue that Kallisto is "cheating" a bit by looking at expression of the equivalence classes :p, avoiding the ambiguous assignment issue.

Just offering an opinion as a Salmon fan, but to me it seems extremely hard to make the data generation model of coverage based sequencing compatible with umi-tags.

I've been thinking a bit about this, and actually, if the PCR bias model in 0.7/Alpine is good, it might even make sense to ignore the UMI and quantify expression based on the mRNA tags alone. In this case you would just need to make the transcript length constant per transcript, and only update the effective lengths based on the sequence biases.

The reason you have UMIs in the first place is that 3'/5'-tag libraries will have much lower complexity than full-length libraries, so you can argue that PCR bias will be a larger problem. But if it is possible to accurately account for PCR bias with GC content, maybe they are not so needed?
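To make that idea concrete, here is a minimal sketch (not Salmon's actual effective-length computation; the constant nominal length and the bias weights are purely illustrative): hold the length term fixed per transcript and let only a fitted sequence/GC bias weight modulate it.

```python
# Sketch only: for 3'-tag data, drop the fragment-length term entirely and let
# a fitted sequence/GC bias weight be the only thing that changes the
# effective length used in the abundance estimate.
NOMINAL_LEN = 1.0  # constant "length" per transcript (illustrative)

def effective_length(bias_weights):
    """bias_weights: hypothetical per-position weights from a bias model
    (e.g. an Alpine-style GC model) over the 3' window of a transcript."""
    mean_bias = sum(bias_weights) / len(bias_weights)
    # a transcript the protocol samples poorly (low bias weight) gets a smaller
    # effective length, so each of its reads counts for more
    return NOMINAL_LEN * mean_bias

print(effective_length([0.9, 1.1, 1.0]))    # ~1.0: roughly unbiased
print(effective_length([0.2, 0.3, 0.25]))   # ~0.25: heavily under-sampled region
```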

vals avatar Sep 26 '16 09:09 vals

Hi guys,

I'm certainly open to adding this type of thing if there's sufficient desire for it. I agree with @vals that making umi tags compatible with a data generation model based on expected coverage seems very difficult. Then, the question just becomes what is the best way to support umi-tagged data. I'm open to suggestions, as well as to good datasets against which different approaches may be tested.

--Rob

rob-p avatar Oct 03 '16 20:10 rob-p

Thanks, both, for your thoughts here.

As I understand it, current UMI quantification approaches take a BAM file with read alignments and then (hopefully in some smart way) count unique UMIs from reads aligned to (overlapping?) genomic features of interest. In the first instance, can Salmon produce output compatible with those sorts of approaches? (I seem to recall it's possible to output (pseudo)BAMs, but I have not yet had a need for this.)
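For concreteness, a bare-bones sketch of that style of counting (illustrative only; assumes pysam, a transcriptome BAM, and that the UMI is stashed in the read name, and does no UMI error correction):

```python
# Minimal sketch of naive per-feature UMI counting from an aligned BAM.
# Assumes the UMI was stored as the last '_'-separated field of the read name
# (an illustrative convention, not a standard).
from collections import defaultdict
import pysam

def count_umis(bam_path):
    umis_per_feature = defaultdict(set)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped:
                continue
            umi = read.query_name.rsplit("_", 1)[-1]
            umis_per_feature[read.reference_name].add(umi)
    # each UMI counts once per feature, however many reads carried it
    return {feature: len(umis) for feature, umis in umis_per_feature.items()}
```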

@vals: your suggestion of just ignoring UMIs is interesting - hadn't thought about that. It would be cool to figure out if that actually works as you suggest it might.

I don't have any brilliant brainwaves to offer at the moment, but to your first point, Rob, I definitely think the desire is/will be there. The sheer number of cells being sequenced demands very computationally efficient quantification, and since Salmon is at least as accurate as competitors while being extremely fast, in my mind Salmon is the leading contender for very wide use.

Apparently 10X is about to drop a dataset of 1.3M cells, so yeah...fast methods needed.

D.

davismcc avatar Oct 24 '16 20:10 davismcc

@davismcc What if you mark duplicates using UMIs based on the pseudobam and then use the unique reads to re-quantify? This is pretty ugly since you're quantifying twice and also potentially re-writing the fastq files, but it should hopefully eliminate a majority of the PCR duplicates.
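Roughly, something like this (a sketch only; the UMI-in-read-name convention and the file handling are assumptions, not anything salmon does): keep one read per (reference, UMI) pair from the pseudo-BAM, then re-quantify only the survivors.

```python
# Sketch of the two-pass idea: keep the first read seen for each
# (reference, UMI) pair in the pseudo-BAM, collect the surviving read names,
# then filter the FASTQ to those names and run quantification again.
import pysam

def deduplicated_read_names(pseudobam_path):
    seen = set()
    keep = set()
    with pysam.AlignmentFile(pseudobam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped:
                continue
            umi = read.query_name.rsplit("_", 1)[-1]  # assumed naming convention
            key = (read.reference_name, umi)
            if key not in seen:
                seen.add(key)
                keep.add(read.query_name)
    return keep  # subset the FASTQ to these names, then quantify a second time
```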

vasisht avatar Nov 11 '16 16:11 vasisht

Hi all,

I'm just chiming in here to say that we are definitely interested in supporting scRNA-seq data "out of the box". At this point, it's really just a matter of deciding what the best approach is. That is, do we have a sufficiently good idea of the appropriate "model" for scRNA-seq to implement that, or is a de-duplicated UMI count over transcripts and equivalence classes the best we can do at this point? I'm open to ideas, thoughts, and suggestions on how to test this as we start incorporating this feature into Salmon.

--Rob

rob-p avatar Nov 11 '16 16:11 rob-p

@rob-p There are other applications (e.g. RNA from extracellular vesicles, low input RNASeq) that use UMIs. I'm not sure this should be restricted just to the scRNA-seq case. I don't know how the models would differ for each use case. I don't have any firm thoughts on it at the moment, but happy to help test the different approaches.

vasisht avatar Nov 11 '16 16:11 vasisht

Hi @rob-p and @davismcc . A bit delayed, but this relates to the questions I've been asking on the salmon gitter.

First, it's worth pointing out that the new 10x (v2) sequencing is a lot more like other bead methods, where (i) index reads (i7/i5) are for labelling biological samples, (ii) read1 contains the combined cell and molecular/UMI barcodes, and (iii) read2 is the transcript 3' read. So it seems there is now some data format convergence. Either way, I'd guess that ongoing iterations of the high throughput platforms will keep one read for the transcript 3', reserving the other 2 or 3 reads for some combination of the sample, cell and molecular barcodes.
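In code, that layout is simple to describe (a sketch; the 16 bp cell barcode and 10 bp UMI lengths are the 10x v2 convention, and the example sequence is made up):

```python
# Sketch of the 10x (v2)-style layout described above: read1 carries the cell
# barcode followed by the UMI; read2 is the transcript 3' read.
CELL_BC_LEN = 16  # 10x v2 cell barcode length
UMI_LEN = 10      # 10x v2 UMI length

def split_read1(seq):
    """Split a read1 sequence into (cell_barcode, umi)."""
    return seq[:CELL_BC_LEN], seq[CELL_BC_LEN:CELL_BC_LEN + UMI_LEN]

read1 = "AAACCTGAGAAACCAT" + "GTCTGACTAC"  # cell barcode + UMI (illustrative)
cell_bc, umi = split_read1(read1)
print(cell_bc, umi)
```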

Before thinking about how best to collapse UMIs, there's also the issue of how best to QC the barcodes and beads. Jim Namesh has some functions, as does Vasilis Ntranos. Arguably this has nothing to do with salmon/kallisto, though I think the kallisto guys were smart to include it. It's a good filter even if only for speeding things up.

Then it's really a question of what might be the most appropriate demultiplexing of fastqs to allow compatibility between techniques, I guess. I quite like how the kallisto workflow ends up with a fastq per cell together with a matching UMI file. Then at the very least one can ignore the UMIs (perhaps going with what @vals suggests).
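A toy version of that per-cell demultiplexing, just to illustrate the output layout (purely a sketch; barcode/UMI lengths follow the 10x v2 convention, file paths are placeholders, and the output directory is assumed to exist):

```python
# Toy sketch of "one FASTQ per cell plus a matching UMI file" demultiplexing.
# Deliberately naive: it reopens the per-cell files for every record.
import gzip
from itertools import islice

CELL_BC_LEN, UMI_LEN = 16, 10  # 10x v2 geometry (illustrative)

def demux(read1_fastq, read2_fastq, out_dir):
    with gzip.open(read1_fastq, "rt") as r1, gzip.open(read2_fastq, "rt") as r2:
        while True:
            rec1 = list(islice(r1, 4))  # one FASTQ record from the barcode read
            rec2 = list(islice(r2, 4))  # matching record from the cDNA read
            if not rec1 or not rec2:
                break
            cell_bc = rec1[1][:CELL_BC_LEN]
            umi = rec1[1][CELL_BC_LEN:CELL_BC_LEN + UMI_LEN]
            # append the cDNA read and its UMI to this cell's pair of files
            with open(f"{out_dir}/{cell_bc}.fastq", "a") as fq, \
                 open(f"{out_dir}/{cell_bc}.umi", "a") as uf:
                fq.writelines(rec2)
                uf.write(umi + "\n")
```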

Not sure if that's helpful. But I thought to chime in as somebody who would love to see salmon working on the high-throughput single-cell platforms that have sample, cell and molecular barcodes. Even if only to test how worthwhile UMIs genuinely are for most applications. This may be a controversial comment, but I suspect for me UMIs will largely end up the same way as spike-ins: useful for quantifying endogenous RNA recovered per cell but perhaps not all that useful beyond that for low read depth single-cell signature profiling.

qfwills avatar Dec 08 '16 00:12 qfwills

I think not doing the demultiplexing is better for practical reasons. I have a dataset with 120,000 cells, and I don't think a file system can handle that many files. And I think sample sizes like this will be common soon.

vals avatar Dec 08 '16 10:12 vals

Agreed! I honestly would prefer to just work with a single fastq where I stipulate what the sample, cell and molecular barcodes are.

qfwills avatar Dec 08 '16 17:12 qfwills

Hi there @rob-p . I'm at the Broad/IMES for 3 months as mentioned previously, working with the Shalek team and their new Seq-Well platform. As a bit of context, this is one of the labs setting the standard for big single-cell "atlasing". My thoughts are still the same as they were in December: irrespective of the method (10x, Seq-Well, or split-pooling), the go-to will be where one read is used for barcoding (cell barcode with/without UMI) and the other is for the transcript. I'd still be very keen to see the pseudoaligners supporting this style of single-cell multiplexing (and possibly the UMIs too). Have you perhaps had any further thoughts?

qfwills avatar Feb 21 '17 21:02 qfwills

Hi @davismcc,

Yes; this is very high up on our to-do list. Right now, we are primarily limited by people (students and myself) able to actually hack away on the codebase. I would say that adding support for single-cell data is in the top 1-3 features on our to-do list right now. I'd also encourage you to voice your support for this feature on our survey, which we will use to prioritize feature development.

In addition to the implementation, the other big "question" will be how to support the broadest variety of such data with the most uniform interface and implementation. It seems like barcoding / UMI tagging is a bit "wild-west" right now, where every protocol uses its own format to encode the relevant information. I think that, in that case, some form of pre-processing (a la @vals's work) might be the best solution.

rob-p avatar Feb 24 '17 16:02 rob-p

Hi @rob-p . Totally understood (even more severe current limitations here) - survey completed. I think there'll "always" be Illumina-level coding (we use it to multiplex samples or cells), but I suspect most (all?) wild-west methods will be some form of using one read for barcoding. So as long as I can stipulate which bases in the read are which kind of barcode (cell/molecular), that'd be a good start. Of course, having more mature methods than the current drop-seq protocol to error correct, remove poly-A, remove adaptor sequences, etc. would always be very welcome. (I suspect @vals is onto something... I still struggle to be entirely convinced that UMIs, as currently used, have the long-term legs that some people think.)

qfwills avatar Feb 27 '17 14:02 qfwills

Yes, scRNA-Seq can be quite useful. Is there any ongoing work on this?

antonkulaga avatar Feb 08 '18 22:02 antonkulaga

It looks like they're starting work on scRNA support: https://github.com/COMBINE-lab/salmon/blob/a41c6b4e38fb23e51b59dc4a0a450071dc92c180/src/CollapsedCellOptimizer.cpp

Seems like @k3yavi is doing most of the implementation so far. Our group would definitely be interested in this functionality, but I understand how difficult resourcing for new features can be. Thanks for all the hard work!

Edit: Barcode detection / preprocessor looks like it lives here: https://github.com/k3yavi/alevin

mdshw5 avatar Mar 09 '18 13:03 mdshw5

Hi @mdshw5,

@k3yavi was originally developing the barcode algorithm in a separate repo, but all of this work has been merged into the salmon repo now. The new alevin command runs the single-cell method, which handles barcode identification and correction, mapping and UMI deduplication, and which is described in this bioRxiv preprint that just landed. We're still actively developing and improving the method and very much welcome any feedback!
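For anyone landing here later, a typical invocation looks roughly like the following (a sketch based on the alevin documentation; the index, FASTQ, and transcript-to-gene map paths are placeholders):

```python
# Rough sketch of running alevin on 10x Chromium (v2) data from Python.
# Paths are placeholders; see the alevin docs for the authoritative options.
import subprocess

subprocess.run([
    "salmon", "alevin",
    "-l", "ISR",                 # library type
    "-i", "salmon_index",        # salmon transcriptome index (placeholder)
    "-1", "reads_R1.fastq.gz",   # barcode + UMI reads
    "-2", "reads_R2.fastq.gz",   # cDNA reads
    "--chromium",                # 10x Chromium v2 barcode geometry
    "-p", "8",                   # threads
    "--tgMap", "txp2gene.tsv",   # transcript-to-gene map (placeholder)
    "-o", "alevin_out",
], check=True)
```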

rob-p avatar Jun 01 '18 20:06 rob-p

Hi all, it seems that the current HUBMAP data does not support conversion from salmon output to UMI count data. Do you have any ideas about transferring such salmon data into UMI count data? Thanks a lot. https://portal.hubmapconsortium.org/browse/dataset/c6bb00096b0cf40751f9d6003fb730c7#files

HelloWorldLTY avatar Feb 09 '23 18:02 HelloWorldLTY