alevin-fry
alevin-fry copied to clipboard
Questions regarding barcodes with zero reads after deduplication
Hi
I've been reading and working through various tutorials and bits of documentation, and have found the alevinQC package very helpful for understanding what is going on with the generate-permit-list step.
I'm a bit confused about the mapping step after the permit list step. It feels like there's a missing QC step here. I'm describing my understanding of the process, both to check I've got it right and in case it's helpful to someone else.
In the (very nice) alevinQC report, we get something like this for the white list, telling us which of the barcodes are most likely "real cells" rather than empty droplets:

These reads are then mapped, and some of them end up not mapping to any genes at all. This gives us the actual counts for each white-listed barcode, and looks like this:

I have a couple of questions:
- The barcodes with 0 / very few genes assigned to them are useless, right? Shouldn't we also have a step to exclude them? For example a second knee step based on number of genes detected?
- More out of curiosity, what is going on with these barcodes?? The plot below shows that we get many barcodes whose frequencies are similar to "real" cells, and even ones with super high frequencies (e.g. the highest frequency in this sample), where the reads do not map to any genes at all. Is this expected?
Thanks! Will

Hi @wmacnair,
Thanks for the detailed question (and for including the nice graphics!). I will also cc @csoneson on my reply for her — always invaluable — take.
Regarding your first question; yes. There is often a desire to perform a filtering based on the final UMI counts of each cell. Note, that the UMI counts (after deduplication) can differ, sometimes dramatically, from the read counts before deduplication. Once you have loaded the expression matrix from alevin-fry into your analysis environment (say R), then you likely want to use some package like DropletUtils to do a cell filtering. In fact, @DongzeHE has directly contributed to DropletUtils the function for applying the CellRanger-like emptyDrops filtering (originally reverse-engineered from the CellRanger implementation by Alex Dobin). Of course, the challenge is that, to keep the processing both time and memory efficient, we must determine a putative set of barcodes prior to quantification, so that we can group together all reads that belong to the same cell (corrected barcode), which is what is done in the collate step.
Regarding your second question, it's a bit less clear what's going on with the number of detected genes. One question I have is, what count values are you using for gene detection? Is it possible that some of the cells have high unspliced gene content that is being discarded if using the default scRNA output_format when loading in the expression matrices? It is worth noting that the barcode counts for unmapped reads are propagated though the analysis, and are used to provide some of the output statistics that AlevinQC eventually uses. However, they are segregated in the initial mapping step, since unmapped reads cannot be used for quantifying genes. To explain the behavior you see, I can think of a couple of possible explanations off the top of my head (there may, of course, be more or better ones), in no particular order (1) There are some highly-sequenced empty/contaminated droplets that get a high read count but result in no expressed genes, (2) These cells have a high fraction of intronic reads which are being discarded in the default input mode, leading to a large count but no detected genes. It is worth noting that after quantification, the reads mapping to genes have been collapsed based on UMI. However, for unmapped reads, the UMIs cannot be reliably collapsed (since you need to know the gene from which they derive to collapse them). This means, it makes sense to compare the counts of reads deriving from mapped and unmapped reads per barcode initially, but it doesn't really make sense to compare the collapsed UMI count of mapped reads post-quantification to the read count of unmapped reads. I'll punt to @csoneson on what, particularly, is being displayed in the plots at the bottom of your post.
Best, Rob
Hi @wmacnair 🙂
I'm a bit surprised by the detected gene knee plot. I have seen other datasets where there were cells with basically no detected genes, but there were usually maybe 2-10 such cells, not hundreds as it looks like in your case 🤔 . In those data sets, I was running alevin-fry with a splici index, so intronic reads would be counted. Looking at the data in more detail (e.g., via the cbTable output of alevinQC::readAlevinFryQC()), it seems there is extreme UMI duplication in those cells - there is a fair number of UMIs mapped, but only very few unique ones (and consequently very few genes). Is that the case also for you? Maybe extracting the reads corresponding to these cell barcodes and looking where they map on the genome would give some hints of what's going on and whether there really seems to be a high degree of duplication?
As for what's being shown in the plots, alevinQC doesn't read the count matrix, but only considers the metadata. The gene knee plot displays the NumGenesExpressed column from the feature dump. The last set of plots relate the CorrectedReads, DeduplicatedReads and NumGenesExpressed columns.
Hi both, thanks for your detailed responses. I need to read them carefully, so it will take me a little while to respond to the tricky bits.
On the easy bit, that makes sense regarding the knee plots.
I think I would then suggest making this more explicit in the documentation. At present, the help page for generate-permit-list says 'it determines what cell barcodes should be associated with “true” cells', and it sounds like that's not quite right. Is that fair? I think it would be helpful to add a summary of what you've said here, and also say explicitly that most users will want to run something like DropletUtils::emptyDrops in addition. I think my confusion comes from expecting it to be like cellranger, where (hope I've got this right) there is also a knee step, but after that you're in principle good to go.
I will have a look at which genes are present in the barcodes that collapse to zero. I didn't make any choices around which values are used for gene detection - these are just alevinQC default outputs (as Charlotte has described).
For some additional context, here are some notes on what I've done:
- This is single cell PBMC data.
- I used the most recent annotations / genomes from gencode (specifically gencode.v40.primary_assembly.annotation_annots.txt.gz and gencode.v40.primary_assembly.annotation.gtf.gz).
- I generated a splici index from these.
- I ran the code at the bottom for each sample.
I'm a complete noob when it comes to mapping, so there are multiple things I'm unsure about:
- Is this the right reference? A colleague who knows more about genomes than I do suggested this, but we could be wrong.
- There are a very large number of pseudogenes in the reference, which could be expected but surprised me. Could this explain the weird collapsing barcodes?
- The percentage of spliced reads seems low. If I compare S and U reads, it seems to be ~30% spliced. This is similar to what I saw in single nuclei data, and is quite a bit lower than I was expecting.
Hope this helps give some clues. I'll dig around in the genes for the collapsing barcodes and let you know what I find.
Will
part of my bsub job:
## map
salmon alevin -lISR -1 $r1_fqs -2 $r2_fqs --chromiumV3 -i $idx_f -p $n_cores \
-o outs_alevin --tgMap $t2g_f --sketch --dumpFeatures
## find proper cells
alevin-fry generate-permit-list --input outs_alevin --expected-ori fw --output-dir outs_fry -k
## group barcodes
alevin-fry collate -i outs_fry -r outs_alevin -t $n_cores
## get counts
alevin-fry quant -t $n_cores -i outs_fry -o outs_fry -m $t2g_f -r cr-like --use-mtx
A bit of further analysis, looking at all of our samples. It seems that we consistently get ~5% of permit-list barcodes ending up with 0 genes (ignoring two samples with v small permit lists).
sample_id n_total n_zero pct_zero
1: CN003 141 2 1.4
2: CN008 1550 56 3.6
3: CN004 4420 193 4.4
4: EG001_P 19977 966 4.8
5: EG004_C 10269 508 4.9
6: CN005 8256 427 5.2
7: EG004_P 17866 1009 5.6
8: CN002 6367 361 5.7
9: EG002_C 3784 216 5.7
10: EG003_C 15359 870 5.7
11: CN007 7334 432 5.9
12: CN001 1721 106 6.2
13: EG001_C 3383 215 6.4
14: EG002_P 9179 587 6.4
15: CN006 2154 145 6.7
16: EG003_P 186 21 11.3
I picked the sample with the largest number of zero-gene barcodes (EG004_P), and calculated medians for various measures recorded in cbTable. The number of mapped UMIs is very similar, and I think the 0s for umi count and deduplication rate are expected.
Is the higher mapping rate in zero-gene barcodes surprising? This is a chunk higher in the barcodes with zero genes.
zero_genes quantile q_mapped_umis q_umi_count q_dedup q_mapping
1: FALSE 0.25 3353 500 0.1452 0.6734
2: FALSE 0.50 4645 1180 0.2864 0.7115
3: FALSE 0.75 6678 1672 0.3072 0.7737
4: TRUE 0.25 2751 0 0.0000 0.9549
5: TRUE 0.50 4208 0 0.0000 0.9591
6: TRUE 0.75 7073 0 0.0000 0.9625
What would be the easiest way to see where these highly duplicated barcodes map to in the genome?
Thanks! Will
Hi @wmacnair,
Are any of these samples sharable? There are a few ways one might check what's going on. One would be to take the collated RAD file, and then use the view command to produce a text-formatted version. From there, you could grep the corrected cell barcode to get all of the records for that cell. This would produce a list of records whose raw UMI frequency could then be analyzed. Any thoughts on this @csoneson?
Best, Rob
If you happen to have a bam file of reads mapped to the genome, I think you could extract the reads corresponding to a given cell barcode using e.g. https://github.com/10XGenomics/subset-bam. Regarding the higher mapping rate I don't know - if the reads are all coming from the same place maybe that does make sense (either they all map, or none of them maps). I'm a bit puzzled by the fact that the deduplicated UMI count is exactly zero (do I read the table correctly?) - in my examples there were few, but not zero, entries retained. 🤔
Is it possible that in the QC aggregation that, somehow, unmapped UMIs are being considered as mapped?
The nbrMappedUMI columns directly corresponds to the MappedReads column in the alevin-fry output, and the totalUMICount to the DeduplicatedReads one:
https://github.com/csoneson/alevinQC/blob/f406fd6b3f1e327de749e7dec989351137f07022/R/readAlevinFryQC.R#L182-L183
One of @wmacnair's points that I've overlooked is the reference itself. It may indeed be worth considering using something like the 10X reference annotation. A large number of pseudogenes could result in a larger degree of multi-mapping, which, under the standard cr-like resolution strategy (as well as under the default strategy of other tools) will be discarded. Is this a standard / publicly-available PBMC dataset we could look at?
Hi both, thanks again for taking the time!
Some answers / responses, slightly reordered to help me make sense of new information:
Are any of these samples sharable? Is this a standard / publicly-available PBMC dataset we could look at?
Sadly not, I think - they're human samples from a clinical study that's only just started.
I'm a bit puzzled by the fact that the deduplicated UMI count is exactly zero (do I read the table correctly?) - in my examples there were few, but not zero, entries retained.
Yes, these are exact zeros.
There are a few ways one might check what's going on. One would be to take the collated RAD file, and then use the view command to produce a text-formatted version. From there, you could grep the corrected cell barcode to get all of the records for that cell. If you happen to have a bam file of reads mapped to the genome, I think you could extract the reads corresponding to a given cell barcode using e.g. https://github.com/10XGenomics/subset-bam.
I'm not promising success, but I will have a go at this...!
One of @wmacnair's points that I've overlooked is the reference itself. It may indeed be worth considering using something like the 10X reference annotation. A large number of pseudogenes could result in a larger degree of multi-mapping, which, under the standard cr-like resolution strategy (as well as under the default strategy of other tools) will be discarded. Regarding the higher mapping rate I don't know - if the reads are all coming from the same place maybe that does make sense (either they all map, or none of them maps).
Thanks for these points. Here's my go at synthesising them into a possible explanation, with some questions / comments:
- The reference transcriptome I'm using might have a lot of additional pseudogenes (because I'm using the primary_assembly file? I don't really understand the difference between gencode.v40.annotation.gtf.gz and gencode.v40.primary_assembly.annotation.gtf.gz.)
- For this to explain our completely deduplicated barcodes, these strange barcodes must map pretty much exclusively to some ambiguous pseudogenes. Is this right? Would this also tend to mean that they mapped to a small number of genomic locations?
Things I can do to provide more information:
- Rerun
alevin-fryon this sample with thecellrangerreference. - Try to work out where the "zero" reads are mapping to in the genome.
Let me know if there is anything else useful that I can look at. And also if you have any comments on the possible explanation.
Best Will
I guess I would be a bit surprised if it was because of using the primary file rather than the basic gencode one - it adds 54 genes/59 transcripts (to the 61544/246624 in the annotation.gtf.gz one), and most of them appears to be annotated as protein coding. Maybe one other thing that you can try to pinpoint the cause would be to rerun the alevin-fry quant step with a different UMI resolution strategy (e.g. cr-like-em).
Ok, thanks that's helpful for understanding the differences.
I've just been discussing this with a colleague and we're wondering if it could be the choice of gene biotype that is the issue, specifically the inclusion of pseudogenes. A possible explanation is:
- I include pseudogenes in my reference transcriptome (compared to
cellranger's reference which restricts to protein coding / lincRNA / antisense / IG + TR genes;cellrangerbuild notes here). - Reads that map to both a gene and its v similar pseudogene end up being discarded due to multi-mapping, so the UMI counts for such genes are massively reduced, possibly to zero.
- For this to explain the zero-read droplets, we would have to have some barcodes that only derive from one / few gene, all of which have corresponding v similar pseudogenes.
Does this seem plausible? And in general, what is your advice on including / excluding pseudogenes? (Either at the stage of making your reference transcriptome, or later.)
I think that a situation like that could potentially explain what you see. @rob-p can provide more insight into the UMI deduplication, but I think that the gene/pseudogene pairs must also be very similar, so that there is a tie for the highest count gene for all the UMIs. Running with the cr-like-em UMI resolution would not discard these UMIs, so I think it would indicate if this is where the problem lies.
Ah nice, something testable! I will try to rerun both with cr-like-em, and using the cellranger reference, and let you know the results.
@rob-p, quick Q - could you point me towards the syntax for the view command? Thanks!
@wmacnair,
Sure thing! So, first please make sure you grab the latest release (0.5.1) which fixed an options conflict between the top-level help and the flag for the view mode to include the header information. View looks like this.
❯alevin-fry view -h
alevin-fry-view 0.5.1
Avi Srivastava <[email protected]>
Hirak Sarkar <[email protected]>
Dongze He <[email protected]>
Mohsen Zakeri <[email protected]>
Rob Patro <[email protected]>
View a RAD file
USAGE:
alevin-fry view [OPTIONS] --rad <RADFILE>
OPTIONS:
-h, --help Print help information
-H, --header flag for printing header
-o, --output <RADFILE> output plain-text-file file
-r, --rad <RADFILE> input RAD file
-V, --version Print version information
If you don't pass a file to -o, then it will print to stdout by default. So you could do something like:
alevin-fry view -r collated_rad | grep cell_barcode_of_interest > interesting_records.txt
to get all of the records corresponding to a barcode of interest.
Hmm, cr-like-em doesn't make any difference to that zero tail...
With cr-like (i.e. same plot as above):

With cr-like-em:

The deduplication rate also appears to be tighter.
With cr-like:

With cr-like-em:

Thanks for the view explanation @rob-p. I've done this for the zero-gene barcode with the highest number of reads associated to it. The last column of this output gives the ensemble ID, and counting by this we get:
ENSG N
1: ENSG00000278996.1-I1 313651
2: ENSG00000280441.3-I 313651
3: ENSG00000163046.16-I1 125
4: ENSG00000230876.8-I 87
5: ENST00000629969.1 34
---
1233: ENST00000684490.1 1
1234: ENST00000684676.1 1
1235: ENST00000692836.1 1
1236: ENST00000695035.1 1
1237: ENST00000695038.1 1
i.e. 627302 out of 629498 reads for these barcodes are derived from two genes.
Any thoughts? I've had a look at those genes and I don't really get much from what I see...
Ahhh are these huge counts multimappers?
I'm not really sure what a multimapper is... I've had a look at those two transcripts, but couldn't really see any way to check if they were very similar.
Is that what you meant, or was it something else?
So each entry output by view is a mapping record. But the same read can have many mapping records if the read matches more than one place. This would show up in the output as the same read id having multiple mappings to different genes. I think we also explicitly print a tag for the number of hits for each read (NH, I think).
Ok, yes I think that is the case. These are outputs for the top 40 zero-gene barcodes, with the NH variable added (NH is dominated by 2 and 3, so I lumped the others together):
ensg N_total NH:2 NH:3 Other
1: ENSG00000278996.1-I1 2838790 2687637 151040 113
2: ENSG00000280441.3-I 2838790 2687637 151040 113
3: ENSG00000162669.16-I2 114504 0 114493 11
4: ENSG00000172572.7-I 30160 0 30082 78
5: ENSG00000163046.16-I1 6301 0 6168 133
6: ENSG00000230876.8-I 665 1 0 664
7: ENST00000631211.1 339 313 0 26
8: ENST00000629969.1 314 313 0 1
9: ENSG00000137571.11-I1 182 0 168 14
10: ENSG00000268981.5-I 169 0 117 52
11: ENSG00000008282.9-I2 78 0 0 78
12: ENSG00000078328.23-I 53 3 3 47
13: ENSG00000259053.1-I 50 3 1 46
14: ENSG00000140025.16-I 49 3 1 45
15: ENSG00000015568.13-I8 47 12 12 23
16: ENSG00000236332.2-I1 32 4 0 28
17: ENSG00000150275.20-I 31 0 0 31
18: ENSG00000198586.14-I 30 0 0 30
19: ENSG00000187323.12-I2 29 3 0 26
20: ENSG00000153233.13-I 28 1 0 27
Ok, so this suggests that most of these reads are mapping to > 1 distinct gene (and therefore being discarded). Also, most of them are mapping to (what looks like) the same subset of distinct genes. This doesn't quite explain why those counts don't show up in cr-like-em, but let's table that for a moment.
When a read maps to multiple locations, the rules for how it are resolved depend on what those locations are. If they are transcripts of the same gene (spliced) it will be assigned to that gene. If they are spliced and unspliced regions of the same gene, it will be assigned to that gene (as an ambiguous UMI). However, if the mapping is to transcripts of different genes, and (as @csoneson noted) neither gene has a larger number of reads displaying that UMI, then the UMI is not uniquely resolvable to a gene and will be discarded. The top two rows, at least, seem to fall into this category as there are two distinct lncRNA to which the reads map equally well.
So overall, one observation here is that, for these cells, it looks like the vast majority of mappings come from only a handful of genes (perhaps as few as 2) where almost all of the reads map to more than one gene.
Ok, so I think I'm happy with how the mapping works, and why we end up excluding those reads.
The thing that's confusing me is why we end up with some barcodes mapping just to those two genes... I've just checked, and it's not just most of the reads, for almost all of the top 40 barcodes it's almost exclusively those two genes:
bc N prop_top2
1: CB:TGCCCTCCAATGGATC 166124 0.999
2: CB:CCGACTTTCGTTCTTG 142004 0.999
3: CB:GGTGCGTCTTGCGCCG 113253 0.999
4: CB:CGTCACTACCTCCCCG 108055 0.999
5: CB:AGGCCGTCCCTCTTAA 105101 0.999
6: CB:GGTCCGTCTTGCGCCG 98084 0.999
7: CB:TGGCGAATGCTTTCGC 98432 0.999
8: CB:ATCATGGCCTCAGTTC 87362 0.999
9: CB:TGGCAAATGCTTTCGC 84189 0.999
10: CB:CCCTCCTAATCATGGC 75437 0.999
11: CB:GATCGTCTTCGAACCT 211216 0.998
12: CB:GAGTCCGTCTTGCGCC 278231 0.998
13: CB:TGGCGCAATACGAATG 94752 0.998
14: CB:TTCGGTCTTGATTAAT 629498 0.997
15: CB:TGCCCTACAATGGATC 117496 0.996
16: CB:GTTACGACTTTTACTT 101171 0.996
17: CB:CGACTTTTACTTCCTC 149619 0.995
18: CB:GTCTTCGAACCTCCGA 136971 0.995
19: CB:GCAATCATACTCCCCC 130063 0.995
20: CB:GCTCTGGTCCGTCTTG 87547 0.995
21: CB:ATCCACCAACTAAGAA 219455 0.994
22: CB:TCAGCTTTGCAACCAT 93262 0.994
23: CB:CAACCATACTCCCCCC 93048 0.994
24: CB:GTAGGCCCCGCTTTCA 88253 0.994
25: CB:GTCTCGTTCGTTATCG 285555 0.993
26: CB:CGTTCGTTATCGGAAT 483959 0.992
27: CB:ACGTCAATTCCTTTAA 120961 0.992
28: CB:GCCTCTAGATAGTCAA 89414 0.991
29: CB:CCTCAGTTCCGAAAAC 88648 0.990
30: CB:GCGACCATACTCCCCC 88433 0.985
31: CB:TGAGCCGCAGGCTCCA 120459 0.984
32: CB:TGACTTTTACTTCCTC 83973 0.984
33: CB:CTCGGGTGGCTGAACG 60776 0.971
34: CB:AATCGGTAGTAGCGAC 92626 0.968
35: CB:GTGCGGTGTGTACAAA 153338 0.959
36: CB:AAGGCAGGGACTTAAT 69945 0.949
37: CB:GCACTTACTGGGAATT 161946 0.910
38: CB:TTGCCGTATCGTTCCG 107710 0.811
39: CB:GACCTGGGCGGGATTC 99845 0.782
40: CB:CTCGTACTGAGCAGGA 230899 0.680
What could be going on here, so that for a small but non-zero proportion of droplets they only map to these two v similar genes?
[just saw you had also noticed this in your edited comment!]
Doesn't look like using the cellranger reference actually makes much difference. Here is the genes vs barcodes plot for the same sample, mapped to the cellranger reference:

Hi @wmacnair,
Thanks for all of the useful diagnostic info. Regarding:
The thing that's confusing me is why we end up with some barcodes mapping just to those two genes... I've just checked, and it's not just most of the reads, for almost all of the top 40 barcodes it's almost exclusively those two genes:
The only way to really dig down into that is to try to look at the sequencing reads themselves that are inducing those mappings. Since the RAD file encodes where the reads maps but, for purposes of keeping the files compact and fast to parse, doesn't write down the reads themselves (like a SAM file). If you really want to track this down, you could certainly go back to the original FASTQ files, grep for the barcode of interest in the technical read to extract the read names, and then pull those read names out of the biological read file. Then you would have the actual sequence reads that are inducing these mappings.
I just remembered that I still had this issue open, apologies...!
I've reread through the thread, and I remain confused about what was going on in my samples. I have moved on by (a) using the default 10x reference, rather than rolling my own from the newest Gencode data, and (b) using simpleaf.
It doesn't seem like much was gained from the discussion, apart from teaching me some basics of mapping 😅 However, reading through again it was clear to me that it would be great to have a small, purpose-built standalone R package for loading simpleaf / alevin-fry outputs into R.
At the moment you suggest using fishpond::loadFry. This works fine, but I find it a bit unsatisfying, partly because it's not the main purpose of the package, but also some additional functionality would be great.
For example, I would love to see a package, e.g. simpleafR, that does:
- loading processed data from multiple directories in one function call
- excluding barcodes with zero reads
- produce either
sceorSeuratobjects as outputs - (perhaps an option would be to combine this with
alevinQC?)
I'm aware that you must have a very long list of things to do, and you've probably already considered this as something you may do. So just consider this as an additional point of support for if and when you get it :)
Thanks! Will
Hi @wmacnair,
Thanks for the thoughtful suggestions! Regarding alevinQC, we actually have another project in the works that is designed to make an even easier-to-use QC report. It's based on scran.js and bkana and so it runs entirely in the browser :). More on that soon.
Regarding your other suggestion, I'd like to pull @mikelove and @DongzeHE into this discussion. That is, I know that Mike likes to think of fishpond as a general catch-all package for lots of salmon/alevin/alevin-fry related functionality. However, if you think it would be useful to have a more focused / dedicated package for loading and doing basic processing of simpleaf output in R, then I think this is worth a broader discussion. I'll open a separate discussion item about this :).
Best, Rob
We have some stuff in fishpond for import and processing, and welcome to consider other things to go in there. Just need docs and tests really. There’s even some new stuff for working with EC counts.
Recently we took out non-R code to move to a separate package for ease of installation. EDS package on Bioc (Avi’s) now has the original alevin data storage format which is read in with C++.