chromap icon indicating copy to clipboard operation
chromap copied to clipboard

low map rate for 10X scATAC-seq

Open badoi opened this issue 3 years ago • 24 comments

I'm aligning 10x genomics scATAC-seq to the rheMac10 genome. The sequencing is 50bp for R1 and R2 paired-end sequencing on the Novaseq. The genome was built with k17 and w7 for min frag 30. But the alignment rate is 7% which is too low. For 10x scATAC-seq datasets, how was the genome built to better map fragments as low as 20 or 30 bp after index trimming? Are there any other advice to work with this short of sequencing?

In a different dataset where the sequencing strategy was 100bp for R1 and R2, the same macaque index aligned 91% of the reads.

Mapped all reads in 55348.73s.
Number of reads: 1132465712.
Number of mapped reads: 77938628.
Number of uniquely mapped reads: 72568652.
Number of reads have multi-mappings: 5369976.
Number of candidates: 781477061.
Number of mappings: 77938628.
Number of uni-mappings: 72568652.
Number of multi-mappings: 5369976.
Number of barcodes in whitelist: 2890.
Number of corrected barcodes: 41349117.
Sorted, deduped and outputed mappings in 17.12s.

chromap_1741011_1_out.txt

badoi avatar Jul 02 '21 19:07 badoi

Can you share the command line you used to run Chromap? It seems that your barcode whitelist file does not match the barcodes in your data. So most of the reads are dropped since their barcodes are not on the whitelist.

haowenz avatar Jul 03 '21 02:07 haowenz

The command I used was the following. I'll check but I remember that 10x only had 1 ATAC barcode list, which is the list I provided. The issue first is that only 77938628/ 1132465712 reads mapped, which is ~6% of reads.

BARCODE_LIST=/home/bnphan/src/cellranger-atac-1.2.0/cellranger-atac-cs/1.2.0/lib/python/barcodes/737K-cratac-v1.txt
GENOME_FASTA=/home/bnphan/resources/genomes/rheMac10/rheMac10.fa
CHROMAP_IDX=/home/bnphan/resources/genomes/chromap/rheMac10/index
ALIGN_BED=${SAMPLE}.aln.bed

~/src/chromap-0.1_x64-linux/chromap --preset atac \
-x $CHROMAP_IDX -r ${GENOME_FASTA} -t 8 \
--remove-pcr-duplicates-at-cell-level \
--bc-error-threshold 2 \
-1 ${FQ1} -2 ${FQ2} -b ${BC1} -o ${ALIGN_BED} \
--barcode-whitelist ${BARCODE_LIST}

badoi avatar Jul 03 '21 03:07 badoi

Chromap will check if the barcode is in the whitelist, if not then it tries to correct the barcode. And if the barcode is still not in the whitelist, it will skip mapping the reads. Can you try a run with the parameters you used and also with "--output-mappings-not-in-whitelist" to see if the mapping rate is still low?

Moreover, can I know the 10X assay you used? Single Cell ATAC or Single Cell Multiome? They have different barcode whitelists. Also the time looks weird. Is it possible for you to share 1M read pairs so we may have a look?

haowenz avatar Jul 03 '21 03:07 haowenz

Thanks I'll run with that parameter. I tried aligning over the weekend with K=15 and W=7 and got a lower map rate, so I'll try to go for higher K' and W' indices then with your parameter to see if the barcode whitelist is the issue.

Our collaborators used the scATAC protocol. They haven't tried the Multiome yet.

chromap_1741095_1_out.txt

badoi avatar Jul 05 '21 16:07 badoi

Based on the two lines: Number of barcodes in whitelist: 2890. Number of corrected barcodes: 41349117.

It seems only a tiny fraction of reads' barcodes are in the whitelist. Could you please check whether the barcode used in the scATAC-seq data and the barcode whitelist are matched? Thank you.

mourisl avatar Jul 05 '21 22:07 mourisl

Hi all, I reached out to 10x and they advised that some sequencers provide the reverse complement of the barcodes, so I made a rev-comp whitelist. I get more hits w/ that list, and rerunning now w/ an index built with K20 W7. That should fix both problems of low alignment rate and not enough matches w/ the whitelist. I'll confirm this is indeed correct once the run finishes.

For others who find it useful, the unix command to make a rev comp of the whitelist:

cat 737K-cratac-v1.txt | tr ACGT TGCA | rev | sort > 737K-cratac-v1_revcomp.txt

badoi avatar Jul 07 '21 14:07 badoi

Please also try a run with index built with "-k 17 -w 7", which is default. It may give you the better results. Increasing k seems not necessary to me.

haowenz avatar Jul 07 '21 15:07 haowenz

The initial runs were with an index built w/ K17 W7, which produced very low map rates.

badoi avatar Jul 07 '21 15:07 badoi

Did you try "--output-mappings-not-in-whitelist" with "k17 w7" and get low mapping rate? Or you can just remove "-b" and run your data as bulk ATAC-seq data and see if the reads map.

It seems to me that you got you low mapping rate in the initial run since your barcode whitelist did not have the reverse complement of the barcodes. Those reads are dropped without even trying the mapping procedure. In this case you will always get low mapping rate, and the value of k and w have no effect on the results. When you use the right barcode whitelist file, the default parameters are supposed to work. Otherwise, there might be other problems.

You can try larger k if you want. But make sure you use an odd value for k (e.g, 19, 21).

haowenz avatar Jul 07 '21 16:07 haowenz

@mourisl I think we should add a new mapping stats saying the number of reads skipped since their barcodes are not on the whitelist. This would help the users and us to identify the problem in the data. What do you think?

haowenz avatar Jul 07 '21 16:07 haowenz

Yes. We should be more clear that Chromap will skip the reads whose barcodes are not in the whitelist, at least in the README. I think the line "Number of barcodes in whitelist: 2890. Number of corrected barcodes: 41349117." serves the purpose, but the user might not know that the other reads are filtered by default.

mourisl avatar Jul 07 '21 16:07 mourisl

Now that you mentioned all of that, I think I get the error now. I aligned w/ the default index k17 w7, with the reverse complement barcode list and looks like it's working well. Thanks for your feedback, and I agree that a bit clearer mapping statistics could help warn about this quirk in the future.

Number of reads: 1096959080.
Number of mapped reads: 992840174.
Number of uniquely mapped reads: 917016382.
Number of reads have multi-mappings: 75823792.
Number of candidates: 9525419205.
Number of mappings: 992840174.
Number of uni-mappings: 917016382.
Number of multi-mappings: 75823792.
Number of barcodes in whitelist: 482067912.
Number of corrected barcodes: 44422687.
Sorted, deduped and outputed mappings in 346.11s.
# uni-mappings: 213743382, # multi-mappings: 28074758, total: 241818140.
Number of output mappings (passed filters): 207028528
Total time: 32917.30s.

badoi avatar Jul 08 '21 01:07 badoi

Thank you @badoi for posting the reverse-complement tip. I was having a similar map rate issue and using the revcomp whitelist fixed it for me 👍

Just as a suggestion to the authors/developers, this would be a good thing to make a note of somewhere in the README (unless you have and I missed it), since this will be a problem for anyone aligning scATACseq reads sequenced via the reverse complement workflow (including NovaSeq w/ v1.5 reagents, HiSeq 4000, NextSeq, etc.).

Additionally, to ease some of the confusion around how mapping rates are achieved, it might make sense to order the final alignment statistics in the order that they are prioritized for filtering. Without the above thread, I also would not have known that only reads with a confirmed cell barcode are used for mapping (though of course, that makes a lot of sense for efficiency), since that stat is one of the last reported.

dannyconrad avatar Nov 17 '21 21:11 dannyconrad

Thank you so much for the suggestions! They make a lot of sense. Li and I will take care of this.

haowenz avatar Nov 18 '21 00:11 haowenz

I should also add that I am blown away with how fast Chromap is, by the way. I couldn't believe my eyes when you reported 16X speed enhancement over cellranger-atac, yet here I am aligning 5 different single-cell libraries in a single workday. Excellent work :)

dannyconrad avatar Nov 18 '21 01:11 dannyconrad

Thank you for the suggestions and the information on the reverse-complement barcodes. We just added a check of whether the barcodes from the data are too different from the whitelist. If they are too different, Chromap will stop early and warn the user about the whitelist issue. So you don't need to wait until the end to find that the whitelist needs to be modified. Hope this will be helpful.

mourisl avatar Nov 18 '21 18:11 mourisl

Excellent! That's a much more elegant solution, since it should also be useful in cases where users mistakenly use the wrong barcode whitelist. Thanks so much for addressing that so quickly.

dannyconrad avatar Nov 18 '21 22:11 dannyconrad

@mourisl @haowenz , does it make sense to add another option (e.g. '--barcode-reversion') to reverse complement barcode FASTQ? As @dannyconrad mentioned, Illumina's new machines (e.g. NextSeq, NovaSeq) implement the reverse complement workflow, in which case index2 (barcode read for 10x) will be reverse complemented. I expect that having reverse complement of the barcode will be a very common use case. Reversing the white list is not an elegant solution and may introduce problems for 10x multiome (how to translate reverse complemented atac barcodes to RNA barcodes?).

Please let me know what you think.

Best, Bo

bli25 avatar Dec 23 '21 01:12 bli25

I am planning to add the strand information to the option "--read-format" so Chromap will reverse the barcode or sequence when reading in the data. It should be done in the next a few days. Thanks for the suggestions!

mourisl avatar Dec 23 '21 16:12 mourisl

@mourisl, awesome, thanks a lot for keep developing this great software!

bli25 avatar Dec 23 '21 22:12 bli25

We have extended the fields in the read-format option, and you can set the option like "--read-format bc:0:-1:-" to specify the reverse-complement barcode now.

mourisl avatar Dec 24 '21 15:12 mourisl

@mourisl ,

I detected one bug related to this new feature. When I set '--read-format bc:0:-1:-', I still got the following error message:

Less than 5% barcodes can be found or corrected based on the barcode whitelist. Please check whether the barcode whitelist matches the data, e.g. length, reverse-complement. If this is a false positive warning, please run Chromap with the option --skip-barcode-check.

bli25 avatar Feb 02 '22 17:02 bli25

Thanks for the testing! I will check the implementation.

mourisl avatar Feb 02 '22 17:02 mourisl

@bli25 Hi Bo, yes, indeed there is a bug at the checking stage. I've fixed that in the "custom_readformat" branch, could you please give it a try? Thank you.

mourisl avatar Feb 05 '22 18:02 mourisl