cellsnp-lite icon indicating copy to clipboard operation
cellsnp-lite copied to clipboard

cellsnp-lite output not working with Vireo - reporting really high read depths

Open dannyconrad opened this issue 3 years ago • 1 comments

I have a scATACseq dataset where I'm trying to use the mitochondrial reads to demultiplex two mouse strains. When I run cellSNP and cellsnp-lite in mode 2 using what I believe to be matching parameters and the same input files, both identify all of the same SNPs, but the read depths reported in cellSNP.base.vcf.gz and the sparse matrices are different; the depths reported by cellsnp-lite are ~100X greater. When I run vireo on these two different outputs, the cellSNP results seem to properly demultiplex my donors, whereas the cellsnp-lite results leave all cells unassigned. Importantly, I noticed in the donor_ids.tsv file for the cellsnp-lite vireo run, that all cells are reported to have the same results:

cell	donor_id	prob_max	prob_doublet	n_vars	best_singlet	best_doublet	doublet_logLikRatio
AAACGAAAGGTCGTTT-1	unassigned	4.76e-01	4.79e-02	370	donor0	donor0,donor1	0.000

I guess I'm not sure if this is a bug with cellsnp-lite or if I did something drastically different running the two programs without realizing it. Any ideas?

Here are the two batches of code I ran and the relevant console outputs:

cellsnp-lite

cellsnp-lite \
-s ./chrM_CB.bam \
-b ./barcodes_mt_revcomp.tsv \
-O ./cellsnp-lite_DeNovo \
-p 40 \
--minMAF 0.1 \
--UMItag None \
--chrom M
--exclFLAG UNMAP,SECONDARY,QCFAIL
[I::main] start time: 2021-11-24 11:49:27
[W::check_args] Max depth set to maximum value (2147483647)
[I::main] mode 2a: pileup 1 whole chromosomes in 4787 single cells.
[W::csp_pileup_core] Combined max depth is above 1M. Potential memory hog!
[I::csp_pileup_core][Thread-0] processing chrom M ...
[I::csp_pileup_core][Thread-0] has pileup-ed in total 370 SNPs for chrom M
[I::main] All Done!
[I::main] end time: 2021-11-24 13:12:30
[I::main] time spent: 4983 seconds.

less cellsnp-lite_DeNovo/cellSNP.base.vcf.gz | head 
##fileformat=VCFv4.2
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
M	55	.	T	G	.	PASS	AD=146211;DP=489229;OTH=247
M	165	.	C	A	.	PASS	AD=142350;DP=607292;OTH=268
M	348	.	T	C	.	PASS	AD=103804;DP=350149;OTH=251

vireo \
-c ./cellsnp-lite_DeNovo/ \
-N 2 \
-o ./vireo/cellsnp-lite_DeNovo \
-p 40
[vireo] Loading cell folder ...
[vireo] Demultiplex 4787 cells to 2 donors with 370 variants.
[vireo] lower bound ranges [-114928564.4, -114928564.4, -114928564.4]
[vireo] allelic rate mean and concentrations:
[[0.01  0.284 0.99 ]]
[[5.00000000e+01 2.39958576e+08 5.00000000e+01]]
[vireo] donor size before removing doublets:
donor0	donor1
2394	2394
[vireo] final donor size:
unassigned
4787
[vireo] All done: 0 min 16.0 sec

cellSNP

cellSNP \
-s ./chrM_CB.bam \
-b ./barcodes_mt_revcomp.tsv \
-O ./cellSNP_DeNovo \
-p 40 \
--minMAF=0.1 \
--UMItag=None \
--chrom=M \
--maxFLAG=4096
[cellSNP] mode 2: pileup 1 whole chromosomes in 4787 single cells.
[cellSNP] Whole genome pileupped, now merging all variants ...
[cellSNP] 408 lines in final vcf file
[cellSNP] All done: 64 min 43.5 sec

less cellSNP_DeNovo/cellSNP.base.vcf.gz | head 
##fileformat=VCFv4.2
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chrM	55	.	T	G	.	PASS	AD=1924;DP=7534;OTH=2
chrM	165	.	C	A	.	PASS	AD=2100;DP=7740;OTH=0
chrM	348	.	T	C	.	PASS	AD=2749;DP=7402;OTH=2

vireo \
-c ./cellSNP_DeNovo/cellSNP.cells.vcf.gz \
-N 2 \
-o ./vireo/cellSNP_DeNovo \
-p 40
[vireo] Loading cell VCF file ...
[vireo] Demultiplex 4787 cells to 2 donors with 370 variants.
[vireo] lower bound ranges [-1585667.0, -1585667.0, -430338.7]
[vireo] allelic rate mean and concentrations:
[[0.038 0.907 0.938]]
[[1857003.8  450325.5  418458.7]]
[vireo] donor size before removing doublets:
donor0	donor1
1735	3052
[vireo] final donor size:
donor0	donor1	doublet	unassigned
1581	2891	286	29
[vireo] All done: 0 min 15.2 sec

dannyconrad avatar Nov 24 '21 22:11 dannyconrad

Hi, thanks for your feedback. Could you share the version of cellSNP and cellsnp-lite?

The reason why the read counts are different: cellSNP (actually the dependent pysam) has a default limitation that the max_depth (i.e., max pileup-ed read count) is 8000, so you may check that every DP<=8000 in the cellSNP output. But cellsnp-lite does not have this max_depth limitation, it will pileup as many reads as possible. (we are trying to add a --maxPileup option with similar function as max_depth, for the next release of cellsnp-lite)

The reason why cellsnp-lite+vireo does not work well: The large read counts in cellsnp-lite output makes it more likely for vireo to reach local optima so that the parameters of two donors become the same and hence vireo cannot assign the cells to certain donor.

Besides, vireo is designed for nuclear SNVs. For mito SNVs, you may want to try this tutorial, which was used by MQuad. Note that the duplicate reads should probably be removed beforehand, as there are not UMIs in your case.

Xianjie

hxj5 avatar Nov 25 '21 04:11 hxj5