vireo
Large number of `doublet` and `unassigned`, most lost from specific donor(s), after removing doublets
Hi,
I'm running `cellsnp-lite` (`v1.2.0`) and `vireo` (~~`v0.5.5`~~ `v0.5.0` from pip) to demultiplex a 3' 10X scRNA-seq dataset with cells from 8 human donors.
These cells were pooled and run over 3 separate 10x captures, so I have appended a unique capture-specific suffix to the `CB` cell barcode in the BAM file using `appendCB` (https://github.com/ruqianl/appendCB) to distinguish potential duplicate cell barcodes across captures.
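To illustrate the idea behind the suffixing, here is a minimal sketch (`suffix_barcodes` is an illustrative helper, not part of `appendCB`, which performs the equivalent edit on the `CB` tag inside the BAM):

```python
# Toy sketch: the same 10x barcode can occur in more than one capture,
# so a capture-specific suffix keeps barcodes unique after merging.
def suffix_barcodes(barcodes, suffix):
    return [bc + suffix for bc in barcodes]

cap1 = suffix_barcodes(["AAACCTGAGCGT-1", "AAACGGGTCAAG-1"], "_cap1")
cap2 = suffix_barcodes(["AAACCTGAGCGT-1"], "_cap2")  # same barcode as in cap1
merged = cap1 + cap2
assert len(set(merged)) == len(merged)  # no collisions after suffixing
```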
The donors are not evenly pooled; I expect 4 donors to have a similar, larger number of cells and the other 4 donors to have a similar, smaller number of cells.
I run `cellsnp-lite` + `vireo` roughly as follows:
```shell
REGIONSVCF="genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz"

cellsnp-lite --samFile ${BAM} \
             --outDir ${CELLSNPOUTDIR} \
             --regionsVCF ${REGIONSVCF} \
             --barcodeFile barcodes.txt \
             --nproc 20 \
             --minMAF 0.1 \
             --minCOUNT 20 \
             --gzip

vireo --cellData=${CELLSNPOUTDIR} \
      --nDonor=8 \
      --outDir=${VIREOOUTDIR}
```
```
[vireo] Loading cell folder ...
[vireo] Demultiplex 62746 cells to 8 donors with 182385 variants.
[vireo] lower bound ranges [-13206657.4, -11809108.8, -11654733.3]
[vireo] allelic rate mean and concentrations:
[[0.028 0.451 0.941]]
[[13738413.9 12786890.7 5384179.4]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7
 10645    5468    2069   11714   13229     740   15329    3553
[vireo] final donor size:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7  doublet  unassigned
  9380    3082      31   10633   11835     120   14287    3186     8627        1565
```
You can see in the final few lines that lots of cells are 'lost' following the removal of doublets, which is particularly noticeable for donor1, donor2, and donor5 as a percentage of the initial assignments.
I'm looking for suggestions on how I can debug this, and for possible causes of this behaviour. This is not the only dataset for which I've seen this issue, and I can share some other examples if that is helpful.
For the most part I've had a lot of success with `vireo`, so thank you for developing it!
Cheers, Pete
Hi Pete,
Thanks for sharing your experience. The results indeed don't look convincing. I noticed that the cell number is quite high, which may make it easier for the model to get stuck at local optima. One quick thing you can try is to increase the number of random initializations via `--nInit`, e.g., to 200. You can also use `-p` to increase the number of processes for parallel computing, though it will use more memory.
Let me know how it goes. Yuanhua
Thanks, Yuanhua.
I'm re-running with `--nInit 200` and I'll report back the results.
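For reference, the re-run looks roughly like this (a sketch using the same path placeholders as the original command; the `_nInit200` output directory name is just my convention, and flag support may differ between `vireo` versions):

```shell
# Re-run vireo with more random initializations to reduce the chance
# of stopping at a local optimum; paths are the same placeholders as above.
vireo --cellData=${CELLSNPOUTDIR} \
      --nDonor=8 \
      --outDir=${VIREOOUTDIR}_nInit200 \
      --nInit 200
```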
Increasing to `--nInit 200` gave me:
```
[vireo] Loading cell folder ...
[vireo] Demultiplex 62746 cells to 8 donors with 182385 variants.
[vireo] lower bound ranges [-13267522.5, -11771899.8, -11602801.9]
[vireo] allelic rate mean and concentrations:
[[0.031 0.452 0.934]]
[[14038312.6 12274692.6 5596478.8]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7
  2777    3722   11969    1737   11350    1861   16069   13261
[vireo] final donor size:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7  doublet  unassigned
  2284    3205   10634    1478    9410      36   14445   11851     8455         948
[vireo] All done: 498 min 4.3 sec
```
Increasing to `--nInit 1000` gave me:
```
[vireo] Loading cell folder ...
[vireo] Demultiplex 62746 cells to 8 donors with 182385 variants.
[vireo] lower bound ranges [-13272912.3, -11791179.9, -11605879.4]
[vireo] allelic rate mean and concentrations:
[[0.027 0.45 0.937]]
[[13773588.1 12622406.3 5513489.7]]
[vireo] donor size before removing doublets:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7
 15976    2273    4152    1516   10538   11619   13133    3538
[vireo] final donor size:
donor0  donor1  donor2  donor3  donor4  donor5  donor6  donor7  doublet  unassigned
 14430      25    3352     126    9379   10638   11839    3185     8578        1194
```
I'm a little concerned that varying `--nInit` changes the results for 'rare' donors so much. E.g., although I might like to believe the `--nInit 200` results because only 1 donor has < 1000 cells, I would naively expect the `--nInit 1000` results to be more accurate/precise/reliable.
Aside: the version of `vireo` I have (`v0.5.5` installed from https://pypi.org/project/vireoSNP/) does not have the `-p`/`--nproc` option (it's not documented when `vireo --help` is run, and setting it leads to an error: `vireo: error: no such option: --nproc`).
This discordance between `--nInit 200` and `--nInit 1000` is indeed concerning. You meant these 62746 are from 3 captures, so each has ~21K cells? Even so, the number of doublets could still be high and some donors may become minor, both of which increase the challenge of donor identification.
One possible way is to perform the demultiplexing on each of the three sets separately, and then link them. In each set, the doublet rate may not be so high.
For the version of `vireo`, I guess you are still using an older version (you can type `vireo` in a terminal to check). PyPI sometimes fails to replace the older version, especially if it was initially installed from GitHub. v0.5.5 should support `--nproc`.
Yuanhua
> You meant these 62746 are from 3 captures, so each has ~21K cells?
Yes, that's correct.
> One possibly way is to perform the demultiplexing on each of the three sets separately, and link them. So in each set, the doublets may not be so high.
This is how I initially performed the demultiplexing. However, I was unsure how best to link the donors between captures. Could you please suggest how to do this using the vireo output (e.g., which files, fields, etc. to parse and link)?
> For the version of vireo, I guess you are still using an older version (you can type vireo in terminal). The PyPI sometimes fails to replace the older version, especially installed from github initially. In the v0.5.5, it should support --nproc.
Oh, that's surprising to me, but you're correct, I am running `vireo v0.5.0`:

```
(base) vc7-shared 501 % vireo
Welcome to vireoSNP v0.5.0!
```
Good to know. You can use `vireoSNP.vcf.match_VCF_samples` to align the output donor genotype files `GT_donors.vireo.vcf.gz`. Example usage here.
The genotype similarity between samples estimated from two batches can be used to diagnose the deconvolution.
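For the linking step, one way to see what such a match does (a toy sketch of the idea, not vireo's actual code) is to build a donor-by-donor genotype-similarity matrix between the two batches' estimated genotypes and solve the resulting assignment with the Hungarian algorithm. Here synthetic dosage matrices stand in for genotypes parsed from the two `GT_donors.vireo.vcf.gz` files:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Synthetic stand-ins for parsed genotypes: rows = shared variants,
# columns = donors, values = ALT allele dosage (0/1/2). In practice these
# would come from each batch's GT_donors.vireo.vcf.gz, restricted to the
# variants present in both files.
rng = np.random.default_rng(0)
n_var, n_donor = 500, 8
batch1 = rng.integers(0, 3, size=(n_var, n_donor))
perm = rng.permutation(n_donor)            # unknown donor relabelling
batch2 = batch1[:, perm].copy()
noise = rng.random(batch2.shape) < 0.05    # ~5% genotyping noise
batch2[noise] = rng.integers(0, 3, size=noise.sum())

# Similarity = fraction of variants with identical genotype per donor pair
sim = np.array([[np.mean(batch1[:, i] == batch2[:, j])
                 for j in range(n_donor)] for i in range(n_donor)])

# Hungarian matching maximizes total similarity (minimize the negative):
# batch1 donor i corresponds to batch2 donor col[i]
row, col = linear_sum_assignment(-sim)
```

With real data you would fill `batch1`/`batch2` by parsing the two VCFs over their shared variants (with `vireoSNP`'s VCF helpers or any VCF reader); inspecting `sim` directly also shows whether any donor pair is ambiguous, which is the diagnostic use mentioned above.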