vireo
Demultiplexing multiomic sequencing data
Hi @huangyh09,
First of all, huge thanks for developing Vireo! I've been testing it on a synthetic pool (3 donors), and with scRNA-seq data alone I've noticed a high number of unassigned cells, particularly from one donor. I found a potential solution, combining scRNA and scATAC data to increase coverage, in https://github.com/single-cell-genetics/vireo/issues/39#issuecomment-1027488456, which says: "... you can use bcftools concat if you have *cells.vcf.gz (by using --genotype in cellsnp-lite). Alternatively, you may try combining the sparse matrices directly."
So I tried:
- Ran cellsnp-lite on the scRNA and scATAC data separately, with --genotype
- Sorted and indexed the two cellSNP.cells.vcf.gz files generated in the first step:
# scRNA
bcftools sort \
-m 2G \
-o ./scRNA/cellSNP.cells.vcf.sort.gz \
-O z9 \
-T TMP_DIR \
--write-index \
./scRNA/cellSNP.cells.vcf.gz
# scATAC
bcftools sort \
-m 2G \
-o ./scATAC/cellSNP.cells.vcf.sort.gz \
-O z9 \
-T TMP_DIR \
--write-index \
./scATAC/cellSNP.cells.vcf.gz
- Concatenated the two cellSNP.cells.vcf.sort.gz files:
bcftools concat \
--allow-overlaps \
-o ./scRNA_scATAC/cellSNP.cells.vcf.gz \
-O z9 \
--threads 32 \
./scRNA/cellSNP.cells.vcf.sort.gz \
./scATAC/cellSNP.cells.vcf.sort.gz
- Ran Vireo on the concatenated cellSNP.cells.vcf.gz file:
vireo \
-c ./scRNA_scATAC/cellSNP.cells.vcf.gz \
-N 3 \
-o ./scRNA_scATAC/sd1 \
--randSeed=1 \
-p 16
When I ran Vireo separately on the scRNA and scATAC data (providing the cellsnp-lite output folders rather than the cellSNP.cells.vcf.gz files), it worked well and usually finished in under 20 minutes. However, when I demultiplexed using the combined cellSNP.cells.vcf.gz file, it ran for several hours and finally failed with the following error:
[vireo] Loading cell VCF file ...
[vireo] Demultiplex 18491 cells to 3 donors with 908898 variants.
Traceback (most recent call last):
File "/projects/Installs/python_virtualenv/vireo/bin/vireo", line 8, in <module>
sys.exit(main())
File "/projects/Installs/python_virtualenv/vireo/lib/python3.7/site-packages/vireoSNP/vireo.py", line 209, in main
nproc=options.nproc)
File "/projects/Installs/python_virtualenv/vireo/lib/python3.7/site-packages/vireoSNP/utils/vireo_wrap.py", line 76, in vireo_wrap
pool = multiprocessing.Pool(processes = nproc)
File "/linux-x86_64-centos7/python-3.7.2/lib/python3.7/multiprocessing/context.py", line 117, in Pool
from .pool import Pool
File "/linux-x86_64-centos7/python-3.7.2/lib/python3.7/multiprocessing/pool.py", line 17, in <module>
import queue
File "/linux-x86_64-centos7/python-3.7.2/lib/python3.7/queue.py", line 16, in <module>
from _queue import Empty
ImportError: /linux-x86_64-centos7/python-3.7.2/lib/python3.7/lib-dynload/_queue.cpython-37m-x86_64-linux-gnu.so: failed to map segment from shared object: Cannot allocate memory
I'm hoping you could give me some suggestions:
- Did I do it correctly?
- Could you please provide more details on "Alternatively, you may try combining the sparse matrices directly"?
- What's the best approach to combine scRNA and scATAC for demultiplexing?
- Do you think combining scRNA and scATAC data can also improve doublet detection?
Thanks a lot for your time!
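For reference, one possible reading of "combining the sparse matrices directly" is to sum the per-cell allele counts from the two cellsnp-lite output folders instead of concatenating VCFs. The sketch below is only an illustration in plain Python, not Vireo's or cellsnp-lite's own method. It assumes that the .mtx files are in Matrix Market coordinate format with variants as rows and cells as columns, that a variant can be identified by (chrom, pos, ref, alt) read from cellSNP.base.vcf.gz, and that the same cell carries the same barcode in both modalities. The helper names and the toy data are hypothetical.

```python
from collections import defaultdict


def read_mtx_entries(lines):
    """Parse Matrix Market coordinate text into {(row, col): count}, 0-based."""
    entries = {}
    size_line_seen = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner and comment lines
        if not size_line_seen:
            size_line_seen = True  # first data line is "n_rows n_cols n_entries"
            continue
        row, col, value = line.split()
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    return entries


def combine_counts(datasets):
    """Sum counts across modalities, keyed by (variant_id, barcode).

    Each dataset is (variant_ids, barcodes, entries), where entries comes
    from read_mtx_entries. Assumes rows are variants and columns are cells.
    """
    combined = defaultdict(int)
    for variant_ids, barcodes, entries in datasets:
        for (row, col), value in entries.items():
            combined[(variant_ids[row], barcodes[col])] += value
    return dict(combined)


# Toy data standing in for one count matrix (e.g. AD) from each modality.
rna_variants = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")]
rna_cells = ["AAAC-1", "AAAG-1"]
rna_mtx = ["%%MatrixMarket matrix coordinate integer general", "2 2 2", "1 1 3", "2 2 1"]

atac_variants = [("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")]
atac_cells = ["AAAC-1", "AAAT-1"]
atac_mtx = ["%%MatrixMarket matrix coordinate integer general", "2 2 2", "1 1 5", "2 1 2"]

combined = combine_counts([
    (rna_variants, rna_cells, read_mtx_entries(rna_mtx)),
    (atac_variants, atac_cells, read_mtx_entries(atac_mtx)),
])

# Re-index onto the union of variants and cells for writing back out.
variants = sorted({variant for variant, _ in combined})
cells = sorted({cell for _, cell in combined})
```

The same summation would be applied to both the AD and DP matrices so that they stay consistent. Writing the result back to .mtx, .vcf.gz and .tsv files is left out here, and the union of variants can still be very large, so some filtering may be needed before running Vireo.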
Hi,
It seems that the cellSNP.cells.vcf.gz file generated by concatenating the scRNA and scATAC cellSNP.cells.vcf.gz files using bcftools concat is too large (740M).
I wonder if it's possible to generate the cellSNP.tag.AD.mtx, cellSNP.tag.DP.mtx, cellSNP.base.vcf.gz, and cellSNP.samples.tsv files from the cellSNP.cells.vcf.gz file?
Thanks!
Hi, it looks like after concatenating, you got 908898 SNPs, which is quite a lot.
If your scATAC is better covered, you may consider demultiplexing just with scATAC. Also, the inferred genotype there can be used as input for demultiplexing scRNA if needed.
In either case, I have never tested these; the suggestions are based only on experience in other settings, so your results may differ.
Yuanhua
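The two-stage idea in the reply above (demultiplex scATAC first, then reuse the inferred donor genotypes for scRNA) could be sketched as follows. This only assembles the two command lines and does not run anything. The `-c`, `-N`, `-o` and `-p` options appear earlier in this thread; the `-d` (donor genotype file) and `-t` (genotype tag) options, the `GT_donors.vireo.vcf.gz` output name and the `GT` tag are assumptions that should be checked against `vireo --help` and the actual output folder. The function name and directory layout are hypothetical.

```python
import shlex


def two_stage_commands(atac_dir, rna_dir, out_root, n_donors, n_proc=16):
    """Build (but do not run) the two vireo command lines as argument lists."""
    atac_out = f"{out_root}/scATAC"
    rna_out = f"{out_root}/scRNA"

    # Stage 1: demultiplex scATAC without known donor genotypes.
    stage1 = ["vireo", "-c", atac_dir, "-N", str(n_donors),
              "-o", atac_out, "-p", str(n_proc)]

    # Assumed name of the donor genotype VCF written by stage 1.
    donor_vcf = f"{atac_out}/GT_donors.vireo.vcf.gz"

    # Stage 2: demultiplex scRNA using the genotypes inferred in stage 1.
    # The "-t GT" tag is an assumption; it must match the VCF's FORMAT field.
    stage2 = ["vireo", "-c", rna_dir, "-d", donor_vcf, "-t", "GT",
              "-N", str(n_donors), "-o", rna_out, "-p", str(n_proc)]
    return stage1, stage2


stage1, stage2 = two_stage_commands("./scATAC", "./scRNA", "./two_stage", 3)
for command in (stage1, stage2):
    print(shlex.join(command))  # shell-quoted form, ready to review or paste
```

Because both stages use the per-modality cellsnp-lite folders, this route avoids loading the large concatenated VCF that triggered the memory error.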