Demultiplexing multiomic sequencing data

Open DHelix opened this issue 3 months ago • 2 comments

Hi @huangyh09,

First of all, huge thanks for developing Vireo! I've been testing it on a synthetic pool (3 donors), and with scRNA-seq data alone I've noticed a high number of unassigned cells, particularly from one donor. I found a potential solution, which is to combine the scRNA and scATAC data to increase coverage, described in this comment (https://github.com/single-cell-genetics/vireo/issues/39#issuecomment-1027488456): "... you can use bcftools concat if you have *cells.vcf.gz (by using --genotype in cellsnp-lite). Alternatively, you may try combining the sparse matrices directly."

So I tried:

1. Ran cellsnp-lite on the scRNA and scATAC data separately, with --genotype.
2. Sorted and indexed the two cellSNP.cells.vcf.gz files generated in Step 1:

# scRNA
bcftools sort \
    -m 2G \
    -o ./scRNA/cellSNP.cells.vcf.sort.gz \
    -O z9 \
    -T TMP_DIR \
    --write-index \
    ./scRNA/cellSNP.cells.vcf.gz

# scATAC
bcftools sort \
    -m 2G \
    -o ./scATAC/cellSNP.cells.vcf.sort.gz \
    -O z9 \
    -T TMP_DIR \
    --write-index \
    ./scATAC/cellSNP.cells.vcf.gz

3. Concatenated the two cellSNP.cells.vcf.sort.gz files:

bcftools concat \
    --allow-overlaps \
    -o ./scRNA_scATAC/cellSNP.cells.vcf.gz \
    -O z9 \
    --threads 32 \
    ./scRNA/cellSNP.cells.vcf.sort.gz \
    ./scATAC/cellSNP.cells.vcf.sort.gz

4. Ran Vireo on the concatenated cellSNP.cells.vcf.gz file:

vireo \
    -c ./scRNA_scATAC/cellSNP.cells.vcf.gz \
    -N 3 \
    -o ./scRNA_scATAC/sd1 \
    --randSeed=1 \
    -p 16

When I ran Vireo separately on the scRNA and scATAC data (providing the cellsnp-lite output folders rather than the cellSNP.cells.vcf.gz files), it worked well and usually finished in under 20 minutes. However, when I demultiplexed using the combined cellSNP.cells.vcf.gz file, it ran for several hours and finally failed with the following error:

[vireo] Loading cell VCF file ...
[vireo] Demultiplex 18491 cells to 3 donors with 908898 variants.
Traceback (most recent call last):
  File "/projects/Installs/python_virtualenv/vireo/bin/vireo", line 8, in <module>
    sys.exit(main())
  File "/projects/Installs/python_virtualenv/vireo/lib/python3.7/site-packages/vireoSNP/vireo.py", line 209, in main
    nproc=options.nproc)
  File "/projects/Installs/python_virtualenv/vireo/lib/python3.7/site-packages/vireoSNP/utils/vireo_wrap.py", line 76, in vireo_wrap
    pool = multiprocessing.Pool(processes = nproc)
  File "/linux-x86_64-centos7/python-3.7.2/lib/python3.7/multiprocessing/context.py", line 117, in Pool
    from .pool import Pool
  File "/linux-x86_64-centos7/python-3.7.2/lib/python3.7/multiprocessing/pool.py", line 17, in <module>
    import queue
  File "/linux-x86_64-centos7/python-3.7.2/lib/python3.7/queue.py", line 16, in <module>
    from _queue import Empty
ImportError: /linux-x86_64-centos7/python-3.7.2/lib/python3.7/lib-dynload/_queue.cpython-37m-x86_64-linux-gnu.so: failed to map segment from shared object: Cannot allocate memory

I'm hoping you could give me some suggestions:

1. Did I do it correctly?
2. Could you please provide more details on "Alternatively, you may try combining the sparse matrices directly"?
3. What's the best approach to combine scRNA and scATAC for demultiplexing?
4. Do you think combining scRNA and scATAC data can also improve doublet detection?
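
For reference on the second question, here is a minimal sketch of one possible reading of "combining the sparse matrices directly". It is not taken from the Vireo or cellsnp-lite code and has not been tested against either tool. It assumes that the .mtx files are Matrix Market coordinate files with SNPs as rows and cells as columns, that SNPs can be keyed by (chrom, pos, ref, alt) taken from cellSNP.base.vcf.gz, and that the same cell carries the same barcode in both modalities, so counts for a shared SNP and cell can simply be summed. The function names `parse_mtx` and `merge_modalities` are hypothetical.

```python
from collections import defaultdict


def parse_mtx(lines):
    """Parse Matrix Market coordinate text.

    Returns ((n_rows, n_cols), {(row, col): value}) with 0-based indices.
    """
    shape = None
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner and comment lines
        parts = line.split()
        if shape is None:
            shape = (int(parts[0]), int(parts[1]))  # size line: rows cols nnz
            continue
        entries[(int(parts[0]) - 1, int(parts[1]) - 1)] = int(parts[2])
    return shape, entries


def merge_modalities(snps_a, cells_a, mat_a, snps_b, cells_b, mat_b):
    """Union SNPs and cells from two modalities and sum overlapping counts.

    snps_*  : list of hashable SNP keys, e.g. (chrom, pos, ref, alt)
    cells_* : list of cell barcodes (assumed comparable across modalities)
    mat_*   : {(snp_idx, cell_idx): count}, 0-based, as returned by parse_mtx
    """
    snp_index, cell_index = {}, {}
    # Register every SNP and cell first so that empty rows/columns are kept.
    for key in list(snps_a) + list(snps_b):
        snp_index.setdefault(key, len(snp_index))
    for barcode in list(cells_a) + list(cells_b):
        cell_index.setdefault(barcode, len(cell_index))

    merged = defaultdict(int)
    for snps, cells, mat in ((snps_a, cells_a, mat_a), (snps_b, cells_b, mat_b)):
        for (i, j), value in mat.items():
            merged[(snp_index[snps[i]], cell_index[cells[j]])] += value
    return list(snp_index), list(cell_index), dict(merged)
```

The same merge would have to be applied to the AD and DP matrices with identical SNP and cell orderings. If the two modalities contain different cells, the barcodes do not collide and the result is a plain union rather than a sum.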

Thanks a lot for your time!

Apr 01 '24 07:04 DHelix

Hi, it seems that the cellSNP.cells.vcf.gz file generated by concatenating the scRNA and scATAC cellSNP.cells.vcf.gz files using bcftools concat is too large (740M). Is it possible to generate the cellSNP.tag.AD.mtx, cellSNP.tag.DP.mtx, cellSNP.base.vcf.gz, and cellSNP.samples.tsv files from the cellSNP.cells.vcf.gz file? Thanks!
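
A rough sketch of what such a conversion might look like, under stated assumptions: each cell column in the VCF carries AD and DP entries named in the FORMAT column, each of them is a single integer (or "." when missing), and rows of the output matrices are variants while columns are cells. The helpers `vcf_to_sparse` and `format_mtx` are hypothetical names, not part of cellsnp-lite or Vireo, and whether Vireo accepts a folder assembled this way is untested.

```python
def vcf_to_sparse(lines):
    """Stream uncompressed VCF text lines into sparse AD/DP triplets.

    Returns (samples, variants, ad_entries, dp_entries), where the entries
    are (row, col, value) tuples with 1-based indices, as used by .mtx files.
    """
    samples, variants = [], []
    ad_entries, dp_entries = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # cell barcodes follow the 9 fixed columns
            continue
        keys = fields[8].split(":")
        ad_i, dp_i = keys.index("AD"), keys.index("DP")
        variants.append((fields[0], int(fields[1]), fields[3], fields[4]))
        row = len(variants)
        for col, cell in enumerate(fields[9:], start=1):
            values = cell.split(":")
            if len(values) <= max(ad_i, dp_i):
                continue  # truncated or fully missing entry such as "."
            ad, dp = values[ad_i], values[dp_i]
            # Assumption: AD and DP are single integers; zeros are not stored.
            if ad not in (".", "", "0"):
                ad_entries.append((row, col, int(ad)))
            if dp not in (".", "", "0"):
                dp_entries.append((row, col, int(dp)))
    return samples, variants, ad_entries, dp_entries


def format_mtx(n_rows, n_cols, entries):
    """Render (row, col, value) triplets as Matrix Market coordinate text."""
    header = [
        "%%MatrixMarket matrix coordinate integer general",
        "%",
        f"{n_rows} {n_cols} {len(entries)}",
    ]
    body = [f"{r} {c} {v}" for r, c, v in entries]
    return "\n".join(header + body) + "\n"
```

In practice the lines could come from `gzip.open(path, "rt")`, the barcodes in `samples` would go into cellSNP.samples.tsv, and the variant list into a sites-only VCF. Because the file is streamed line by line, only the non-zero counts are held in memory.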

Apr 03 '24 00:04 DHelix

Hi, it looks like after concatenating, you got 908898 SNPs, which is quite a lot.

If your scATAC data has better coverage, you may consider demultiplexing with scATAC alone. The donor genotypes inferred there can then be used as input for demultiplexing the scRNA data if needed.

In either case, I have never tested these approaches; the suggestions are based only on experience in other settings, so your results may differ.
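
The two-step idea above could be sketched roughly as follows. This only builds the command lines and does not run anything. The directory names are hypothetical, and the sketch assumes that Vireo writes the inferred donor genotypes to GT_donors.vireo.vcf.gz in its output folder and that `-d` together with `-t GT` is the way to pass them to the second run; please check both against the Vireo manual for your installed version.

```python
import shlex


def build_vireo_commands(atac_dir, rna_dir, out_root, n_donors):
    """Return two argument lists: scATAC demultiplexing, then scRNA with donor GT.

    All paths are placeholders; nothing is executed here.
    """
    atac_out = f"{out_root}/scATAC"
    rna_out = f"{out_root}/scRNA"
    # Step 1: demultiplex scATAC without known genotypes.
    step1 = ["vireo", "-c", atac_dir, "-N", str(n_donors), "-o", atac_out]
    # Step 2: reuse the donor genotypes inferred in step 1 for the scRNA cells.
    # Assumption: the inferred genotypes are written to this file name.
    donor_vcf = f"{atac_out}/GT_donors.vireo.vcf.gz"
    step2 = ["vireo", "-c", rna_dir, "-d", donor_vcf, "-t", "GT", "-o", rna_out]
    return step1, step2


if __name__ == "__main__":
    for cmd in build_vireo_commands("./scATAC", "./scRNA", "./vireo_out", 3):
        print(shlex.join(cmd))
```

One caveat with this route: the second run can presumably only use variants that appear in both the scATAC-derived genotypes and the scRNA cell data, so the overlap between the two SNP sets is worth checking first.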

Yuanhua

Apr 05 '24 03:04 huangyh09
