Smart-seq2?
I'm a little confused about how this pipeline works:
- Why does the preprocessing separate each input BAM into chromosomes? I'm trying to call variants genome-wide.
- Why does it need to scan read sequences for cell barcodes? I can see the utility of this for 10x, but I have Smart-seq2 data (demultiplexed BAMs). Germline calling produced a single VCF file containing chr20 variants separated by BAM file/cell, but somatic calling doesn't work the same way; I'm guessing it wants a bulk file?
I know this was only benchmarked on 10x, but the publication mentions it should work with Smart-seq as well. Let me know if there's anything I can do to make this run efficiently with my demultiplexed BAMs.
Hi @itslittman,
Thank you for your interest in our package!
- In the germline variant calling step, imputation is performed on the raw calls, so it's more efficient to process the data by chromosome. If you would like to get genome-wide variant calling, you can merge the individual VCF files together after processing.
- Germline variant calling is done via pseudobulk analysis, followed by imputation using the 1KG3 reference. However, in the somatic variant calling step, we aim to recover mutations at the single-cell level. This requires scanning the read sequences to identify and analyze cell barcodes.
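For the merge step mentioned above, one possible approach is `bcftools concat`, which joins VCFs that share the same samples but cover different regions. The sketch below only builds and prints the command, then runs it if `bcftools` and the inputs are present. The directory name, the `<chrom>.vcf.gz` file naming, and the output name are hypothetical placeholders, not Monopogen's actual output layout, so adjust them to whatever your run produced.

```python
import shutil
import subprocess
from pathlib import Path


def build_concat_command(vcf_dir, chromosomes, out_path):
    """Return a bcftools concat command joining per-chromosome VCFs in order."""
    # Assumed (hypothetical) naming scheme: <vcf_dir>/<chrom>.vcf.gz
    inputs = [str(Path(vcf_dir) / f"{chrom}.vcf.gz") for chrom in chromosomes]
    # -O z writes bgzip-compressed VCF, -o sets the output file
    return ["bcftools", "concat", "-O", "z", "-o", str(out_path), *inputs]


if __name__ == "__main__":
    chroms = [f"chr{i}" for i in range(1, 23)]
    cmd = build_concat_command("germline_out", chroms, "germline.genomewide.vcf.gz")
    print(" ".join(cmd))
    # Only execute when bcftools is installed and every input file exists
    if shutil.which("bcftools") and all(Path(p).exists() for p in cmd[6:]):
        subprocess.run(cmd, check=True)
```

The chromosomes are passed in genomic order so that the concatenated file stays sorted.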
Please let me know if you have any additional questions or need further clarification about the process!
Hi @ZiyiWang7, thanks for the reply! I understand the need to do that for 10x Genomics data, but I have Smart-Seq2 data. I demultiplexed the FASTQ files before alignment, so each cell already has its own BAM and the barcode separation has already been carried out.
I understand I could theoretically merge the BAMs back into one bulk file and let Monopogen re-parse the barcodes read by read, but that seems like a massive waste of resources given that the demultiplexing is already done, and merging into a bulk BAM would double the storage I'd have to allocate to this project. Is there a way to streamline this for use with demultiplexed BAMs?
Hi, @itslittman @ZiyiWang7
Can you perform genome-wide germline calling using Smart-Seq2 data? I got the error:
Exception in thread "Thread-2" java.lang.RuntimeException: java.util.zip.DataFormatException: invalid code lengths set
    at net.sf.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:112)
    at net.sf.samtools.util.BlockCompressedInputStream.inflateBlock(BlockCompressedInputStream.java:383)
    at net.sf.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:365)
    at net.sf.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:109)
    at net.sf.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:238)
    at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:350)
    at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:393)
    at java.base/sun.nio.cs.StreamDecoder.lockedRead(StreamDecoder.java:217)
    at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:171)
    at java.base/java.io.InputStreamReader.read(InputStreamReader.java:186)
    at java.base/java.io.BufferedReader.fill(BufferedReader.java:160)
    at java.base/java.io.BufferedReader.implReadLine(BufferedReader.java:370)
    at java.base/java.io.BufferedReader.readLine(BufferedReader.java:347)
    at java.base/java.io.BufferedReader.readLine(BufferedReader.java:436)
    at blbutil.InputIt.next(InputIt.java:120)
    at blbutil.InputIt.next(InputIt.java:48)
    at vcf.RefIt.readLine(RefIt.java:288)
    at vcf.RefIt.lambda$fileReadingThread$15(RefIt.java:168)
    at java.base/java.lang.Thread.run(Thread.java:1570)
Do you have any idea about this? Thank you so much!
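One thing that may be worth ruling out: the trace fails inside `BlockGunzipper.unzipBlock` while a reader in `vcf.RefIt` is pulling lines from a block-compressed file, which could point to a truncated or corrupted `.vcf.gz` (for example, an incomplete download). That reading of the trace is an assumption, not a confirmed diagnosis. A minimal sketch for checking whether a given compressed file decompresses cleanly from start to end, using only the Python standard library:

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read through every compressed block; the contents are discarded
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Bad header, truncated stream, or corrupted deflate data
        return False
    return True
```

Calling `gzip_is_intact` on each compressed VCF involved in the run (the path is whatever file you pass in; none is assumed here) should show whether any of them needs to be re-downloaded or regenerated.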