svtyper icon indicating copy to clipboard operation
svtyper copied to clipboard

SVtyper with long reads

Open timp0 opened this issue 10 years ago • 7 comments

So - I'm trying to generate SV frequencies from long read data (ONT) aligned with a split-read aligner (BWA MEM) using SVtyper, but I keep getting an error:

Error: Traceback (most recent call last): File "/home/timp/Code/svtyper/svtyper", line 1078, in sys.exit(main()) File "/home/timp/Code/svtyper/svtyper", line 1070, in main args.debug) File "/home/timp/Code/svtyper/svtyper", line 767, in sv_genotype sample = Sample(bam_list[i], spl_bam_list[i], num_samp) File "/home/timp/Code/svtyper/svtyper", line 709, in init self.name = bam.header['RG'][0]['SM'] KeyError: 'RG'

Command:

##BWAMEM
~/Code/bwa/bwa mem -x ont2d /mithril/Data/NGS/Reference/human/hg19.fa ${fastq}_BC${bc}.fastq.gz >${outdir}/${prefix}.bwa.sam

samtools view -b -S ${outdir}/${prefix}.bwa.sam | samtools sort - ${outdir}/${prefix}.bwa
samtools index ${outdir}/${prefix}.bwa.bam

##From LUMPY readme                                                                                                                                                                                                                                                       

samtools view -b -F 1294 -S ${outdir}/${prefix}.bwa.sam >${outdir}/${prefix}.bwa.discordants.unsorted.bam

# Extract the split-read alignments                                                                                                                                                                                                                                       
samtools view -h -S ${outdir}/${prefix}.bwa.sam \
    | ~/Code/lumpy-sv/scripts/extractSplitReads_BwaMem -i stdin \
    | samtools view -Sb - \
    > ${outdir}/${prefix}.bwa.splitters.unsorted.bam

# Sort both alignments                                                                                                                                                                                                                                                    
samtools sort ${outdir}/${prefix}.bwa.discordants.unsorted.bam ${outdir}/${prefix}.bwa.discordants
samtools sort ${outdir}/${prefix}.bwa.splitters.unsorted.bam ${outdir}/${prefix}.bwa.splitters

samtools index ${outdir}/${prefix}.bwa.splitters.bam


~/Code/lumpy-sv/bin/lumpy \
    -mw 4 \
    -tt 0 \
    -sr id:${prefix}.bwa,bam_file:${outdir}/${prefix}.bwa.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    > ${outdir}/${prefix}.bwa.vcf


cat ${outdir}/${prefix}.bwa.vcf | \
    ~/Code/svtyper/svtyper \
    -B ${outdir}/${prefix}.bwa.bam \
    -S ${outdir}/${prefix}.bwa.splitters.bam \
    > ${outdir}/${prefix}.bwa.gt.vcf

timp0 avatar Jun 09 '15 17:06 timp0

I have no idea how svtyper will react to long-reads in general, but it looks like that problem is a result of improper read groups in your BAM. The easiest way to add readgroup ID and SM information is with bamaddrg

cc2qe avatar Jun 09 '15 17:06 cc2qe

Ok - I have added readgroup ID and SM information by adding an -R tag to my bwa command, but now I get a different error: "IndexError: list index out of range"

Error:

Traceback (most recent call last): File "/home/timp/Code/svtyper/svtyper", line 1078, in sys.exit(main()) File "/home/timp/Code/svtyper/svtyper", line 1070, in main args.debug) File "/home/timp/Code/svtyper/svtyper", line 767, in sv_genotype sample = Sample(bam_list[i], spl_bam_list[i], num_samp) File "/home/timp/Code/svtyper/svtyper", line 734, in init self.lib_dict[name].calc_insert_hist() File "/home/timp/Code/svtyper/svtyper", line 689, in calc_insert_hist med = median(valueCounts) File "/home/timp/Code/svtyper/svtyper", line 567, in median v = valueList[i] IndexError: list index out of range

Code:

##BWAMEM
~/Code/bwa/bwa mem -R "@RG\tID:${fastq}_BC${bc}\tSM:BC${bc}\tPL:ONT" -x ont2d /mithril/Data/NGS/Reference/human/hg19.fa ${fastq}_BC${bc}.fastq.gz >${outdir}/${prefix}.bwa.sam

samtools view -b -S ${outdir}/${prefix}.bwa.sam | samtools sort - ${outdir}/${prefix}.bwa
samtools index ${outdir}/${prefix}.bwa.bam

##From LUMPY readme                                                                                                                                                                                                                                                       

samtools view -b -F 1294 -S ${outdir}/${prefix}.bwa.sam >${outdir}/${prefix}.bwa.discordants.unsorted.bam

# Extract the split-read alignments                                                                                                                                                                                                                                       
samtools view -h -S ${outdir}/${prefix}.bwa.sam \
    | ~/Code/lumpy-sv/scripts/extractSplitReads_BwaMem -i stdin \
    | samtools view -Sb - \
    > ${outdir}/${prefix}.bwa.splitters.unsorted.bam

# Sort both alignments                                                                                                                                                                                                                                                    
samtools sort ${outdir}/${prefix}.bwa.discordants.unsorted.bam ${outdir}/${prefix}.bwa.discordants
samtools sort ${outdir}/${prefix}.bwa.splitters.unsorted.bam ${outdir}/${prefix}.bwa.splitters

samtools index ${outdir}/${prefix}.bwa.splitters.bam


~/Code/lumpy-sv/bin/lumpy \
    -mw 4 \
    -tt 0 \
    -sr id:${prefix}.bwa,bam_file:${outdir}/${prefix}.bwa.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    > ${outdir}/${prefix}.bwa.vcf


cat ${outdir}/${prefix}.bwa.vcf | \
    ~/Code/svtyper/svtyper \
    -B ${outdir}/${prefix}.bwa.bam \
    -S ${outdir}/${prefix}.bwa.splitters.bam \
    > ${outdir}/${prefix}.bwa.gt.vcf

timp0 avatar Jun 09 '15 19:06 timp0

Ok, this may be a larger problem. Are you reads single end or paired end? SVTyper needs paired reads to calculate an insert size distribution

cc2qe avatar Jun 09 '15 19:06 cc2qe

They are single end reads. I put (perhaps naively) the fully aligned bam into the -B, and the split reads (extracted with "extractSplitReads_BwaMem" from lumpy) into -S

What I'm trying to get out is the frequency of SV versus a background wild-type from a cancer sample - why do I need paired end? From a quick look at your code, it looks like you are calculating coverage in a region and determining number of breakpoints detected vs. depth of coverage?

timp0 avatar Jun 09 '15 19:06 timp0

Right, that's an accurate description of the tool, but unfortunately SVTyper will not work for your data without modifications. It expects paired-end reads because it was designed for Illumina short data, but you could probably fork it to operate on single-ends alone by removing the calls to calc_insert_hist() and count_pairedend()

cc2qe avatar Jun 09 '15 19:06 cc2qe

I actually plan to just be simpler than that for now - and use bedtools just to estimate coverage in the region - I'm using amplicons so I know where I'm looking.

That said, it may be an advantage given the recent work on long single read stuff (PacBio, ONT) to have an option or the ability to deal with this in SVtyper? Especially as long read offers many advantages for SV identification/typing.

W

On Tue, Jun 9, 2015 at 3:34 PM, Colby Chiang [email protected] wrote:

Right, that's an accurate description of the tool, but unfortunately SVTyper will not work for your data without modifications. It expects paired-end reads because it was designed for Illumina short data, but you could probably fork it to operate on single-ends alone by removing the calls to calc_insert_hist() and count_pairedend()

— Reply to this email directly or view it on GitHub https://github.com/cc2qe/svtyper/issues/11#issuecomment-110478621.

Winston Timp (410)-417-8467 Assistant Professor of Biomedical Engineering Johns Hopkins University

timp0 avatar Jun 10 '15 12:06 timp0

Ok, this may be a larger problem. Are you reads single end or paired end? SVTyper needs paired reads to calculate an insert size distribution

Same error [IndexError: list index out of range] with paired-end data. I'm trying to annotate structural variants vcf from "smrtsv2" (PacBio) with illumina reads for same sample aligned with BWA-mem.

Second sample also throws : ValueError: invalid literal for int() with base 10: '\xf0_\x05\x0cM\xb3\x02\x1c\xd9\x95\xc7'

osowiecki avatar Apr 22 '20 03:04 osowiecki