pavfinder
pavfinder copied to clipboard
false negative with "cannot map to exon" in log
I ran find_sv_transcriptome.py
on contigs assembled from a sample with a known FLT3-ITD. I do see the ITD sequence in some of the contigs, but I don't see the FLT3-ITD in the output file. For the contigs containing the ITD, I see the message printed:
Processing X:
X: cannot map event to exon
Here is an example contig containing the ITD, where I have the duplicated sequence bolded, and the exon 14 sequence in []. GGACTGTGGGGACAAGAGTAACTTTATTGAAAATACTAATCCTCCATGTTACTTCTGACTGGCCCTGAGTCTGGGAAGGCCGCCAAAGTGTCTAGGTGATGTATTACTCTTTATGGTAGAACACCTATTCATTATAAACTTCCCCCAATACAACCCCTGTTGTTGCAGAAATCTTAGGCTGTGACAACCATAGCTGCCTACACATTCCTTGTATCTTGGGGTAAAAGCACACGTGCTCTGGAAGGAATGTGTAGGTGGCTATGGGTGCACAATTTCAGGGGGTTCGTGAACTCCAGTTAAGACTTGCCCTAATTATACCATGTAAATAATTCAATAATGGGCAATTCTGTAGTAGAAATTTTATTCCCACCCATAAAATATATCACTAAATAGCTGAAAAATTTACATATTATTTTAAAACATAGACTTAAAAAATCATATTAGCTTCTCCTTAGCAAAATGCTTTTGTTTTATGTATTTACAAGAATATACTGTACTTCAGGTACACAATTCACTCAAGCCAGCCTGAGAAGGCCTTGGATGCAGATCAATGCTCCAATAAAGTTCATTATCAGCTCCTCCTGCCTTGTGACAGGATGATTTGATTTTACAAAAGTCCCTTTGAAAACAAGAGTAAACGCAGACAGCTTCTAGAGAAAAGTCTGGTGAAGCAGCAGTTGATAATAGATTTTCTTTTAGTGATGAAATTAATCTTGTTTTGGTAATCTACAGCCTGTTAGGGATAGGTGGAGGGATGAAGTCCTTAAAACTAAATTGTTCCTCTACGAATCTTCGACCTGAGCCTGCGGAGAGAGTAGCCCCAAATCCATCTCTCTGCTGAAAGGTCGCCTGTTTTGGTAGGTGTGAGGACATTCCGAAACACGGCCATCCACATTCTGATACATCGCTTCTTCTGCATCTGCCAGCTGACATCCTAAAAACGAAGTCAAATTAGGGAAGGATGGCCGTTTCCTTGAGTCAAAAGCCCAGCAGGATTGCATTATAATGTATATTTCTTCTGTAGCATAAAATGGCTGATCCATTTTAAATCCATTTTGAATCAGTTTGTAGAAGTTAGCATCAACCGGAATGCCAGGGTAAGGATTCACACCAAGTGAGAAGATTTCCCACAGTAATATTCCATATGACCAGACATCACTCTTAATGGTGTAGATGCCTTCAAACAGGCTTTCGGGGGCCATCCATTTTACAGGCAGACGGGCATTGCCCCTGACAACATAGTTGGAATCACTCATGATATCTCGAGCCAATCCAAAGTCACATATCTTCACCACTTTCCCGTGGGTGACAAGCACGTTCCTGGCGGCCAGGTCTCTGTGAACACACGACTTAAATTCCAGAAATTCCATTCCTTTGGCAACTTGATATGCAAAGCAAAGAAGATCTTCAAATGTAAGCACATTCAAGTCCTCCTCTTCTTCCAGCCTTTTTTGGTTTTCATATTCAATTTCATCTTCAGAGTGAAATGAATTCCCATGAAGCCCTGAGATTTGATCCGAGTCCGGGTGTATCTGAACTTCTCTTGAACCAGGCATGCTGGAATTTGGATGTGATTGGAAAGTGGGGTAAAAACTGAAATTGTGTTCCTTGAAAATCTCTGTCCAAGTCCTGTGAAATTTTTCTCTTTTACTTCTTAGATAGTTGAGAAGATCACCATAGCAACAGTATTCAAAAATCAAGTAAATTGGTCCTGACAGTGTGCACGCCCCCAGCAGGTTCACAATATTCTCGTGGCTTCCCAGCTGGGTCATCATCTTGAGTTCTGACATGAGTGCCTCTCTTTCAGAGCTGTCTGCTTTTTCTTTCAGCATTTTGACGGCAACCTGGATTGAGACTCCTGTTTTGCTAATTCCATAAGCTGTTGCGTTCATCACTTTTCCAAAAGCACCTGATCCTAGTACCTTCCCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACC[AAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTGTACCATCTGTAGCTGGCTTTCATACCTAAATTG]CTTTTTGTACTTGTGACAAATTAGCAGGGTTAAAACGACAATGAAGAGGAGACAAACACCAATTGTTGCATAGAATGAGATGTTGTCTTGGATGAAAGGGAAGGGGCCTGGAGAGTTTAAAAGGATCGTCTCACAAGATGTGCCAAGGGAATTGTATGCACAGCACTTGACCAGGAACCCTTTTATGGCTTCACTCATGTTTAGAGTACTGCTCGACACCCACTGTCCAAACACTTTTCTGTTAGCCTTTCTATTCCAGACTCCTTCTGTGATCTCTTCTGTGCAGTTGGGAGACTTGTCTGAACACTTCTTCCAGGTCCAAGATGGTAATGGGTATCCATCCGAGAAACAGGACGCCTGACTTGCCGATGCTTCTGCGAGCACTTGAGGTTTCCTTCTTATATTCAGCGTGAACATTTTGGTAAATTGGGCATCATCATTTTCTGCATGGAATATATATTCTCCTGGCTGGTGCTTATGATTGCAAAACTTGGATATGCTGTATCCGTTATCAAGACCCTTTTGCTCACAAGGAAATGATTTTCGAGAGAAGGTCCACGTACATCTGATTTGTGGGTAGGCTTTAAACCTGACAGAAAAACAAAACTCTTCATATTGGTCAATTTCATAATCTTCACTTGAATTGGTAGCATTTATAAATCCCTTTTCTACGATGGTAACCAAAGCTGATTGACTGGGATGCTTTGAAGAGGAACAAGTGTAGTATCCGGTGTCGTTTCTTGCCACTGATGATACAAAAGCAAACAGAATCCGTATCATAGTTCTGTTTGTTGAATAGGTACTCATCTCAAAGTAGTTGCCCTCCTCGAGTGCTTTGTTTTCTAATTCCCAGGTGAGCCCGAATCCATGGTTCACATGAACAGCTTTGCACCTTATCCATAAGGGTTCCCCTACTTTAAGAAATAATTGTGGCAATGTGGTCTGAGGAGTTTGATTTAGATCTATTGTGAACAGCCTGGTGCATTCCCTGCCCAGTTCATTTCTGGCACAGCACCTTATGTCCATCCCAAATAATTCATGAAGCACTTTTTCCTCCTTTTTAACAACAGCTGGACTTTCTTCTTTACAGCTTTCCCCCTGTGAATCGCAAAGCACCCATTCCACGATCGGCTCTGGAACGCTCTCAGATATGCAGACCAGGGCGTCCTGGTTTTCCATTTTTCTAAAGTAAGGTCTTCTTAATGTGTAAAGCAGGGTATTTCTTATACTCACTGTAAACAATATTGTGTAATTGGTAGCTTCACTCTGAATAAAAAGTAGGTATTCTCCAGCTTGGGTTTCTGTCATTTTCAAAATGACCATGGAAACAACTCCTCTGTTTTGTAAATCAAAATGTGGCTGGCAATTCAGGGAGCTGTGCTTAAAGACCCAGAGACAGGAAATGTTCCCTGGGGCATCGACCAGCACTTGCAGTGTGATGGAAGCAGATACATCCACTTCCACAGCGGCAGCTTCGTACACTGTCCCTGAGCTCTGGGGTCTCAACGCACACCCGAGGTCTTCCGGGGATTCTGATACCATGGGATATGATGATGACTTCCCCACTGATGAATCATTGTTCTTATGATTGATTAAAACACACTTGATCACAGGCAGATCTTGATTTGTAATAGTCCCAAATATCATTGCAGAAAAAACAACGAGCAGCGGCAGCTGGCCGCCGTCGCGCGCCAACGCCGGCATGGCCTCCGGAGCCCGGGGTCCCCAGGCCGCGCCGGCCCAGCCCTGCGATGCCGCCT
PAVFinder identifies ITD when the duplication is reported as an insertion in the CIGAR string or a split alignment from the aligner.
I tested your sequence with GMAP, and unfortunately it doesn't report the duplication as an insertion but rather as a bunch of base substitutions and small indels. I also tried Minimap2, again no insertion in the CIGAR string but a series of substitutions and possibly a couple of indels. Possibly the near exon-boundary location made it challenging for the aligners.
However, the event does get picked up if the alignment is done against the transcript sequence. So if I supplied the c2t
alignment (done by bwa-mem
against, for example, the test transcript sequence provided in the repo (pavfinder/test/transcriptome/refGene.fa
) which contains FLT3), the event is picked up - not labelled as an ITD but a "dup" (probably some coordinate issue doesn't qualify it as an ITD):
#chrom1 start1 end1 chrom2 start2 end2 name score strand1 strand2 orient1 orient2 event size gene1 transcript1 transcript_b
reak1 exon1 exon_bound1 gene2 transcript2 transcript_break2 exon2 exon_bound2 gene_5prime gene_3prime exon_5primeexon_3prime feature seq_id seq_breaks ins_seq homol_seq homol_seq_coords novel_seq novel_seq_coords copy_number_
change repeat_seq in_frame splice_motif probe support_span spanning_reads flanking_pairs
chr13 28608124 28608125 chr13 28608267 28608268 1 . + + R L dup 57 FLT3
NR_130706 1907 15 False FLT3 NR_130706 1854 14 False na na na na na test 1977
-2035 na na na GGA na na na na na CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAA
TTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG 1977-1978 na na
It's one of the main original purposes for including c2t
alignment, because the location of variants can make it tricky for splicing aligners but rather straightforward when no splicing needs to be considered.
I think my GMAP alignments do have the insertion. Here's an IGV screenshot:
Also, I did include the c2t alignment in my input, but did not see the event in my output. Here are my input files: pavfinder_input.zip
A different version of GMAP may be able to capture the insertion. Will take a look of your files.
Can I have the input fasta as well? thanks
Here's the input fasta: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz Also, here's the GTF in case that helps hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz
I mean the contig sequences, not the reference genome sequence
oops, never mind, the sequences are there
There is indeed problem using c2g alone, but with c2t, I am able to get the ITD.
I don't have the reference transcripts fasta you used for generating your c2t bam, so I generated my own using extract_transcript_sequence.py
from the repo:
python extract_transcript_sequence.py hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz hg19.ncbiRefSeq.fa hg19a.fa --only_longest --coding_first --index
(hg19a.fa is my version of hg19 without the "chr" prefix")
I then just used bwa mem
to align the contigs:
bwa mem -t24 hg19.ncbiRefSeq.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa | samtools view -bhS - -o GSM1636766_ITD_genes.c2t.bam
then use these files to run pavfinder:
python ~/work/git/pavfinder/pavfinder/scripts/find_sv_transcriptome.py --gbam pavfinder_input/GSM1636766_ITD_genes.c2g.sorted.bam --tbam GSM1636766_ITD_genes.c2t.bam --transcripts_fasta hg19.ncbiRefSeq.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz --r2c pavfinder_input/GSM1636766_ITD_genes.sorted.r2c.bam /projects/btl/rchiu/hg19a.fa test2
and here's the output:
#pavfinder 1.8.1
#2022-07-06:18:34:28 /home/rchiu/work/git/pavfinder/pavfinder/scripts/find_sv_transcriptome.py --gbam pavfinder_input/GSM1636766_ITD_genes.c2g.sorted.bam --tbam GSM1636766_ITD_genes.c2t.bam --t
ranscripts_fasta hg19.ncbiRefSeq.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz --r2c pavfinder_input/GSM1636766_ITD_genes.sorted.r2c.bam /proje
cts/btl/rchiu/hg19a.fa test2
#chrom1 start1 end1 chrom2 start2 end2 name score strand1 strand2 orient1 orient2 event size gene1 transcript1 transcript_break1 exon1 exon_bound1 gene2 t
ranscript2 transcript_break2 exon2 exon_bound2 gene_5prime gene_3prime exon_5prime exon_3prime feature seq_id seq_breaks ins_seq homol_seq homol_seq
_coords novel_seq novel_seq_coords copy_number_change repeat_seq in_frame splice_motif probe support_span spanning_reads flanking_pairs
13 28608124 28608125 13 28608267 28608268 1 . + + R L ITD 57 FLT3 NM_004119.3 1907 15 False F
LT3 NM_004119.3 1854 14 False na na na na na 5,77,25,26,71,73,74,89,28,30,27,31,32,29,70 1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-20
35,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035 na na na GGA na na na True na CAAACTCTAAATTTTCTCTTGGAAA
CTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAA
ACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGA
TCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTA
CTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTC
TAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCAT
ATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTT
GGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCT
GAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGC
CGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAA
ACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCA
AACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAG
ATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGT
ACTCATTATCTGAGGAGCCGGTCACCTG 1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978 262 n
a
3 41280844 41280845 3 41281303 41281304 2 . + + L R del 458 CTNNB1 NM_001330729.2 2704 16 True C
TNNB1 NM_001330729.2 2858 17 False na na na na 3UTR 13 175-176 na na na na na na na na na TTCACTTCTTGAGTCAC
TCCCAAAATCCATTTGTATTGTTACTCCTCGACCTAAAGGATGATTTACAGGTCAGTATCAAACCAGGCCAGCTGATTGCTGT 175-176 33 na
if you give me your "transcripts_fa" to generate your c2t bam, I can try it to see if this is the reason why you don't get it in your run.
I think for the c2g alignment the cigarstring is like this "105M90N35M1N58I97M86N107M" at the ITD, there is the "1N" in front of the "58I", as "N" supposedly represents intron, I think PAVFinder doesn't know how to interpret this, it may not be able to map it to an exon, and skip the event.
I am using an almost identical command to run path finder. Here is my output file: GSM1636766_ITD_genes.sv.bedpe.zip You can see the exact command that I used to run in the header. I see the same CTNNB1 UTR deletion that you see. However, in FLT3 I see two deletions and no ITD. Why do you think I may be getting different results than you?
Hmmm...thought the culprit was the difference between the FLT3 transcript used in the c2t alignment, but just checked your c2t bam I do see the 57I in the cigarsting.
Can you send me your /data/hg19.ncbiRefSeq.chrOnly.sorted.fasta
file? that's the only file I'm missing in order to replicate your run.
It is too large to upload here. Could you please email me at rebecca dot halperin at sema4 dot com and I'll send it to you?
If the sequence is too large to get through email, you can probably post the FLT3 sequence from your reference transcripts fasta, using it I should theoretically get the same result as yours on FLT3
Able to get the event (although it's categorized as "dup" instead of "ITD") with just your FLT3 transcript in the reference transcript fasta. This is what I've done (generating a new c2t.bam with the FLT3-only transcripts fasta first)
bwa mem flt3.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa | samtools view -bhS - -o c2t.flt3.bam
python ~/work/git/pavfinder/pavfinder/scripts/find_sv_transcriptome.py --gbam pavfinder_input/GSM1636766_ITD_genes.c2g.sorted.bam --tbam c2t.flt3.bam --r2c pavfinder_input/GSM1636766_ITD_genes.sorted.r2c.bam --transcripts_fasta flt3.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz /projects/btl/rchiu/hg19a.fa test3
and this is the event in the sv.bedpe
produced (same as what I reported above except the event_type
:
grep FLT3 test3/sv.bedpe
13 28608124 28608125 13 28608267 28608268 1 . + + dup 57 FLT3 NR_130706.2 1907 15 False FLT3 NR_130706.2 1854 14 Falsna na na na na 29,28,89,5,70,77,25,31,32,27,26,74,30,73,71 1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035 na na na GGA na na na na na CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG 1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978 262 na
Hmmm...so you're still not able to replicate the issue that I'm having. It looks like we're both using pavfinder 1.8.1
, so maybe it is a different version of a dependency. Here's what I've got:
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
samtools 1.15
Using htslib 1.15
GMAP version 2021-12-17
pybedtools==0.9.0
pysam==0.19.1
intspan==1.6.1
biopython==1.79
since you have pavfinder v1.8.1 in the output, I assume you have cloned the repo and run pavfinder from there
I haven't made a PR yet but will do so soon.
so far my conclusion is that this event will not be detected through c2g but only through c2t alignments, and I made sure for a positive contig sequence, such as contig 5, it has the identical cigarstring in both your pavfinder_input/GSM1636766_ITD_genes.c2t.sorted.bam
and my c2t using just the FLT3 transcript NR_130706.2
sequence:
5 16 NR_130706.2 11 60 1843M57I420M132D1553M4S
as the insertion 57I
is present in the cigarstring, I'm really puzzled why you can't capture the event.
Can you try running JUST using --c2t
(do not use --c2g
) to see what you got
python ~/work/git/pavfinder/pavfinder/scripts/find_sv_transcriptome.py --tbam c2t.flt3.bam --r2c pavfinder_input/GSM1636766_ITD_genes.sorted.r2c.bam --transcripts_fasta flt3.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz /projects/btl/rchiu/hg19a.fa test4c
oh, I've never seen your command, can you post how you ran pavfinder?
I figured out the problem. I took a more careful look at the stdout and noticed the message /bin/sh: 1: blastn: not found
. I installed blast, and reran pavfinder and it was able to detect the event!
You should have blast listed as a dependency and also have pavfinder exit with non-zero exit status if blast is not installed.
sorry for the overlook, I guess it just output a message to STDERR if blastn
is not in the PATH. I think that's not enough.
blastn
is required for detecting duplications, as there are many sv types attempted by the software, I overlooked that this is a requirement for detecting duplication specifically.
I will update the dependency. Thanks for sorting it out.