pavfinder icon indicating copy to clipboard operation
pavfinder copied to clipboard

false negative with "cannot map to exon" in log

Open rhalperin opened this issue 2 years ago • 18 comments

I ran find_sv_transcriptome.py on contigs assembled from a sample with a known FLT3-ITD. I do see the ITD sequence in some of the contigs, but I don't see the FLT3-ITD in the output file. For the contigs containing the ITD, I see the message printed:

Processing X:
X: cannot map event to exon

Here is an example contig containing the ITD, where I have the duplicated sequence bolded, and the exon 14 sequence in []. GGACTGTGGGGACAAGAGTAACTTTATTGAAAATACTAATCCTCCATGTTACTTCTGACTGGCCCTGAGTCTGGGAAGGCCGCCAAAGTGTCTAGGTGATGTATTACTCTTTATGGTAGAACACCTATTCATTATAAACTTCCCCCAATACAACCCCTGTTGTTGCAGAAATCTTAGGCTGTGACAACCATAGCTGCCTACACATTCCTTGTATCTTGGGGTAAAAGCACACGTGCTCTGGAAGGAATGTGTAGGTGGCTATGGGTGCACAATTTCAGGGGGTTCGTGAACTCCAGTTAAGACTTGCCCTAATTATACCATGTAAATAATTCAATAATGGGCAATTCTGTAGTAGAAATTTTATTCCCACCCATAAAATATATCACTAAATAGCTGAAAAATTTACATATTATTTTAAAACATAGACTTAAAAAATCATATTAGCTTCTCCTTAGCAAAATGCTTTTGTTTTATGTATTTACAAGAATATACTGTACTTCAGGTACACAATTCACTCAAGCCAGCCTGAGAAGGCCTTGGATGCAGATCAATGCTCCAATAAAGTTCATTATCAGCTCCTCCTGCCTTGTGACAGGATGATTTGATTTTACAAAAGTCCCTTTGAAAACAAGAGTAAACGCAGACAGCTTCTAGAGAAAAGTCTGGTGAAGCAGCAGTTGATAATAGATTTTCTTTTAGTGATGAAATTAATCTTGTTTTGGTAATCTACAGCCTGTTAGGGATAGGTGGAGGGATGAAGTCCTTAAAACTAAATTGTTCCTCTACGAATCTTCGACCTGAGCCTGCGGAGAGAGTAGCCCCAAATCCATCTCTCTGCTGAAAGGTCGCCTGTTTTGGTAGGTGTGAGGACATTCCGAAACACGGCCATCCACATTCTGATACATCGCTTCTTCTGCATCTGCCAGCTGACATCCTAAAAACGAAGTCAAATTAGGGAAGGATGGCCGTTTCCTTGAGTCAAAAGCCCAGCAGGATTGCATTATAATGTATATTTCTTCTGTAGCATAAAATGGCTGATCCATTTTAAATCCATTTTGAATCAGTTTGTAGAAGTTAGCATCAACCGGAATGCCAGGGTAAGGATTCACACCAAGTGAGAAGATTTCCCACAGTAATATTCCATATGACCAGACATCACTCTTAATGGTGTAGATGCCTTCAAACAGGCTTTCGGGGGCCATCCATTTTACAGGCAGACGGGCATTGCCCCTGACAACATAGTTGGAATCACTCATGATATCTCGAGCCAATCCAAAGTCACATATCTTCACCACTTTCCCGTGGGTGACAAGCACGTTCCTGGCGGCCAGGTCTCTGTGAACACACGACTTAAATTCCAGAAATTCCATTCCTTTGGCAACTTGATATGCAAAGCAAAGAAGATCTTCAAATGTAAGCACATTCAAGTCCTCCTCTTCTTCCAGCCTTTTTTGGTTTTCATATTCAATTTCATCTTCAGAGTGAAATGAATTCCCATGAAGCCCTGAGATTTGATCCGAGTCCGGGTGTATCTGAACTTCTCTTGAACCAGGCATGCTGGAATTTGGATGTGATTGGAAAGTGGGGTAAAAACTGAAATTGTGTTCCTTGAAAATCTCTGTCCAAGTCCTGTGAAATTTTTCTCTTTTACTTCTTAGATAGTTGAGAAGATCACCATAGCAACAGTATTCAAAAATCAAGTAAATTGGTCCTGACAGTGTGCACGCCCCCAGCAGGTTCACAATATTCTCGTGGCTTCCCAGCTGGGTCATCATCTTGAGTTCTGACATGAGTGCCTCTCTTTCAGAGCTGTCTGCTTTTTCTTTCAGCATTTTGACGGCAACCTGGATTGAGACTCCTGTTTTGCTAATTCCATAAGCTGTTGCGTTCATCACTTTTCCAAAAGCACCTGATCCTAGTACCTTCCCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACC[AAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTGTACCATCTGTAGCTGGCTTTCATACCTAAATTG]CTTTTTGTACTTGTGACAAATTAGCAGGGTTAAAACGACAATGAAGAGGAGACAAACACCAATTGTTGCATAGAATGAGATGTTGTCTTGGATGAAAGGGAAGGGGCCTGGAGAGTTTAAAAGGATCGTCTCACAAGATGTGCCAAGGGAATTGTATGCACAGCACTTGACCAGGAACCCTTTTATGGCTTCACTCATGTTTAGAGTACTGCTCGACACCCACTGTCCAAACACTTTTCTGTTAGCCTTTCTATTCCAGACTCCTTCTGTGATCTCTTCTGTGCAGTTGGGAGACTTGTCTGAACACTTCTTCCAGGTCCAAGATGGTAATGGGTATCCATCCGAGAAACAGGACGCCTGACTTGCCGATGCTTCTGCGAGCACTTGAGGTTTCCTTCTTATATTCAGCGTGAACATTTTGGTAAATTGGGCATCATCATTTTCTGCATGGAATATATATTCTCCTGGCTGGTGCTTATGATTGCAAAACTTGGATATGCTGTATCCGTTATCAAGACCCTTTTGCTCACAAGGAAATGATTTTCGAGAGAAGGTCCACGTACATCTGATTTGTGGGTAGGCTTTAAACCTGACAGAAAAACAAAACTCTTCATATTGGTCAATTTCATAATCTTCACTTGAATTGGTAGCATTTATAAATCCCTTTTCTACGATGGTAACCAAAGCTGATTGACTGGGATGCTTTGAAGAGGAACAAGTGTAGTATCCGGTGTCGTTTCTTGCCACTGATGATACAAAAGCAAACAGAATCCGTATCATAGTTCTGTTTGTTGAATAGGTACTCATCTCAAAGTAGTTGCCCTCCTCGAGTGCTTTGTTTTCTAATTCCCAGGTGAGCCCGAATCCATGGTTCACATGAACAGCTTTGCACCTTATCCATAAGGGTTCCCCTACTTTAAGAAATAATTGTGGCAATGTGGTCTGAGGAGTTTGATTTAGATCTATTGTGAACAGCCTGGTGCATTCCCTGCCCAGTTCATTTCTGGCACAGCACCTTATGTCCATCCCAAATAATTCATGAAGCACTTTTTCCTCCTTTTTAACAACAGCTGGACTTTCTTCTTTACAGCTTTCCCCCTGTGAATCGCAAAGCACCCATTCCACGATCGGCTCTGGAACGCTCTCAGATATGCAGACCAGGGCGTCCTGGTTTTCCATTTTTCTAAAGTAAGGTCTTCTTAATGTGTAAAGCAGGGTATTTCTTATACTCACTGTAAACAATATTGTGTAATTGGTAGCTTCACTCTGAATAAAAAGTAGGTATTCTCCAGCTTGGGTTTCTGTCATTTTCAAAATGACCATGGAAACAACTCCTCTGTTTTGTAAATCAAAATGTGGCTGGCAATTCAGGGAGCTGTGCTTAAAGACCCAGAGACAGGAAATGTTCCCTGGGGCATCGACCAGCACTTGCAGTGTGATGGAAGCAGATACATCCACTTCCACAGCGGCAGCTTCGTACACTGTCCCTGAGCTCTGGGGTCTCAACGCACACCCGAGGTCTTCCGGGGATTCTGATACCATGGGATATGATGATGACTTCCCCACTGATGAATCATTGTTCTTATGATTGATTAAAACACACTTGATCACAGGCAGATCTTGATTTGTAATAGTCCCAAATATCATTGCAGAAAAAACAACGAGCAGCGGCAGCTGGCCGCCGTCGCGCGCCAACGCCGGCATGGCCTCCGGAGCCCGGGGTCCCCAGGCCGCGCCGGCCCAGCCCTGCGATGCCGCCT

rhalperin avatar Jul 01 '22 17:07 rhalperin

PAVFinder identifies ITD when the duplication is reported as an insertion in the CIGAR string or a split alignment from the aligner.

I tested your sequence with GMAP, and unfortunately it doesn't report the duplication as an insertion but rather as a bunch of base substitutions and small indels. I also tried Minimap2, again no insertion in the CIGAR string but a series of substitutions and possibly a couple of indels. Possibly the near exon-boundary location made it challenging for the aligners.

However, the event does get picked up if the alignment is done against the transcript sequence. So if I supplied the c2t alignment (done by bwa-mem against, for example, the test transcript sequence provided in the repo (pavfinder/test/transcriptome/refGene.fa) which contains FLT3), the event is picked up - not labelled as an ITD but a "dup" (probably some coordinate issue doesn't qualify it as an ITD):

#chrom1	start1	end1	chrom2	start2	end2	name	score	strand1	strand2	orient1	orient2	event	size	gene1	transcript1	transcript_b
reak1	exon1	exon_bound1	gene2	transcript2	transcript_break2	exon2	exon_bound2	gene_5prime	gene_3prime	exon_5primeexon_3prime	feature	seq_id	seq_breaks	ins_seq	homol_seq	homol_seq_coords	novel_seq	novel_seq_coords	copy_number_
change	repeat_seq	in_frame	splice_motif	probe	support_span	spanning_reads	flanking_pairs
chr13	28608124	28608125	chr13	28608267	28608268	1	.	+	+	R	L	dup	57	FLT3
	NR_130706	1907	15	False	FLT3	NR_130706	1854	14	False	na	na	na	na	na	test	1977
-2035	na	na	na	GGA	na	na	na	na	na	CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAA
TTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG	1977-1978	na	na

It's one of the main original purposes for including c2t alignment, because the location of variants can make it tricky for splicing aligners but rather straightforward when no splicing needs to be considered.

readmanchiu avatar Jul 05 '22 16:07 readmanchiu

I think my GMAP alignments do have the insertion. Here's an IGV screenshot: image

Also, I did include the c2t alignment in my input, but did not see the event in my output. Here are my input files: pavfinder_input.zip

rhalperin avatar Jul 06 '22 12:07 rhalperin

A different version of GMAP may be able to capture the insertion. Will take a look of your files.

readmanchiu avatar Jul 06 '22 15:07 readmanchiu

Can I have the input fasta as well? thanks

readmanchiu avatar Jul 06 '22 15:07 readmanchiu

Here's the input fasta: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz Also, here's the GTF in case that helps hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz

rhalperin avatar Jul 06 '22 19:07 rhalperin

I mean the contig sequences, not the reference genome sequence

readmanchiu avatar Jul 07 '22 00:07 readmanchiu

oops, never mind, the sequences are there

readmanchiu avatar Jul 07 '22 00:07 readmanchiu

There is indeed problem using c2g alone, but with c2t, I am able to get the ITD. I don't have the reference transcripts fasta you used for generating your c2t bam, so I generated my own using extract_transcript_sequence.py from the repo:

python extract_transcript_sequence.py hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz hg19.ncbiRefSeq.fa hg19a.fa --only_longest --coding_first --index

(hg19a.fa is my version of hg19 without the "chr" prefix") I then just used bwa mem to align the contigs:

bwa mem -t24 hg19.ncbiRefSeq.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa | samtools view -bhS - -o GSM1636766_ITD_genes.c2t.bam 

then use these files to run pavfinder:

python ~/work/git/pavfinder/pavfinder/scripts/find_sv_transcriptome.py --gbam pavfinder_input/GSM1636766_ITD_genes.c2g.sorted.bam --tbam GSM1636766_ITD_genes.c2t.bam --transcripts_fasta hg19.ncbiRefSeq.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz --r2c pavfinder_input/GSM1636766_ITD_genes.sorted.r2c.bam /projects/btl/rchiu/hg19a.fa test2

and here's the output:

#pavfinder 1.8.1
#2022-07-06:18:34:28 /home/rchiu/work/git/pavfinder/pavfinder/scripts/find_sv_transcriptome.py --gbam pavfinder_input/GSM1636766_ITD_genes.c2g.sorted.bam --tbam GSM1636766_ITD_genes.c2t.bam --t
ranscripts_fasta hg19.ncbiRefSeq.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz --r2c pavfinder_input/GSM1636766_ITD_genes.sorted.r2c.bam /proje
cts/btl/rchiu/hg19a.fa test2
#chrom1	start1	end1	chrom2	start2	end2	name	score	strand1	strand2	orient1	orient2	event	size	gene1	transcript1	transcript_break1	exon1	exon_bound1	gene2	t
ranscript2	transcript_break2	exon2	exon_bound2	gene_5prime	gene_3prime	exon_5prime	exon_3prime	feature	seq_id	seq_breaks	ins_seq	homol_seq	homol_seq
_coords	novel_seq	novel_seq_coords	copy_number_change	repeat_seq	in_frame	splice_motif	probe	support_span	spanning_reads	flanking_pairs
13	28608124	28608125	13	28608267	28608268	1	.	+	+	R	L	ITD	57	FLT3	NM_004119.3	1907	15	False	F
LT3	NM_004119.3	1854	14	False	na	na	na	na	na	5,77,25,26,71,73,74,89,28,30,27,31,32,29,70	1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-20
35,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035	na	na	na	GGA	na	na	na	True	na	CAAACTCTAAATTTTCTCTTGGAAA
CTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAA
ACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGA
TCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTA
CTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTC
TAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCAT
ATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTT
GGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCT
GAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGC
CGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAA
ACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCA
AACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAG
ATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGT
ACTCATTATCTGAGGAGCCGGTCACCTG	1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978	262	n
a
3	41280844	41280845	3	41281303	41281304	2	.	+	+	L	R	del	458	CTNNB1	NM_001330729.2	2704	16	True	C
TNNB1	NM_001330729.2	2858	17	False	na	na	na	na	3UTR	13	175-176	na	na	na	na	na	na	na	na	na	TTCACTTCTTGAGTCAC
TCCCAAAATCCATTTGTATTGTTACTCCTCGACCTAAAGGATGATTTACAGGTCAGTATCAAACCAGGCCAGCTGATTGCTGT	175-176	33	na

if you give me your "transcripts_fa" to generate your c2t bam, I can try it to see if this is the reason why you don't get it in your run.

readmanchiu avatar Jul 07 '22 01:07 readmanchiu

I think for the c2g alignment the cigarstring is like this "105M90N35M1N58I97M86N107M" at the ITD, there is the "1N" in front of the "58I", as "N" supposedly represents intron, I think PAVFinder doesn't know how to interpret this, it may not be able to map it to an exon, and skip the event.

readmanchiu avatar Jul 07 '22 02:07 readmanchiu

I am using an almost identical command to run path finder. Here is my output file: GSM1636766_ITD_genes.sv.bedpe.zip You can see the exact command that I used to run in the header. I see the same CTNNB1 UTR deletion that you see. However, in FLT3 I see two deletions and no ITD. Why do you think I may be getting different results than you?

rhalperin avatar Jul 08 '22 20:07 rhalperin

Hmmm...thought the culprit was the difference between the FLT3 transcript used in the c2t alignment, but just checked your c2t bam I do see the 57I in the cigarsting. Can you send me your /data/hg19.ncbiRefSeq.chrOnly.sorted.fasta file? that's the only file I'm missing in order to replicate your run.

readmanchiu avatar Jul 08 '22 22:07 readmanchiu

It is too large to upload here. Could you please email me at rebecca dot halperin at sema4 dot com and I'll send it to you?

rhalperin avatar Jul 11 '22 12:07 rhalperin

If the sequence is too large to get through email, you can probably post the FLT3 sequence from your reference transcripts fasta, using it I should theoretically get the same result as yours on FLT3

readmanchiu avatar Jul 11 '22 14:07 readmanchiu

Able to get the event (although it's categorized as "dup" instead of "ITD") with just your FLT3 transcript in the reference transcript fasta. This is what I've done (generating a new c2t.bam with the FLT3-only transcripts fasta first)

bwa mem flt3.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa | samtools view -bhS - -o c2t.flt3.bam

python ~/work/git/pavfinder/pavfinder/scripts/find_sv_transcriptome.py --gbam pavfinder_input/GSM1636766_ITD_genes.c2g.sorted.bam --tbam c2t.flt3.bam --r2c pavfinder_input/GSM1636766_ITD_genes.sorted.r2c.bam --transcripts_fasta flt3.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz /projects/btl/rchiu/hg19a.fa test3

and this is the event in the sv.bedpe produced (same as what I reported above except the event_type:

grep FLT3 test3/sv.bedpe 
13	28608124	28608125	13	28608267	28608268	1	.	+	+	dup	57	FLT3	NR_130706.2	1907	15	False	FLT3	NR_130706.2	1854	14	Falsna	na	na	na	na	29,28,89,5,70,77,25,31,32,27,26,74,30,73,71	1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035,1977-2035	na	na	na	GGA	na	na	na	na	na	CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG,CAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATGGATTACCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTG	1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978,1977-1978	262	na

readmanchiu avatar Jul 11 '22 17:07 readmanchiu

Hmmm...so you're still not able to replicate the issue that I'm having. It looks like we're both using pavfinder 1.8.1, so maybe it is a different version of a dependency. Here's what I've got:

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188

samtools 1.15
Using htslib 1.15

GMAP version 2021-12-17

pybedtools==0.9.0
pysam==0.19.1
intspan==1.6.1
biopython==1.79

rhalperin avatar Jul 12 '22 15:07 rhalperin

since you have pavfinder v1.8.1 in the output, I assume you have cloned the repo and run pavfinder from there I haven't made a PR yet but will do so soon. so far my conclusion is that this event will not be detected through c2g but only through c2t alignments, and I made sure for a positive contig sequence, such as contig 5, it has the identical cigarstring in both your pavfinder_input/GSM1636766_ITD_genes.c2t.sorted.bam and my c2t using just the FLT3 transcript NR_130706.2 sequence:

5	16	NR_130706.2	11	60	1843M57I420M132D1553M4S

as the insertion 57I is present in the cigarstring, I'm really puzzled why you can't capture the event. Can you try running JUST using --c2t (do not use --c2g) to see what you got

python ~/work/git/pavfinder/pavfinder/scripts/find_sv_transcriptome.py --tbam c2t.flt3.bam --r2c pavfinder_input/GSM1636766_ITD_genes.sorted.r2c.bam --transcripts_fasta flt3.fa pavfinder_input/GSM1636766_ITD_genes.transcripts.nr.fa hg19.ncbiRefSeq.chrOnly.sorted.gtf.gz /projects/btl/rchiu/hg19a.fa test4c

oh, I've never seen your command, can you post how you ran pavfinder?

readmanchiu avatar Jul 12 '22 16:07 readmanchiu

I figured out the problem. I took a more careful look at the stdout and noticed the message /bin/sh: 1: blastn: not found. I installed blast, and reran pavfinder and it was able to detect the event!

You should have blast listed as a dependency and also have pavfinder exit with non-zero exit status if blast is not installed.

rhalperin avatar Jul 12 '22 20:07 rhalperin

sorry for the overlook, I guess it just output a message to STDERR if blastn is not in the PATH. I think that's not enough. blastn is required for detecting duplications, as there are many sv types attempted by the software, I overlooked that this is a requirement for detecting duplication specifically. I will update the dependency. Thanks for sorting it out.

readmanchiu avatar Jul 12 '22 21:07 readmanchiu