GeneFuse
GeneFuse copied to clipboard
Fusion-making .py script does not take into account gene orientation
Thanks for creating this tool! I have noticed that the scripts/make_fusion_genes.py
reports wrong order of exons for genes in '-' orientation, which results in considering some fusions as untranscribed and showing wrong fusion structure.
E.g. NTRK3 in cancer.hg38.csv:
NTRK3_ENST00000394480.6,chr15:87859751-88256768 1,88256644,88256747 2,88256284,88256486 3,88255906,88256168 4,88184225,88184299 ...
Versus NTRK3 from the fusions.csv file generated by the script:
NTRK3_NM_002530,chr15:87859750-88256739 1,87859750,87877120 2,87880269,87880428 3,87929190,87929434 4,87933011,87933184 ...
Here is a quick fix I introduced to make_fusion_genes.py
:
Line 33:
_, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
Line 44:
_, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
And insert these after line 50:
if strand == '-':
exons = exons[::-1]
This returns the exons in the right order.
@ZKai0801
Can you please check and fix this issue?
Hi, just to add to this, I encountered one more bug in the fusion list making script: it picks up the wrong gene if the gene name is contained in some other gene name (like RET and RETREG1) and the relevant transcript is not supplied. Example:
grep RET refFlat.txt
RETREG1 NM_019000 chr5 - 16473037 16508998 16474740 16508612 7 16473037,16477661,16478033,16478849,16481008,16483345,16508577, 16475234,16477788,16478098,16478987,16481093,16483472,16508998, RET NM_020975 chr10 + 43077068 43130349 43077258 43128269 20 43077068,43100458,43102341,43104951,43106375,43109030,43111206,43112098,43112852,43113555,43114479,43116583,43118372,43119530,43120080,43121945,43123670,43124882,43126574,43128111, 43077331,43100722,43102629,43105193,43106571,43109230,43111465,43112224,43112963,43113675,43114736,43116731,43118480,43119745,43120203,43122016,43123808,43124982,43126722,43130349,
and if I supply RET to current script, I'll get this:
RET_NM_001034850,chr5:16473037-16617058
I suspect this may be an issue for transcripts as well. I invented a quick fix, but someone more keen with regular expressions can just check if gene[0] or gene[1] plus tab symbol is in the line:
if len(gene) == 1:
transcripts = {}
with open(refflat, "r") as fh:
for line in fh:
cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
if cur_gene != gene[0]:
continue
transcripts[transcript] = (chrom, start, end, exonstart, exonend)
transcript = get_longest_transcript(transcripts.keys(), refflat)
chrom, start, end, exonstart, exonend = transcripts[transcript]
# use user-specified transcript
elif len(gene) == 2:
with open(refflat, "r") as fh:
for line in fh:
_, transcript, chrom, _, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
if transcript != gene[1]:
continue
break`
Thank you very much for reporting bugs and providing solutions! I have fixed the two bugs you mentioned as the way you suggested. @sfchen please could you merge the pull request I just created?
Just found out that I did not fully fix the strand issue, should have added strand to the identification of the longest transcript as well:
if len(gene) == 1:
transcripts = {}
with open(refflat, "r") as fh:
for line in fh:
cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
if cur_gene != gene[0]:
continue
transcripts[transcript] = (chrom, strand, start, end, exonstart, exonend)
transcript = get_longest_transcript(transcripts.keys(), refflat)
chrom, strand, start, end, exonstart, exonend = transcripts[transcript]
Now should be fully correct! Happy to help!
Just found out that I did not fully fix the strand issue, should have added strand to the identification of the longest transcript as well:
if len(gene) == 1: transcripts = {} with open(refflat, "r") as fh: for line in fh: cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t") if cur_gene != gene[0]: continue transcripts[transcript] = (chrom, strand, start, end, exonstart, exonend) transcript = get_longest_transcript(transcripts.keys(), refflat) chrom, strand, start, end, exonstart, exonend = transcripts[transcript]
Now should be fully correct! Happy to help!
You should submit another PR?
@nvolkovaGEL You are perfectly right, I should have more test before made commit. Besides adding the amendment as you suggested, I added some error msgs to the code just in case people put wrong gene symbol/transcript to the input file. @sfchen Could you please merge my second PR? sorry for the trouble
Merged.
Sorry, should have just made a pull request, but couldn't get my locked local machine to synchronize properly with github. Thanks, please feel free to close this ticket!