GeneFuse icon indicating copy to clipboard operation
GeneFuse copied to clipboard

Fusion-making .py script does not take into account gene orientation

Open nvolkovaGEL opened this issue 4 years ago • 8 comments

Thanks for creating this tool! I have noticed that the scripts/make_fusion_genes.py reports wrong order of exons for genes in '-' orientation, which results in considering some fusions as untranscribed and showing wrong fusion structure.

E.g. NTRK3 in cancer.hg38.csv:

NTRK3_ENST00000394480.6,chr15:87859751-88256768 1,88256644,88256747 2,88256284,88256486 3,88255906,88256168 4,88184225,88184299 ...

Versus NTRK3 from the fusions.csv file generated by the script:

NTRK3_NM_002530,chr15:87859750-88256739 1,87859750,87877120 2,87880269,87880428 3,87929190,87929434 4,87933011,87933184 ...

Here is a quick fix I introduced to make_fusion_genes.py:

Line 33: _, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t") Line 44: _, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t") And insert these after line 50: if strand == '-': exons = exons[::-1]

This returns the exons in the right order.

nvolkovaGEL avatar Jun 11 '20 09:06 nvolkovaGEL

@ZKai0801

Can you please check and fix this issue?

sfchen avatar Jun 23 '20 00:06 sfchen

Hi, just to add to this, I encountered one more bug in the fusion list making script: it picks up the wrong gene if the gene name is contained in some other gene name (like RET and RETREG1) and the relevant transcript is not supplied. Example: grep RET refFlat.txt

RETREG1 NM_019000 chr5 - 16473037 16508998 16474740 16508612 7 16473037,16477661,16478033,16478849,16481008,16483345,16508577, 16475234,16477788,16478098,16478987,16481093,16483472,16508998, RET NM_020975 chr10 + 43077068 43130349 43077258 43128269 20 43077068,43100458,43102341,43104951,43106375,43109030,43111206,43112098,43112852,43113555,43114479,43116583,43118372,43119530,43120080,43121945,43123670,43124882,43126574,43128111, 43077331,43100722,43102629,43105193,43106571,43109230,43111465,43112224,43112963,43113675,43114736,43116731,43118480,43119745,43120203,43122016,43123808,43124982,43126722,43130349,

and if I supply RET to current script, I'll get this:

RET_NM_001034850,chr5:16473037-16617058

I suspect this may be an issue for transcripts as well. I invented a quick fix, but someone more keen with regular expressions can just check if gene[0] or gene[1] plus tab symbol is in the line:

    if len(gene) == 1:
        transcripts = {}
        with open(refflat, "r") as fh:
            for line in fh:
                cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if cur_gene != gene[0]:
                    continue
                transcripts[transcript] = (chrom, start, end, exonstart, exonend)
        transcript = get_longest_transcript(transcripts.keys(), refflat)
        chrom, start, end, exonstart, exonend  = transcripts[transcript]
 # use user-specified transcript
    elif len(gene) == 2:
        with open(refflat, "r") as fh:
            for line in fh:
                _, transcript, chrom, _, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if transcript != gene[1]:
                    continue
                break`

nvolkovaGEL avatar Jun 26 '20 14:06 nvolkovaGEL

Thank you very much for reporting bugs and providing solutions! I have fixed the two bugs you mentioned as the way you suggested. @sfchen please could you merge the pull request I just created?

ZKai0801 avatar Jun 28 '20 01:06 ZKai0801

Just found out that I did not fully fix the strand issue, should have added strand to the identification of the longest transcript as well:

    if len(gene) == 1:
        transcripts = {}
        with open(refflat, "r") as fh:
            for line in fh:
                cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if cur_gene != gene[0]:
                        continue
                transcripts[transcript] = (chrom, strand, start, end, exonstart, exonend)
        transcript = get_longest_transcript(transcripts.keys(), refflat)
        chrom, strand, start, end, exonstart, exonend  = transcripts[transcript]

Now should be fully correct! Happy to help!

nvolkovaGEL avatar Jun 28 '20 19:06 nvolkovaGEL

Just found out that I did not fully fix the strand issue, should have added strand to the identification of the longest transcript as well:

    if len(gene) == 1:
        transcripts = {}
        with open(refflat, "r") as fh:
            for line in fh:
                cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if cur_gene != gene[0]:
                        continue
                transcripts[transcript] = (chrom, strand, start, end, exonstart, exonend)
        transcript = get_longest_transcript(transcripts.keys(), refflat)
        chrom, strand, start, end, exonstart, exonend  = transcripts[transcript]

Now should be fully correct! Happy to help!

You should submit another PR?

sfchen avatar Jun 28 '20 23:06 sfchen

@nvolkovaGEL You are perfectly right, I should have more test before made commit. Besides adding the amendment as you suggested, I added some error msgs to the code just in case people put wrong gene symbol/transcript to the input file. @sfchen Could you please merge my second PR? sorry for the trouble

ZKai0801 avatar Jun 29 '20 02:06 ZKai0801

Merged.

sfchen avatar Jun 29 '20 02:06 sfchen

Sorry, should have just made a pull request, but couldn't get my locked local machine to synchronize properly with github. Thanks, please feel free to close this ticket!

nvolkovaGEL avatar Jun 29 '20 16:06 nvolkovaGEL