gffread
gffread copied to clipboard
-H still has no effect on the output gff
I know there is a previous bugfix on this #65 , and it was marked a closed issue, but I still see it, using 0.12.7
Here's the test input, a transcript with an incorrect translation start coordinate (off by 1)
$cat badstart.fa
>chr_I:c7772805-7771137
TATGCAGGCAAGTGCAGATATTCGCACATTAGAGCGCAATGTCTTTATCAATGCCATTACAAAGATCAAGAGCGAGAATA
AAGGATTCAAATTCATTCTCGTTGTCGATACAATGACATTGCCAGCAATTTCTGCTTTAATCAGCACAACAGAGCTCATT
GAGCACAATGTCATTCTGACAGAACTTTACGAGAAGAAGCGTGAACCCTTCCCAATGTATCAAGCTATTTATTTACTTCA
TCCTTGCGTTGATATCAAACAATTTATTGCTGATTTCAGTGATAAAACACCATTATACGCGGCAGCTCACCTTTTATTCA
CATACACATGCCTTCCATCAACAATGAGTGCATTAGAATCACATCCTGAAGTTGTCAATCACATTTTAACATTTCTTGAT
CTCTGGATCCATTATCAGCCTACACAACCAGGTGTTTTCATGGTACCATCACCACGAGCCTACAAATTCCTCTATTCTCC
AAACTCTCCCGTTGATTTCAACGCGATCTCCGACGAGATCAACTACGGCCTTATCAGCTTCTTCTTGACCATCAAGGCCA
TCCCAAGTATCGCCTATCCAAAATCCATGGATAAAATCAAGAGATTCTGCACAAAATTCAACGATCAATTAATGGGCGTT
ACTCGCGAGCTAGATGATGGCGGACGATCCCTTAACAAGGACAACACCCTTCTTCTCATCATTCCTCGTGGTGCCGACCC
AATTACGCCATTACTCCACAGATTCACCTACCAATCCATGATTTACGAGAACTTCCCAGTTGAAAACGACTGTGTCTACC
TTAAGGAAGGCGATCCGAACAGCGTTGTCGCACTCAATCCATACGAAGACAAAATCTTCAACGATTACTGCTACAAACAC
TTCCAGAAGTTCATCGAGATCCGTGATCTCTCATCGCAAGTCACACAGACCCACAAAATCATTACAGATCCAAATATCGA
CAAAACATCTCAAAAATATGCGGAAGCAATGAAGCGCTACATCAGAGATCAGTCGTATGCTACAACAGTATCCAACCACC
TCGACATTTGTATACGTCTGGATGAAGCATTAACTAAGCGTTGCCTTGCAGATATTTCGAAATACGAACAACAGCTTGCC
GCCAAGGAGAAAGACGGCTCCAGCTACAAGCCGAACCTTTCCGACCTCCAGGACATCATTATCAAGCAGGGACCAGAGCC
AATGGACAAGTTAAGAGCATTGGCCGTTTACCAACTTTCCGGTGGAAAAATTTCTGACACAGAATTGGAGAGGCTTTTAG
AGGCAGGATCACTTACAACGGGCAACTGGAAAGATGCGATCTTGAACCTCAAATACCTTGGAGATGCAGCTCCAAGACAA
TCCTTGAGAGAAGTAAATCCGGTCGAAGAACCATTTGTTACTGACATTTACTACCCTTACGCTGCACAGGCTTGCATGGA
CGCAATCGATGGAAAACTCGACAAGAAGTGCTTCGAAATTCCAACACACACAGGAAGATACACAAACGTTGTTTTGTTCT
TCTTCGGAGGTGTTTCGTTCGCTGAATTAGAGAAGGTAGAGTTGCTCAGACAGAGAATGAAGAGCACAAAGATCTACGTC
GGTGCAACAAACACTTTGACGCCAAGACAGTTCTTGGCTCAGACAAGAGGTTTAACTTCAACTCAGTAA
cat short.badstart.gff
chr_I AUGUSTUS transcript 7771137 7772805 . - . ID=mrna-0948210
chr_I AUGUSTUS CDS 7771137 7772805 . - ID=cds-0948210;Parent=mrna-0948210
gffread correctly recognizes it as having in frame stop codons (at phase 0).
Command line was:
./gffread -g genome.fasta -P -v -E short.badstart.gff
.. loaded 1 genomic features from short.badstart.gff
Warning: In-frame STOP found for 'mrna-TVAGG3_0948210'
##gff-version 3
# gffread v0.12.7
# ./gffread -g genome.fasta -P -v -E short.badstart.gff
chr_I AUGUSTUS transcript 7771137 7772805 . - . ID=mrna-0948210;InFrameStop=true
chr_I AUGUSTUS CDS 7771137 7772805 . - 0 Parent=mrna-0948210
-y output has the numerous in-frame stops resulting from phase 0 translation
$cat short.badstart.outfasta
>mrna-0948210 InFrameStop=true
YAGKCRYSHIRAQCLYQCHYKDQERE.RIQIHSRCRYNDIASNFCFNQHNRAH.AQCHSDRTLREEA.TL
PNVSSYLFTSSLR.YQTIYC.FQ..NTIIRGSSPFIHIHMPSINNECIRITS.SCQSHFNIS.SLDPLSA
YTTRCFHGTITTSLQIPLFSKLSR.FQRDLRRDQLRPYQLLLDHQGHPKYRLSKIHG.NQEILHKIQRSI
NGRYSRAR.WRTIP.QGQHPSSHHSSWCRPNYAITPQIHLPIHDLRELPS.KRLCLP.GRRSEQRCRTQS
IRRQNLQRLLLQTLPEVHRDP.SLIASHTDPQNHYRSKYRQNISKICGSNEALHQRSVVCYNSIQPPRHL
YTSG.SIN.ALPCRYFEIRTTACRQGERRLQLQAEPFRPPGHHYQAGTRANGQVKSIGRLPTFRWKNF.H
RIGEAFRGRITYNGQLERCDLEPQIPWRCSSKTILERSKSGRRTICY.HLLPLRCTGLHGRNRWKTRQEV
LRNSNTHRKIHKRCFVLLRRCFVR.IREGRVAQTENEEHKDLRRCNKHFDAKTVLGSDKRFNFNSV
adding the -H argument removes the inframestop warning, and corrects the protein output, but does not correct the gff coordinates or phase:
Command line was:
./gffread -g genome.fasta -H -P -v -E short.badstart.gff
.. loaded 1 genomic features from short.badstart.gff
##gff-version 3
# gffread v0.12.7
# ./gffread -g genome.fasta -H -P -v -E short.badstart.gff
chr_I AUGUSTUS transcript 7771137 7772805 . - . ID=mrna-0948210
chr_I AUGUSTUS CDS 7771137 7772805 . - 0 Parent=mrna-0948210
-y output is correct
$cat short.badstart.H.out.fasta
>mrna-0948210
MQASADIRTLERNVFINAITKIKSENKGFKFILVVDTMTLPAISALISTTELIEHNVILTELYEKKREPF
PMYQAIYLLHPCVDIKQFIADFSDKTPLYAAAHLLFTYTCLPSTMSALESHPEVVNHILTFLDLWIHYQP
TQPGVFMVPSPRAYKFLYSPNSPVDFNAISDEINYGLISFFLTIKAIPSIAYPKSMDKIKRFCTKFNDQL
MGVTRELDDGGRSLNKDNTLLLIIPRGADPITPLLHRFTYQSMIYENFPVENDCVYLKEGDPNSVVALNP
YEDKIFNDYCYKHFQKFIEIRDLSSQVTQTHKIITDPNIDKTSQKYAEAMKRYIRDQSYATTVSNHLDIC
IRLDEALTKRCLADISKYEQQLAAKEKDGSSYKPNLSDLQDIIIKQGPEPMDKLRALAVYQLSGGKISDT
ELERLLEAGSLTTGNWKDAILNLKYLGDAAPRQSLREVNPVEEPFVTDIYYPYAAQACMDAIDGKLDKKC
FEIPTHTGRYTNVVLFFFGGVSFAELEKVELLRQRMKSTKIYVGATNTLTPRQFLAQTRGLTSTQ
I would think the -H corrected gff output should look like:
chr_I AUGUSTUS transcript 7771137 7772804 . - . ID=mrna-0948210
chr_I AUGUSTUS CDS 7771137 7772804 . - 0 Parent=mrna-0948210
or, if only the phase is corrected
chr_I AUGUSTUS transcript 7771137 7772805 . - . ID=mrna-0948210
chr_I AUGUSTUS CDS 7771137 7772805 . - 1 Parent=mrna-0948210