any2fasta

NCBI gbff file to fasta

Open tiramisutes opened this issue 5 years ago • 2 comments

Dear developer, GBFF files from NCBI usually have the following format. Running the any2fasta command outputs the entire sequence, but how can I output just the CDS sequence? Thanks.

LOCUS       XM_017747270            2892 bp    mRNA    linear   PLN 09-AUG-2016
DEFINITION  PREDICTED: Gossypium arboreum serine/threonine protein phosphatase
            2A regulatory subunit B''alpha-like (LOC108487170), transcript
            variant X1, mRNA.
ACCESSION   XM_017747270
VERSION     XM_017747270.1
DBLINK      BioProject: PRJNA335838
KEYWORDS    RefSeq.
SOURCE      Gossypium arboreum
  ORGANISM  Gossypium arboreum
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
            Pentapetalae; rosids; malvids; Malvales; Malvaceae; Malvoideae;
            Gossypium.
COMMENT     MODEL REFSEQ:  This record is predicted by automated computational
            analysis. This record is derived from a genomic sequence
            (NC_030664.1) annotated using gene prediction method: Gnomon.
            Also see:
                Documentation of NCBI's Annotation Process
            
            ##Genome-Annotation-Data-START##
            Annotation Provider         :: NCBI
            Annotation Status           :: Full annotation
            Annotation Version          :: Gossypium arboreum Annotation
                                           Release 100
            Annotation Pipeline         :: NCBI eukaryotic genome annotation
                                           pipeline
            Annotation Software Version :: 7.1
            Annotation Method           :: Best-placed RefSeq; Gnomon
            Features Annotated          :: Gene; mRNA; CDS; ncRNA
            ##Genome-Annotation-Data-END##
FEATURES             Location/Qualifiers
     source          1..2892
                     /organism="Gossypium arboreum"
                     /mol_type="mRNA"
                     /cultivar="Shixiya1"
                     /db_xref="taxon:29729"
                     /chromosome="1"
                     /country="China"
                     /collection_date="May-2010"
     gene            1..2892
                     /gene="LOC108487170"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon. Supporting evidence
                     includes similarity to: 16 Proteins, and 100% coverage of
                     the annotated genomic feature by RNAseq alignments,
                     including 10 samples with support for all annotated
                     introns"
                     /db_xref="GeneID:108487170"
     CDS             634..2271
                     /gene="LOC108487170"
                     /codon_start=1
                     /product="serine/threonine protein phosphatase 2A
                     regulatory subunit B''alpha-like"
                     /protein_id="XP_017602759.1"
                     /db_xref="GeneID:108487170"
                     /translation="MSLSIKMDIDAVEDVTCLDPELLQLPDVSPFALKASPQLVEDFF
                     SQWLSLPGTGHLVKSLIDDAKSGTIVNASANFSTLNAVGSHSLSSMFPSSNAPPLSPR
                     SSSGSPRTSKQKSSPSALGSPLKLVSEPMQEIIPQFYFQNGCPPTKELKEQCLSQINH
                     LFNNPLNGLQIDEFKAVTKEVCKLPSFLSSALFRKIDVEWTGIVTRDAFIKYWVDGNM
                     LTMDIATQIFEILKRPGCKYLTQVDFKPVLRELLATHPGLEFLRNTPEFQDRYAETVI
                     YRIFYHINRSGNGRLTLRELKRGSLVAAMQHADEEEDINKVLRYFSYEHFYVIYCKFW
                     ELDTDHDFFIDRENLIRYGNHALTYRIVDRIFSQAPRKFTSEVEGKMGYEDFVYFMFS
                     EEDKSSQPSLEYWFKCIDLDGNGVLTPNEMQFFYEEQLHRMECMAQEPVLFEDILCQI
                     IDMIAPEREYCITLQDLKRCKLSGNVFNILFNLNKFVAFESRDPFLIRQEREEPTLTE
                     WDHFAHREYIRLSMEEDVEDASNGSAEVWDESLEAPF"
ORIGIN      
        1 tatctttcat ccttcttcgc tgcagcttcc tattcctttt agtttcccct atgtccactc
       61 tctctgtaat aaaatcaaat gctaataata atactttgat ttctctgctc ctgttttctt
      121 cctctctccg tttcttttta atttttaaaa ccattcccta cttttaatca aattcacgtc
      181 aaatctcatt atcttcttgg catttttaag ttttttttcc gcactgaaag ttaacggaaa
      241 gtactcgaga atttatcagt ttctcttttt ggaagtaaaa caggctaaat tctttcgaga
      301 ctcttcgaag gatttggtat tccagtttat tcataacgcc ggcagctagg gttttggaga
      361 acggcgtatt ttaaacggtt acgtttctac ttccgttgaa gaaaaaaagg attttaccgt
      421 cttttttcct taactctttg gagcaagatt ttgtaattat ttccacggta tcgtcaattt
      481 accatatcat ttcggagcgt gttctttttc ccagttagag aaatctccga agtggcgttg
      541 atttcttttt gctgttgcat ttgaagaatt tgaaagagtt acaagtttta gggtgtttat
      601 ttttatttag tgctgtttga taaggtaggc gagatgtcat tatctataaa gatggatatt
      661 gatgcagtgg aggatgttac ttgtttggac cctgagcttt tgcagcttcc tgatgtttct
      721 ccatttgcac taaaagccag tcctcaactt gtagaggact ttttctctca gtggctttcg
      781 cttcctggga ccggccatct ggtgaaatct ttgattgatg atgcaaagtc agggacaata
      841 gttaacgctt ctgcaaactt ttctactcta aatgctgttg ggagccattc gttgtcttcc
      901 atgtttccaa gtagcaatgc acctccactt tctccaagaa gctcatctgg ttctcctcgc
      961 acgtcaaagc agaagtccag cccttctgct cttggctctc cattgaaatt agttagtgaa
     1021 ccaatgcaag aaatcattcc acagttttat ttccaaaatg gttgtccacc aaccaaggaa
     1081 ttgaaagaac aatgtctttc tcaaattaat caccttttta ataatcctct aaatggattg
     1141 caaatagatg agtttaaagc agtgacaaag gaagtttgca agctaccatc tttcctctct
     1201 tctgcacttt ttagaaaaat agatgtagag tggactggaa tagtgaccag agatgctttc
     1261 attaagtatt gggttgatgg aaatatgctg acgatggata tagcaactca aatatttgaa
     1321 attcttaagc gtccaggctg caagtacctc actcaggttg acttcaaacc tgttcttcga
     1381 gaacttttgg cgacccatcc aggattagaa ttcctgcgga acacgcctga atttcaagat
     1441 agatacgctg aaactgtcat atacagaata ttttatcaca tcaatagatc gggaaatggc
     1501 cgtcttaccc tcagggagct caaaagagga agtctggttg ctgccatgca acatgctgat
     1561 gaggaagagg acattaacaa agtccttagg tacttctcat atgaacattt ctatgttata
     1621 tactgtaagt tttgggagtt ggacacggac catgatttct tcatcgacag agaaaatctc
     1681 attagatatg gcaatcatgc ccttacctac aggattgttg atagaatatt ttcacaggct
     1741 ccacgaaaat ttactagtga ggtagaaggg aagatgggtt atgaggactt tgtctacttc
     1801 atgttttcgg aggaggacaa atcatctcag cctagtcttg agtattggtt taagtgcata
     1861 gatttggatg gaaatggtgt gctgacgcca aatgaaatgc aatttttcta tgaggagcag
     1921 ctgcatcgaa tggaatgcat ggcccaggaa cctgtgctct ttgaggacat attgtgtcaa
     1981 ataattgaca tgattgctcc tgagagagaa tattgcatca cgctacagga tttgaaaaga
     2041 tgcaaacttt caggaaatgt ttttaacatc cttttcaatc ttaataagtt tgtggctttc
     2101 gaaagccgtg atccattcct catacggcag gaacgtgagg aaccaacttt gacagagtgg
     2161 gatcactttg cacatagaga gtatatcagg ctttcaatgg aagaagatgt tgaagacgct
     2221 tcgaatggga gtgctgaagt atgggatgag tcgcttgaag ctccatttta atttttaagg
     2281 ttgctgaggt gagttttgta gtaccttgtc aaaagataat attcaaggtg aatgaagaaa
     2341 aattggctac ttggacattc tgcagatggt gtgcttgtct gcaaagtgat tggccacaag
     2401 cttcaaattc attcgtatag attttaccta tatagttcac ctgcaggcta tctagttgcc
     2461 atttttgcaa ctaagtggcg gcaacaaaat ttctgtcagg aaagccaatt gcttctcata
     2521 caagagaggg ttgattctcc ctgctcttaa ctaatcacca tctccctccc aggccaggta
     2581 tcaacagtct gctactatgt taaaactttt tgttctgttt ttagttggtg aaacaatcat
     2641 ttactgttat cagtctgtgc ctttggggtc gtggaggaaa gtaaaggtgg atggtggata
     2701 ctgcgattgc cttgttttgg tttagtggcc gcccctatct ttgttgccaa acagaaattt
     2761 cgttccccct tcgttactag ctcaacgact cttacctttt tttctcagtt tttggtacaa
     2821 tgtacatgtt ccttattttt ttgatccagt gggtgaaatg aacacttttt tttttttaaa
     2881 aaaggaaaag tt
//

tiramisutes, Jul 16 '19 12:07

There are two ways to get the CDS from a .gbff file.

  1. Just extract the true /translation string from each CDS entry.
  2. Perform a canonical translation of the DNA source sequence using the CDS coordinates.

I think all GenBank-produced files have /translation (1), but in general not all GenBank files have it. Also, (1) can differ from (2) because of post-translational modifications. The E. coli K12 genome has lots of examples of this.

Although this script is only really for DNA/contig data, it could potentially do (1) but will probably never do (2).
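
For a record like the one in the question (a single CDS with a plain `634..2271` location), both options can be sketched in a few lines of standard-library Python. This is only an illustration and not part of any2fasta: the function names and the tiny `DEMO` record are made up here, and the parser assumes forward-strand `start..end` locations, ignoring `join(...)`, `complement(...)`, partial locations, `/codon_start` and `/transl_table`.

```python
import re

# Standard genetic code (NCBI translation table 1), codons ordered t, c, a, g.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Hypothetical miniature record, invented for this example.
DEMO = """LOCUS       DEMO                      24 bp    mRNA    linear   PLN 01-JAN-2000
FEATURES             Location/Qualifiers
     source          1..24
     CDS             4..15
                     /codon_start=1
                     /translation="MSL"
ORIGIN
        1 gagatgtcat tataattttt aagg
//
"""


def read_origin(record: str) -> str:
    """Return the ORIGIN sequence as one lowercase string without digits or spaces."""
    origin = record.split("\nORIGIN", 1)[1].split("\n//", 1)[0]
    origin = origin.split("\n", 1)[1]  # drop the rest of the ORIGIN line itself
    return re.sub(r"[^a-z]", "", origin.lower())


def simple_cds_features(record: str):
    """Yield (start, end, translation) for CDS features with a plain start..end location."""
    pattern = re.compile(r"^ {5}CDS +(\d+)\.\.(\d+)[ \t]*$", re.MULTILINE)
    for match in pattern.finditer(record):
        start, end = int(match.group(1)), int(match.group(2))
        # Qualifier lines are indented by 21 spaces; stop at the next feature
        # key (5-space indent) or at the next top-level keyword such as ORIGIN.
        block = re.split(r"\n(?: {5}\S|\S)", record[match.end():], maxsplit=1)[0]
        found = re.search(r'/translation="([^"]*)"', block)
        translation = re.sub(r"\s+", "", found.group(1)) if found else None
        yield start, end, translation


def translate(cds: str) -> str:
    """Canonical translation; unknown codons become 'X', a trailing stop is dropped."""
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )
    return protein[:-1] if protein.endswith("*") else protein


if __name__ == "__main__":
    sequence = read_origin(DEMO)
    for start, end, annotated in simple_cds_features(DEMO):
        cds = sequence[start - 1:end]  # GenBank coordinates are 1-based, inclusive
        print(f">CDS_{start}_{end}\n{cds}")      # CDS nucleotide sequence as FASTA
        print("annotated (1):", annotated)       # the /translation qualifier
        print("canonical (2):", translate(cds))  # translation from coordinates
```

Replacing `DEMO` with the text of a real .gbff record should work the same way for simple locations; anything with `join(...)` or `complement(...)` needs a real GenBank parser.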

Is that what you desire?

tseemann, Jul 28 '19 22:07

Alternatively, there are lots of other tools that do what you need, for example: https://pypi.org/project/gbseqextractor/

tseemann, Jul 28 '19 22:07