any2fasta
NCBI gbff file to fasta
Dear,
gbff files from NCBI usually have the following format. Running the any2fasta command outputs the entire sequence. How can I output just the CDS sequence? Thanks.
LOCUS XM_017747270 2892 bp mRNA linear PLN 09-AUG-2016
DEFINITION PREDICTED: Gossypium arboreum serine/threonine protein phosphatase
2A regulatory subunit B''alpha-like (LOC108487170), transcript
variant X1, mRNA.
ACCESSION XM_017747270
VERSION XM_017747270.1
DBLINK BioProject: PRJNA335838
KEYWORDS RefSeq.
SOURCE Gossypium arboreum
ORGANISM Gossypium arboreum
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
Pentapetalae; rosids; malvids; Malvales; Malvaceae; Malvoideae;
Gossypium.
COMMENT MODEL REFSEQ: This record is predicted by automated computational
analysis. This record is derived from a genomic sequence
(NC_030664.1) annotated using gene prediction method: Gnomon.
Also see:
Documentation of NCBI's Annotation Process
##Genome-Annotation-Data-START##
Annotation Provider :: NCBI
Annotation Status :: Full annotation
Annotation Version :: Gossypium arboreum Annotation
Release 100
Annotation Pipeline :: NCBI eukaryotic genome annotation
pipeline
Annotation Software Version :: 7.1
Annotation Method :: Best-placed RefSeq; Gnomon
Features Annotated :: Gene; mRNA; CDS; ncRNA
##Genome-Annotation-Data-END##
FEATURES Location/Qualifiers
source 1..2892
/organism="Gossypium arboreum"
/mol_type="mRNA"
/cultivar="Shixiya1"
/db_xref="taxon:29729"
/chromosome="1"
/country="China"
/collection_date="May-2010"
gene 1..2892
/gene="LOC108487170"
/note="Derived by automated computational analysis using
gene prediction method: Gnomon. Supporting evidence
includes similarity to: 16 Proteins, and 100% coverage of
the annotated genomic feature by RNAseq alignments,
including 10 samples with support for all annotated
introns"
/db_xref="GeneID:108487170"
CDS 634..2271
/gene="LOC108487170"
/codon_start=1
/product="serine/threonine protein phosphatase 2A
regulatory subunit B''alpha-like"
/protein_id="XP_017602759.1"
/db_xref="GeneID:108487170"
/translation="MSLSIKMDIDAVEDVTCLDPELLQLPDVSPFALKASPQLVEDFF
SQWLSLPGTGHLVKSLIDDAKSGTIVNASANFSTLNAVGSHSLSSMFPSSNAPPLSPR
SSSGSPRTSKQKSSPSALGSPLKLVSEPMQEIIPQFYFQNGCPPTKELKEQCLSQINH
LFNNPLNGLQIDEFKAVTKEVCKLPSFLSSALFRKIDVEWTGIVTRDAFIKYWVDGNM
LTMDIATQIFEILKRPGCKYLTQVDFKPVLRELLATHPGLEFLRNTPEFQDRYAETVI
YRIFYHINRSGNGRLTLRELKRGSLVAAMQHADEEEDINKVLRYFSYEHFYVIYCKFW
ELDTDHDFFIDRENLIRYGNHALTYRIVDRIFSQAPRKFTSEVEGKMGYEDFVYFMFS
EEDKSSQPSLEYWFKCIDLDGNGVLTPNEMQFFYEEQLHRMECMAQEPVLFEDILCQI
IDMIAPEREYCITLQDLKRCKLSGNVFNILFNLNKFVAFESRDPFLIRQEREEPTLTE
WDHFAHREYIRLSMEEDVEDASNGSAEVWDESLEAPF"
ORIGIN
1 tatctttcat ccttcttcgc tgcagcttcc tattcctttt agtttcccct atgtccactc
61 tctctgtaat aaaatcaaat gctaataata atactttgat ttctctgctc ctgttttctt
121 cctctctccg tttcttttta atttttaaaa ccattcccta cttttaatca aattcacgtc
181 aaatctcatt atcttcttgg catttttaag ttttttttcc gcactgaaag ttaacggaaa
241 gtactcgaga atttatcagt ttctcttttt ggaagtaaaa caggctaaat tctttcgaga
301 ctcttcgaag gatttggtat tccagtttat tcataacgcc ggcagctagg gttttggaga
361 acggcgtatt ttaaacggtt acgtttctac ttccgttgaa gaaaaaaagg attttaccgt
421 cttttttcct taactctttg gagcaagatt ttgtaattat ttccacggta tcgtcaattt
481 accatatcat ttcggagcgt gttctttttc ccagttagag aaatctccga agtggcgttg
541 atttcttttt gctgttgcat ttgaagaatt tgaaagagtt acaagtttta gggtgtttat
601 ttttatttag tgctgtttga taaggtaggc gagatgtcat tatctataaa gatggatatt
661 gatgcagtgg aggatgttac ttgtttggac cctgagcttt tgcagcttcc tgatgtttct
721 ccatttgcac taaaagccag tcctcaactt gtagaggact ttttctctca gtggctttcg
781 cttcctggga ccggccatct ggtgaaatct ttgattgatg atgcaaagtc agggacaata
841 gttaacgctt ctgcaaactt ttctactcta aatgctgttg ggagccattc gttgtcttcc
901 atgtttccaa gtagcaatgc acctccactt tctccaagaa gctcatctgg ttctcctcgc
961 acgtcaaagc agaagtccag cccttctgct cttggctctc cattgaaatt agttagtgaa
1021 ccaatgcaag aaatcattcc acagttttat ttccaaaatg gttgtccacc aaccaaggaa
1081 ttgaaagaac aatgtctttc tcaaattaat caccttttta ataatcctct aaatggattg
1141 caaatagatg agtttaaagc agtgacaaag gaagtttgca agctaccatc tttcctctct
1201 tctgcacttt ttagaaaaat agatgtagag tggactggaa tagtgaccag agatgctttc
1261 attaagtatt gggttgatgg aaatatgctg acgatggata tagcaactca aatatttgaa
1321 attcttaagc gtccaggctg caagtacctc actcaggttg acttcaaacc tgttcttcga
1381 gaacttttgg cgacccatcc aggattagaa ttcctgcgga acacgcctga atttcaagat
1441 agatacgctg aaactgtcat atacagaata ttttatcaca tcaatagatc gggaaatggc
1501 cgtcttaccc tcagggagct caaaagagga agtctggttg ctgccatgca acatgctgat
1561 gaggaagagg acattaacaa agtccttagg tacttctcat atgaacattt ctatgttata
1621 tactgtaagt tttgggagtt ggacacggac catgatttct tcatcgacag agaaaatctc
1681 attagatatg gcaatcatgc ccttacctac aggattgttg atagaatatt ttcacaggct
1741 ccacgaaaat ttactagtga ggtagaaggg aagatgggtt atgaggactt tgtctacttc
1801 atgttttcgg aggaggacaa atcatctcag cctagtcttg agtattggtt taagtgcata
1861 gatttggatg gaaatggtgt gctgacgcca aatgaaatgc aatttttcta tgaggagcag
1921 ctgcatcgaa tggaatgcat ggcccaggaa cctgtgctct ttgaggacat attgtgtcaa
1981 ataattgaca tgattgctcc tgagagagaa tattgcatca cgctacagga tttgaaaaga
2041 tgcaaacttt caggaaatgt ttttaacatc cttttcaatc ttaataagtt tgtggctttc
2101 gaaagccgtg atccattcct catacggcag gaacgtgagg aaccaacttt gacagagtgg
2161 gatcactttg cacatagaga gtatatcagg ctttcaatgg aagaagatgt tgaagacgct
2221 tcgaatggga gtgctgaagt atgggatgag tcgcttgaag ctccatttta atttttaagg
2281 ttgctgaggt gagttttgta gtaccttgtc aaaagataat attcaaggtg aatgaagaaa
2341 aattggctac ttggacattc tgcagatggt gtgcttgtct gcaaagtgat tggccacaag
2401 cttcaaattc attcgtatag attttaccta tatagttcac ctgcaggcta tctagttgcc
2461 atttttgcaa ctaagtggcg gcaacaaaat ttctgtcagg aaagccaatt gcttctcata
2521 caagagaggg ttgattctcc ctgctcttaa ctaatcacca tctccctccc aggccaggta
2581 tcaacagtct gctactatgt taaaactttt tgttctgttt ttagttggtg aaacaatcat
2641 ttactgttat cagtctgtgc ctttggggtc gtggaggaaa gtaaaggtgg atggtggata
2701 ctgcgattgc cttgttttgg tttagtggcc gcccctatct ttgttgccaa acagaaattt
2761 cgttccccct tcgttactag ctcaacgact cttacctttt tttctcagtt tttggtacaa
2821 tgtacatgtt ccttattttt ttgatccagt gggtgaaatg aacacttttt tttttttaaa
2881 aaaggaaaag tt
//
There are two ways to get the CDS from a `.gbff` file:
- (1) just extract the true `/translation` string from each `CDS` entry
- (2) perform a canonical translation of the DNA `source` sequence using the `CDS` coordinates.

I think all Genbank-produced files have `/translation` (1), but in general not all Genbank files have it. Also, (1) can differ from (2) because of post-translational modifications. The E. coli K12 genome has lots of examples of this.
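The first approach (reading the stored `/translation` qualifier) could look roughly like the sketch below. It uses only the Python standard library and is not part of any2fasta; the function name `extract_translations` is hypothetical, and it assumes the `/protein_id` qualifier appears before `/translation` within each CDS feature, as it does in the record above. The example input is a shortened excerpt of that record.

```python
import re

# Qualifier values may wrap across several lines, so match up to the closing quote.
TRANSLATION_RE = re.compile(r'/translation="([^"]*)"')
PROTEIN_ID_RE = re.compile(r'/protein_id="([^"]+)"')


def extract_translations(gb_text):
    """Return a list of (name, protein) pairs, one per /translation qualifier."""
    records = []
    previous_end = 0
    for number, match in enumerate(TRANSLATION_RE.finditer(gb_text), start=1):
        # Remove the line breaks and indentation used to wrap the qualifier.
        protein = re.sub(r"\s+", "", match.group(1))
        # Only look for a protein_id between the previous translation and this one.
        ids = PROTEIN_ID_RE.findall(gb_text, previous_end, match.start())
        name = ids[-1] if ids else f"cds_{number}"
        records.append((name, protein))
        previous_end = match.end()
    return records


# Shortened excerpt of the CDS feature shown above (translation truncated).
example = '''     CDS             634..2271
                     /protein_id="XP_017602759.1"
                     /translation="MSLSIKMDIDAVEDVTCLDPELLQLPDVSPFALKASPQLVEDFF
                     SQWLSLPGTGHLVK"
'''

for name, protein in extract_translations(example):
    print(f">{name}\n{protein}")
```

CDS features without a `/translation` qualifier are simply not reported by this sketch.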
Although this script is only really for DNA/contig data, it could potentially do (1) but will probably never do (2).
Is that what you desire?
Alternatively, there are lots of other tools that do what you need, for example: https://pypi.org/project/gbseqextractor/
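If the goal is the CDS nucleotide sequence rather than the protein, the `CDS` coordinates can be applied directly to the `ORIGIN` sequence. The following is a minimal standard-library sketch, not an any2fasta feature: `cds_to_fasta` is a hypothetical name, and it assumes simple forward-strand `start..end` locations such as `634..2271` in the record above. Locations written as `join(...)`, `complement(...)`, or with partial markers (`<`, `>`) are silently skipped.

```python
import re

# Matches only simple locations such as "CDS   634..2271" on a line of their own.
CDS_RE = re.compile(r"^\s*CDS\s+(\d+)\.\.(\d+)\s*$", re.MULTILINE)
ACCESSION_RE = re.compile(r"^ACCESSION\s+(\S+)", re.MULTILINE)


def cds_to_fasta(gbff_text):
    """Return FASTA text with one entry per simple CDS location in each record."""
    entries = []
    # Records in a multi-record file are terminated by a line containing "//".
    for record in re.split(r"^//\s*$", gbff_text, flags=re.MULTILINE):
        header, found, origin = record.partition("\nORIGIN")
        if not found:
            continue  # no sequence section in this chunk
        # Skip the rest of the ORIGIN line, then drop position numbers and spaces.
        sequence = re.sub(r"[^A-Za-z]", "", origin.partition("\n")[2])
        accession = ACCESSION_RE.search(header)
        name = accession.group(1) if accession else "unknown"
        for start, end in CDS_RE.findall(header):
            start, end = int(start), int(end)
            # GenBank coordinates are 1-based and inclusive.
            cds = sequence[start - 1:end]
            entries.append(f">{name}_CDS_{start}_{end}\n{cds}")
    return "\n".join(entries)


# Tiny made-up record used only to illustrate the slicing.
toy_record = """LOCUS       TEST1
ACCESSION   TEST1
FEATURES             Location/Qualifiers
     CDS             4..12
                     /codon_start=1
ORIGIN
        1 aaaatggcct aatt
//
"""

print(cds_to_fasta(toy_record))
```

For real annotation files with spliced or reverse-strand features, a full GenBank parser or a dedicated tool such as the one linked above is the safer choice.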