gffread
CDS and protein
I installed gffread and first ran this command: gffread -w transcript.fa -g Genome.fasta Anotacion.gff. The output looks fine (all the CDS transcripts start with ATG).
Then I tried: gffread -y proteins.pep -g Genome.fasta Anotacion.gff
In this case the proteins do not start with M, so the correct ORF is lost. I noticed that in the transcript FASTA file the headers report different CDS coordinates: 3500 transcripts have CDS=1-end, but 12356 transcripts have CDS=2-end and 4015 have CDS=3-end.
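The counts above can be reproduced with a short script. This is only a minimal sketch: it assumes each header in transcript.fa looks like ">transcript_id CDS=start-end", and the function name tally_cds_starts and the example records are made up for illustration.

```python
import re
from collections import Counter

# Assumed header layout (not verified against every gffread version):
#   >transcript_id CDS=start-end
CDS_TAG = re.compile(r"CDS=(\d+)-(\d+)")


def tally_cds_starts(fasta_lines):
    """Count transcripts by the CDS start coordinate found in each header."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        match = CDS_TAG.search(line)
        if match:
            counts[int(match.group(1))] += 1
        else:
            counts[None] += 1  # header without a CDS= tag
    return counts


# Made-up records, for illustration only
example = [
    ">tx1 CDS=1-9", "ATGAAATAA",
    ">tx2 CDS=2-10", "GATGAAATAA",
    ">tx3 CDS=2-7", "CATGTAAGG",
]
print(tally_cds_starts(example))
```

On the real file, the same function can be called on an open file handle, for example `with open("transcript.fa") as fh: print(tally_cds_starts(fh))`.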
So only the 3500 transcripts whose CDS coordinates run from 1 to the end get a correct protein. The 12356 transcripts whose CDS starts at 2 are translated from an incorrect start, which generates stop codons because the reading frame is wrong.
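To check whether the frame really depends on the CDS= coordinates, one transcript can be translated two ways: from the first base of the transcript, and from the CDS start given in its header. The sketch below assumes the same ">transcript_id CDS=start-end" header layout with 1-based inclusive coordinates and the standard genetic code. The helper names cds_slice and translate and the example record are hypothetical, and this is not how gffread itself translates.

```python
import re

# Standard genetic code, built in the usual TCAG codon order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

CDS_TAG = re.compile(r"CDS=(\d+)-(\d+)")


def cds_slice(header, sequence):
    """Return the part of the transcript covered by the header's CDS=start-end tag.

    Coordinates are assumed to be 1-based and inclusive.
    """
    match = CDS_TAG.search(header)
    if match is None:
        raise ValueError("no CDS= tag in header: " + header)
    start, end = int(match.group(1)), int(match.group(2))
    return sequence[start - 1:end]


def translate(sequence):
    """Translate in frame from the first base; unknown codons become X."""
    sequence = sequence.upper()
    return "".join(
        CODON_TABLE.get(sequence[i:i + 3], "X")
        for i in range(0, len(sequence) - 2, 3)
    )


# Made-up record, for illustration only
header = ">tx2 CDS=2-10"
transcript = "GATGAAATAA"

from_transcript_start = translate(transcript)
from_cds_start = translate(cds_slice(header, transcript))
print(from_transcript_start, from_cds_start)
```

In this made-up record, only the translation that begins at the CDS start begins with M. Running the same comparison on a few of the real CDS=2-end transcripts would show whether their proteins.pep entries match either translation.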
My question is how to correct the GFF file or the CDS starting point. gffread is the first step in the PanExplorer pipeline, and I need all the proteins to continue and get the results.