Cannot get accurate protein sequences from the GFF file
I tried to extract the CDS sequences from the GFF file:
gffread -g chm13.draft_v1.1.fasta -x cds.fa chm13.draft_v1.1.gene_annotation.v4.gff3
However, when I translate the CDS sequences to proteins, the open reading frame is incorrect for quite a few of them. Is there a way to download the predicted protein sequences?
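To show what "not correct" means here, the following is a minimal sketch of how the extracted CDS records could be checked. It assumes cds.fa is the plain nucleotide FASTA written by the gffread command above and uses the standard genetic code. The function names (read_fasta, translate, orf_problems) are illustrative and not part of any tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def read_fasta(path):
    """Yield (record_id, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def translate(cds):
    """Translate complete codons from the first base; unknown codons become X."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def orf_problems(cds):
    """Return a list of reasons why a CDS is not a complete, proper ORF."""
    problems = []
    if len(cds) % 3:
        problems.append("length_not_multiple_of_3")
    protein = translate(cds)
    if not protein.startswith("M"):
        problems.append("no_start_codon")
    if not protein.endswith("*"):
        problems.append("no_stop_codon")
    if "*" in protein[:-1]:
        problems.append("internal_stop")
    return problems
```

For example, looping over read_fasta("cds.fa") and printing every record for which orf_problems returns a non-empty list gives the set of affected transcripts.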
It would also be very interesting to use AlphaFold to predict the structures of the 115 new proteins from this complete genome. Structures for all the other proteins have already been predicted recently and are available at https://alphafold.ebi.ac.uk (some only as downloads).
Hi @ATPs ,
I created a file with the predicted protein sequences that you can use: http://courtyard.gi.ucsc.edu/~mhauknes/T2T/chm13.draft_v1.1.gene_annotation.protein.fasta
These incorrect open reading frames are expected from the GENCODE annotation; they aren't errors. For example, many GENCODE transcripts carry tags such as cds_end_NF and cds_start_NF, which mark fragments that were annotated (probably from ESTs) but lack sufficient evidence. These are propagated into our gene annotations. If you want only transcripts with full, proper ORFs, ignore any transcript tagged proper_orf=False in the gff3.
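As a rough sketch of that filtering step, the snippet below drops every feature tagged proper_orf=False, along with any child feature whose Parent is such a transcript. It assumes the tag appears as a key=value pair in column 9 of the gff3 and that child features (exons, CDS) point to their transcript through Parent. The function names are illustrative, not part of any tool.

```python
def parse_attributes(column9):
    """Parse a GFF3 attribute column (key=value pairs separated by ';')."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def filter_proper_orf(lines):
    """Return the GFF3 lines with proper_orf=False features and their children removed."""
    lines = list(lines)

    # First pass: collect the IDs of features tagged proper_orf=False.
    rejected = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if attrs.get("proper_orf") == "False" and "ID" in attrs:
            rejected.add(attrs["ID"])

    # Second pass: keep comments and every feature not linked to a rejected ID.
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if not line.startswith("#") and len(cols) == 9:
            attrs = parse_attributes(cols[8])
            parents = set(attrs.get("Parent", "").split(","))
            if (attrs.get("proper_orf") == "False"
                    or attrs.get("ID") in rejected
                    or parents & rejected):
                continue
        kept.append(line)
    return kept
```

The kept lines can be written to a new gff3 file and passed to the same gffread command as before. Note that this sketch leaves gene records in place even if all of their transcripts were removed.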