CHM13 cannot get accurate protein sequences from the gff file

I tried to extract the CDS sequences from the GFF file:

gffread -g chm13.draft_v1.1.fasta -x cds.fa chm13.draft_v1.1.gene_annotation.v4.gff3

However, when I translate the CDS sequences to proteins, the open reading frame is incorrect for many of them. Is there a way to download the predicted protein sequences?
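
One way to quantify the problem is to check each extracted CDS for the basic properties of a complete ORF. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the standard nuclear genetic code (so mitochondrial genes would be misreported) and a plain FASTA file such as the cds.fa produced above.

```python
# Stop codons of the standard genetic code (assumption: nuclear genes only).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def orf_problems(seq):
    """Return a list of reasons why seq is not a complete ORF (empty if none)."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length_not_multiple_of_3")
    if not seq.startswith("ATG"):
        problems.append("no_start_codon")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no_terminal_stop")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal_stop")
    return problems


def summarize(fasta_path):
    """Count how many records show each problem ('ok' means no problem)."""
    counts = {}
    with open(fasta_path) as handle:
        for _name, seq in read_fasta(handle):
            for problem in orf_problems(seq) or ["ok"]:
                counts[problem] = counts.get(problem, 0) + 1
    return counts
```

Calling summarize("cds.fa") returns a dictionary of counts per problem type, which shows how many of the extracted sequences are affected and in what way.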

Oct 24 '21 08:10 ATPs

It would also be very interesting to use AlphaFold to predict the structures of the 115 new proteins from this complete genome. Structures for all the other proteins have already been predicted recently and are available at https://alphafold.ebi.ac.uk (some only as downloads).

Nov 26 '21 22:11 ValZapod

Hi @ATPs ,

I created a file with the predicted protein sequences here that you can use: http://courtyard.gi.ucsc.edu/~mhauknes/T2T/chm13.draft_v1.1.gene_annotation.protein.fasta

Dec 07 '21 05:12 mhaukness-ucsc

These incorrect open reading frames are to be expected from the GENCODE annotation (they aren't errors). For example, many of the transcripts in GENCODE have tags like cds_end_NF and cds_start_NF, which mark fragments that are annotated (probably from ESTs) but lack sufficient evidence. These are propagated down into our gene annotations. If you want to include only transcripts with full, proper ORFs, you can ignore any transcript with the tag proper_orf=False in the gff3.
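
As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes the tag appears literally as proper_orf=False in the attributes column (column 9) and that child features such as exon and CDS lines point to their transcript through the Parent attribute, as in standard GFF3. The function names are hypothetical, and this is not the script used to build the protein file above.

```python
def parse_attributes(column9):
    """Parse a GFF3 attributes column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def feature_attributes(line):
    """Return the attribute dict of a feature line, or None for other lines."""
    if line.startswith("#") or not line.strip():
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    return parse_attributes(cols[8])


def filter_proper_orf(lines):
    """Drop features tagged proper_orf=False together with their children."""
    lines = list(lines)

    # Pass 1: collect the IDs of all features tagged proper_orf=False.
    dropped = set()
    for line in lines:
        attrs = feature_attributes(line)
        if attrs and attrs.get("proper_orf") == "False" and "ID" in attrs:
            dropped.add(attrs["ID"])

    # Pass 2: keep comments and every feature that is neither tagged
    # itself nor a child of a tagged feature.
    kept = []
    for line in lines:
        attrs = feature_attributes(line)
        if attrs is None:
            kept.append(line)
            continue
        parents = set(attrs.get("Parent", "").split(","))
        if attrs.get("proper_orf") == "False" or parents & dropped:
            continue
        kept.append(line)
    return kept
```

The filtered lines can be written to a new gff3 file and passed to gffread in place of the original annotation. Note that a gene line is kept even if all of its transcripts are removed, which may or may not matter for downstream tools.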

Dec 07 '21 21:12 mhaukness-ucsc