GFF3toolkit
removing models from a list
Hello,
I would like to use the GFF3toolkit to remove some gene models (each with a single isoform, listed in an external file) from a GFF3 file. I first ran
gff3_QC -g assembly_MAKER1.gff -f assembly.fa -o QC_report1 -s QC_stats1
and got this report:
==> QC_report <==
Line_num Error_code Error_level Error_tag
['Line 1'] Esf0014 Error ["##gff-version" missing from the first line]
['Line 15079'] Esf0012 Info [Found 5 Ns in CDS feature of length 296 using the external FASTA, consists of 1 segment (start, length): (210940, 5)]
==> QC_stats <==
Error_code Number_of_problematic_models Error_level Error_tag
Esf0014 1 Error ##gff-version" missing from the first line
Esf0012 1 Info Found Ns in a feature using the external FASTA
(I can fix the header myself)
I wonder how I can use gff3_fix to remove ~1500 genes (gene, mRNA, exon, and CDS lines): is it possible to create a 4-column file to submit to -qc_r? Can I use any of the error codes that have a "delete_model" function? Is there a way to specify the gene ID instead of the line number?
Also, is there a feature to remove gene models whose protein sequence does not start with M? Thanks, Dario
Hi @dcopetti - that's an interesting use case! I suppose you could hack a qc report file to get that done. The qc reports are line-based because not every feature in gff3 is required to have an ID. So you could provide the line number of the gene feature and assign it an error code that uses the delete_model function (https://github.com/NAL-i5K/GFF3toolkit/blob/master/docs/gff3_fix.py-documentation.rst). I've never tried this, but it might work.
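For readers who want to work from gene IDs rather than line numbers, here is a minimal Python sketch that drops the listed genes and every feature descending from them through Parent attributes. It is not part of GFF3toolkit: the function names `parse_attributes` and `remove_gene_models` are hypothetical, and it assumes a tab-delimited GFF3 in which each gene carries an ID and each child carries a Parent.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def remove_gene_models(gff_lines, gene_ids):
    """Return the lines of a GFF3 file without the listed genes and their descendants.

    Assumption: a feature is dropped if ANY of its parents is dropped, which is
    fine for single-isoform genes but may be too aggressive for shared exons.
    """
    lines = list(gff_lines)
    parsed = []  # per line: (ID or None, [parents]); None for comments and non-feature lines
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            parsed.append(None)
            continue
        attrs = parse_attributes(cols[8])
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        parsed.append((attrs.get("ID"), parents))

    # Grow the set of IDs to delete until no new descendants turn up, so the
    # result does not depend on parents appearing before their children.
    doomed = set(gene_ids)
    changed = True
    while changed:
        changed = False
        for entry in parsed:
            if entry is None:
                continue
            fid, parents = entry
            if fid and fid not in doomed and any(p in doomed for p in parents):
                doomed.add(fid)
                changed = True

    kept = []
    for line, entry in zip(lines, parsed):
        if entry is not None:
            fid, parents = entry
            if fid in doomed or any(p in doomed for p in parents):
                continue
        kept.append(line)
    return kept
```

To use it, read the GFF3 with `open(path).readlines()`, read the external list into a set with one gene ID per line, and write the returned lines to a new file. Running gff3_QC on the result afterwards is a reasonable sanity check.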
The gff3toolkit doesn't have a function to flag or delete models with partial protein sequences.
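One possible workaround, sketched below under stated assumptions, is to do the check outside the toolkit: given a protein FASTA with one record per model (produced by whatever tool you use to translate the CDS features), collect the IDs whose sequence does not begin with M. The function name `ids_without_start_met` is hypothetical, and the sketch assumes the record ID is the first whitespace-delimited token of each header.

```python
def ids_without_start_met(fasta_lines):
    """Return IDs of FASTA records whose sequence does not start with M.

    Records with an empty sequence are flagged as well. Lowercase 'm' is
    accepted as methionine.
    """
    first_residue = {}  # record ID -> first residue ("" until sequence is seen)
    current_id = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current_id = header[0] if header else ""
            first_residue[current_id] = ""
        elif current_id is not None and first_residue[current_id] == "":
            # Only the first sequence line of a record matters here.
            first_residue[current_id] = line[0].upper()
    # dicts preserve insertion order, so IDs come back in file order
    return [rid for rid, residue in first_residue.items() if residue != "M"]
```

The resulting IDs are transcript or protein IDs, so they may need mapping back to gene IDs before being used as a removal list.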
Thanks, I will try it next time!
I found a way with gffread, using --nids and --keep_genes; mine was not a new problem after all.