removing models from a list

Open dcopetti opened this issue 2 years ago • 2 comments

Hello,

I would like to use the GFF3toolkit to remove some gene models (all with one isoform, taken from an external list) from a gff3 file. I first ran gff3_QC -g assembly_MAKER1.gff -f assembly.fa -o QC_report1 -s QC_stats1 and got this report:

==> QC_report <==
Line_num        Error_code      Error_level     Error_tag
['Line 1']      Esf0014 Error   ["##gff-version" missing from the first line]
['Line 15079']  Esf0012 Info    [Found 5 Ns in CDS feature of length 296 using the external FASTA, consists of 1 segment (start, length): (210940, 5)]

==> QC_stats <==
Error_code      Number_of_problematic_models    Error_level     Error_tag
Esf0014 1       Error   ##gff-version" missing from the first line
Esf0012 1       Info    Found Ns in a feature using the external FASTA

I can fix the header myself. What I wonder is how I can use gff3_fix to remove ~1500 genes (gene, mRNA, exon, and CDS lines). Is it possible to create a 4-column file to submit to -qc_r? Can I use any of the error codes that have a "delete_model" function? Is there a way to specify the gene ID instead of the line number?

Also, is there a feature to remove gene models whose protein sequence does not start with M? Thanks, Dario

dcopetti, Mar 03 '22 16:03

Hi @dcopetti - that's an interesting use case! I suppose you could hack a QC report file to get that done. The QC reports are line-based because not every feature in gff3 is required to have an ID. So you could provide the line number of each gene feature and assign it an error code that uses the delete_model function (https://github.com/NAL-i5K/GFF3toolkit/blob/master/docs/gff3_fix.py-documentation.rst). I've never tried this, but it might work.
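To make the idea concrete, here is a rough, untested-against-gff3_fix sketch in Python that looks up the line number of each listed gene and writes rows in the same four-column layout as the report above. The function name `build_qc_report`, the tag text, and the `EXXXXXX` error code are placeholders of mine, not part of GFF3toolkit; the real code has to be one that the gff3_fix documentation lists as using delete_model.

```python
def build_qc_report(gff_lines, gene_ids, error_code,
                    error_level="Error",
                    error_tag="[Model flagged for removal]"):
    """Build QC-report text with one row per listed gene feature.

    gff_lines  : iterable of lines from the GFF3 file, in file order
    gene_ids   : IDs of the genes to flag (from the external list)
    error_code : placeholder; must be a code that gff3_fix maps to delete_model
    """
    wanted = set(gene_ids)
    rows = ["Line_num\tError_code\tError_level\tError_tag"]
    # Count every line (comments and blanks included) so the numbers
    # match the 1-based line numbers of the file itself.
    for line_num, line in enumerate(gff_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if attrs.get("ID") in wanted:
            rows.append(f"['Line {line_num}']\t{error_code}\t{error_level}\t{error_tag}")
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    example = [
        "##gff-version 3\n",
        "chr1\tMAKER\tgene\t1\t100\t.\t+\t.\tID=g1\n",
        "chr1\tMAKER\tmRNA\t1\t100\t.\t+\t.\tID=g1-RA;Parent=g1\n",
    ]
    # "EXXXXXX" is a stand-in, not a real GFF3toolkit error code.
    print(build_qc_report(example, ["g1"], "EXXXXXX"), end="")
```

Whether gff3_fix accepts a hand-made report like this is exactly the part I have not verified, so try it on a small copy of the file first.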

The gff3toolkit doesn't have a function to flag or delete models with partial protein sequences.
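If you already have a protein FASTA for the models (for example, one exported by your annotation pipeline), a check like that is easy to do outside the toolkit. The sketch below is a plain-Python assumption of mine, not a GFF3toolkit feature: the function name `ids_without_start_met` is hypothetical, and it assumes each FASTA header starts with the transcript ID.

```python
def ids_without_start_met(fasta_lines):
    """Return IDs of FASTA records whose sequence does not begin with M.

    fasta_lines : iterable of lines from a protein FASTA file.
    The ID is taken as the first whitespace-delimited token of the header.
    """
    flagged = []
    current_id = None
    first_residue_seen = False
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            first_residue_seen = False
        elif current_id is not None and not first_residue_seen:
            # Only the first sequence line of a record decides the outcome.
            if not line.upper().startswith("M"):
                flagged.append(current_id)
            first_residue_seen = True
    return flagged


if __name__ == "__main__":
    demo = [">t1 example", "MKVL", ">t2", "AKVL"]
    print(ids_without_start_met(demo))
```

The resulting IDs could then be added to the same removal list as the other models.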

mpoelchau, Mar 04 '22 18:03

Thanks, I will try it next time! I found a way with gffread, using --nids and --keep-genes, so mine was not a new problem after all.

dcopetti, Mar 04 '22 18:03