genometools
gff3 sort warnings
Problem description
I am trying to sort a large (10.9 GB) GFF file for input into a genome browser for viewing. The command has been running for 6 days with warnings, no errors, and no output being written to the output file. Should I expect the sorting to take this long with such a large file? I am specifically concerned about the 'more than one' warning with the result 'join them'. What does this mean?
warning: seqid "contig_1960_pilon" on line 3783 in file "/share2/rmaher/dungeness-rad/MAKER/Dungeness.all.rnd3.gff" has not been previously introduced with a "##sequence-region" line, create such a line automatically
warning: more than one aligned_coverage attribute on line 16065 in file "/share2/rmaher/dungeness-rad/MAKER/Dungeness.all.rnd3.gff"; join them
warning: more than one aligned_identity attribute on line 16065 in file "/share2/rmaher/dungeness-rad/MAKER/Dungeness.all.rnd3.gff"; join them
Exact command line call triggering the problem
gt gff3 -sortlines -tidy /MAKER/Dungeness.all.rnd2.gff > MAKER/Dungeness.all.rnd2.sorted.gff
Example minimal input triggering the problem
What GenomeTools version are you reporting an issue for (as output by gt -version)?
gt (GenomeTools) 1.6.1
Did you compile GenomeTools from source? If so, please state the make parameters used.
Unsure.
What operating system (e.g. Ubuntu, Mac OS X), OS version (e.g. 15.10, 10.11) and platform (e.g. x86_64) are you using?
MacOS Mojave Version 10.14.6
Sorting can indeed take some time, but 6 days seems a bit long. Can you try just parsing the file with gt gff3 ... > /dev/null to see whether it can be processed completely in the first place, without sorting?
The messages about duplicate attributes refer to situations like this (line 16065 in the example file):
contig_1960_pilon protein_gff:protein2genome protein_match 39281 40511 1064 - . ID=contig_1960_pilon:hit:3459018:3.12.0.0;Name=evm.model.ctg99_pilon_pilon.7;target_length=1231;aligned_coverage=33.14;aligned_identity=100;aligned_coverage=33.14,52.99;aligned_identity=100,52.4;score=1064,1064;target_length=1231,770
In this line, the attributes aligned_coverage, aligned_identity and target_length each appear more than once. To make sense of this, the GenomeTools parser joins the occurrences of each attribute into a single attribute, with the values separated by commas. In this case the result would be:
contig_1960_pilon protein_gff:protein2genome protein_match 39281 40511 1.06e+03 - . ID=protein_match1;Name=evm.model.ctg99_pilon_pilon.7;target_length=1231,1231,770;aligned_coverage=33.14,33.14,52.99;aligned_identity=100,100,52.4;score=1064,1064
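To make the joining rule concrete, here is a minimal Python sketch of it. This is not GenomeTools' actual implementation; the function name `join_duplicate_attributes` is made up for illustration, and it only models the comma-joining of repeated keys in the attribute column (it does not rewrite the ID or reformat the score the way the real output above does).

```python
def join_duplicate_attributes(column9):
    """Join repeated attribute keys, keeping the order of first appearance.

    Illustrative only: assumes a well-formed 'key=value;key=value' string.
    """
    joined = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing ';'
        key, _, value = pair.partition("=")
        joined.setdefault(key, []).append(value)
    return ";".join(key + "=" + ",".join(values) for key, values in joined.items())


# Attribute column from line 16065, without the ID attribute.
attrs = (
    "Name=evm.model.ctg99_pilon_pilon.7;target_length=1231;"
    "aligned_coverage=33.14;aligned_identity=100;"
    "aligned_coverage=33.14,52.99;aligned_identity=100,52.4;"
    "score=1064,1064;target_length=1231,770"
)
print(join_duplicate_attributes(attrs))
```

Under these assumptions, the printed attributes after Name match the joined values shown in the line above.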
The messages are only there to notify you that GenomeTools is doing this, since it may indicate an unintended issue with your input data. It is not an error, and it is very unlikely to be the reason for your sorting runtime.
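If you want to gauge how widespread the repeated attributes are in your input, a small scan along the following lines could help. This is a rough sketch that is independent of GenomeTools: the helper name `lines_with_repeated_keys` is hypothetical, it assumes tab-separated GFF3 with nine columns, and the sample data is invented for the demonstration.

```python
import io
from collections import Counter


def lines_with_repeated_keys(handle):
    """Yield (line_number, repeated_keys) for feature lines whose
    attribute column (column 9) contains the same key more than once."""
    for number, line in enumerate(handle, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a regular feature line
        counts = Counter(
            pair.partition("=")[0] for pair in fields[8].split(";") if pair
        )
        repeated = sorted(key for key, n in counts.items() if n > 1)
        if repeated:
            yield number, repeated


# Invented two-feature sample; replace with open("your.gff") for a real file.
sample = io.StringIO(
    "##gff-version 3\n"
    "ctg1\tsrc\tprotein_match\t1\t10\t.\t-\t.\tID=a;score=1;score=2\n"
    "ctg1\tsrc\tprotein_match\t20\t30\t.\t-\t.\tID=b;score=3\n"
)
report = list(lines_with_repeated_keys(sample))
print(report)
```

Because the scan reads the file line by line, it should use little memory even on a very large input.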
10.9 GB is a huge file. Should not be a problem in principle, but does your machine have enough memory so that it doesn't start swapping?
You can also try the combination of -o and -v. That will give you a progress report in the shell.