genometools

gff3 sort warnings

Open maherrl opened this issue 5 years ago • 2 comments

Problem description

I am trying to sort a large (10.9 GB) GFF file for input into a genome browser. The command has been running for 6 days with warnings but no errors, and nothing has been written to the output file. Should I expect sorting to take this long for a file this large? I am specifically concerned about the 'more than one' warnings that end with 'join them'. What does this mean?

warning: seqid "contig_1960_pilon" on line 3783 in file "/share2/rmaher/dungeness-rad/MAKER/Dungeness.all.rnd3.gff" has not been previously introduced with a "##sequence-region" line, create such a line automatically
warning: more than one aligned_coverage attribute on line 16065 in file "/share2/rmaher/dungeness-rad/MAKER/Dungeness.all.rnd3.gff"; join them
warning: more than one aligned_identity attribute on line 16065 in file "/share2/rmaher/dungeness-rad/MAKER/Dungeness.all.rnd3.gff"; join them

Exact command line call triggering the problem

gt gff3 -sortlines -tidy /MAKER/Dungeness.all.rnd2.gff > MAKER/Dungeness.all.rnd2.sorted.gff

Example minimal input triggering the problem

Dungeness.all.rnd3.sub.txt

What GenomeTools version are you reporting an issue for (as output by gt -version)?

gt (GenomeTools) 1.6.1

Did you compile GenomeTools from source? If so, please state the make parameters used.

Unsure.

What operating system (e.g. Ubuntu, Mac OS X), OS version (e.g. 15.10, 10.11) and platform (e.g. x86_64) are you using?

MacOS Mojave Version 10.14.6

maherrl, Jul 29 '20 19:07

Sorting can indeed take some time, but 6 days seems a bit long. Can you try just parsing the file with gt gff3 ... > /dev/null to see whether it can be processed completely without sorting in the first place?

The messages about duplicate attribute refer to situations like this (line 16065 in the example file):

contig_1960_pilon       protein_gff:protein2genome      protein_match   39281   40511   1064    -       .       ID=contig_1960_pilon:hit:3459018:3.12.0.0;Name=evm.model.ctg99_pilon_pilon.7;target_length=1231;aligned_coverage=33.14;aligned_identity=100;aligned_coverage=33.14,52.99;aligned_identity=100,52.4;score=1064,1064;target_length=1231,770

In this line, the attributes aligned_coverage, aligned_identity and target_length each appear more than once. To make sense of this, the GenomeTools parser joins all occurrences of each attribute into a single comma-separated value. In this case the result would be:

contig_1960_pilon	protein_gff:protein2genome	protein_match	39281	40511	1.06e+03	-	.	ID=protein_match1;Name=evm.model.ctg99_pilon_pilon.7;target_length=1231,1231,770;aligned_coverage=33.14,33.14,52.99;aligned_identity=100,100,52.4;score=1064,1064
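To make the joining rule concrete, here is a minimal Python sketch of the idea. It is not the GenomeTools implementation, and the function name join_duplicate_attributes is made up for illustration. It only handles the attribute column and does not renumber IDs the way the gt output above does, so the ID attribute is left out of the sample.

```python
def join_duplicate_attributes(attr_column):
    """Merge repeated keys in a GFF3 attribute column, keeping first-seen order.

    Illustrative only: this mimics the joining described above, it is not
    the actual GenomeTools code.
    """
    merged = {}  # plain dicts preserve insertion order (Python 3.7+)
    for pair in attr_column.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        if key in merged:
            # Repeated key: append the new value(s) after a comma.
            merged[key] = merged[key] + "," + value
        else:
            merged[key] = value
    return ";".join(f"{key}={value}" for key, value in merged.items())


# Attribute column of line 16065 from the example file, minus the ID.
attrs = (
    "Name=evm.model.ctg99_pilon_pilon.7;target_length=1231;"
    "aligned_coverage=33.14;aligned_identity=100;"
    "aligned_coverage=33.14,52.99;aligned_identity=100,52.4;"
    "score=1064,1064;target_length=1231,770"
)
print(join_duplicate_attributes(attrs))
```

For this input the sketch yields the same attribute order and joined values as the gt output shown above, apart from the ID.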

The messages just notify you that GenomeTools is doing this, since it may indicate an unintended issue with your input data. It is not an error and is very unlikely to be the reason for your sorting runtime.
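If you want to see how widespread the duplicates are in your input before deciding whether they matter, a small streaming scan is one option. This is only a sketch under a few assumptions: feature lines are tab-separated with nine columns, and the helper name find_duplicate_attributes and the sample records are invented for illustration.

```python
from collections import Counter


def find_duplicate_attributes(lines):
    """Yield (line_number, repeated_keys) for feature lines with repeated attribute keys."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a regular feature line; ignored in this sketch
        keys = [pair.partition("=")[0] for pair in columns[8].split(";") if pair]
        repeated = [key for key, count in Counter(keys).items() if count > 1]
        if repeated:
            yield number, repeated


# Made-up sample records, not taken from the real file.
sample = [
    "##gff-version 3\n",
    "ctg1\tsrc\tprotein_match\t1\t10\t.\t-\t.\tID=a;score=1;score=2\n",
    "ctg1\tsrc\tprotein_match\t1\t10\t.\t-\t.\tID=b;score=1\n",
]
print(list(find_duplicate_attributes(sample)))
```

Because the function reads one line at a time, it should also work on an open file handle for a large file without loading the whole file into memory.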

satta, Jul 30 '20 08:07

10.9 GB is a huge file. That should not be a problem in principle, but does your machine have enough memory to avoid swapping?

You can also try the combination of -o and -v. That will give you a progress report in the shell.

gordon, Jul 30 '20 11:07