gff3sort
GFF3sort: A Perl Script to sort gff3 files and produce suitable results for tabix tools
Usage
gff3sort.pl [input GFF3 file] >output.sort.gff3
Optional Parameters:
--precise Run in precise mode, about 2X~3X slower than the default mode.
    Only needed if parent features appear after their child
    features in your original GFF3 file.
--chr_order Select how the chromosome IDs should be sorted.
    Acceptable values are: alphabet, natural, original
    [Default: alphabet]
--extract_FASTA If the input GFF3 file contains FASTA sequences at the end, use this
    option to extract them into a separate file
    with the extension '.fasta'. By default, the FASTA sequences are
    discarded.
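To illustrate what the three --chr_order values mean, here is a small sketch that orders three made-up chromosome IDs with standard tools. It does not call gff3sort.pl. It assumes that "natural" behaves like the version sort (-V) of GNU sort and that "original" means order of first appearance. The function names are hypothetical.

```shell
# Illustration only: three made-up chromosome IDs, in input order.
ids='chr10
chr2
chr1'

# alphabet: plain byte-wise comparison, so chr10 sorts before chr2.
alphabet_order() { printf '%s\n' "$1" | LC_ALL=C sort; }

# natural (assumed equivalent to sort -V): digit runs are compared
# numerically, so chr2 sorts before chr10.
natural_order() { printf '%s\n' "$1" | LC_ALL=C sort -V; }

# original (assumed meaning): keep the order of first appearance.
original_order() { printf '%s\n' "$1" | awk '!seen[$0]++'; }

alphabet_order "$ids"
natural_order "$ids"
original_order "$ids"
```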
Publication
Zhu T, Liang C, Meng Z, Guo S, Zhang R: GFF3sort: A novel tool to sort GFF3 files for tabix indexing. BMC Bioinformatics 2017, 18:482, https://doi.org/10.1186/s12859-017-1930-3
Background
The tabix tool from htslib requires input files to be sorted by chromosome and position. GFF3 files are usually sorted by column 1 (chromosome) and column 4 (start position) with:
sort -k1,1 -k4,4n myfile.gff > myfile.sorted.gff
(OR)
gt gff3 -sortlines -tidy -retainids myfile.gff > myfile.sorted.gff
The sorted GFF3 file can then be compressed and indexed with:
bgzip myfile.sorted.gff
tabix -p gff myfile.sorted.gff.gz
However, both GNU sort and the gt tool share a problem: lines with the same chromosome and start position are ordered without regard to their parent-child relationships. As a result, parent feature lines are sometimes placed after their child lines. For example, the following features:
##gff-version 3
###
A01 Cufflinks mRNA 473 6154 . - . ID=XLOC_001154.41;description=Novel: Intergenic transcript
A01 Cufflinks exon 473 814 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 1626 2574 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 2695 2721 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 5329 5408 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 5994 6154 . - . Parent=XLOC_001154.41
###
A01 Cufflinks mRNA 473 6386 . - . ID=XLOC_001154.42;description=Novel: Intergenic transcript
A01 Cufflinks exon 473 2024 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 2615 2721 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 5329 6386 . - . Parent=XLOC_001154.42
would be sorted as:
##gff-version 3
##sequence-region A01 473 6386
A01 Cufflinks exon 473 814 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 473 2024 . - . Parent=XLOC_001154.42
A01 Cufflinks mRNA 473 6154 . - . ID=XLOC_001154.41;description=Novel: Intergenic transcript
A01 Cufflinks mRNA 473 6386 . - . ID=XLOC_001154.42;description=Novel: Intergenic transcript
A01 Cufflinks exon 1626 2574 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 2615 2721 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 2695 2721 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 5329 5408 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 5329 6386 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 5994 6154 . - . Parent=XLOC_001154.41
###
That is, the two mRNA lines starting at position 473 are placed after the two exon lines that also start at position 473. This ordering can trigger bugs such as https://github.com/GMOD/jbrowse/issues/780
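One way to check whether a sorted file is affected is to look for feature lines whose parent has not yet been defined. The following is a minimal sketch, not part of GFF3sort. It assumes a tab-separated GFF3 file and a single parent per feature. The function name find_orphans is hypothetical.

```shell
# Print every feature line whose Parent ID has not been introduced by an
# earlier ID= attribute. Assumes tab-separated GFF3 with one parent per feature.
find_orphans() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }            # skip directives and incomplete lines
    {
      n = split($9, attrs, ";")
      id = ""; parent = ""
      for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^ID=/)     id = substr(attrs[i], 4)
        if (attrs[i] ~ /^Parent=/) parent = substr(attrs[i], 8)
      }
      # A child that appears before its parent is reported.
      if (parent != "" && !(parent in seen)) print
      if (id != "") seen[id] = 1
    }' "$@"
}

# Usage: find_orphans myfile.sorted.gff
```

If the function prints nothing, every child line in the file follows its parent.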
GFF3sort adjusts the order of lines that share the same start position: it moves lines with a "Parent=" attribute (case insensitive) behind lines without one. The result is:
A01 Cufflinks mRNA 473 6386 . - . ID=XLOC_001154.42;description=Novel: Intergenic transcript
A01 Cufflinks mRNA 473 6154 . - . ID=XLOC_001154.41;description=Novel: Intergenic transcript
A01 Cufflinks exon 473 814 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 473 2024 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 1626 2574 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 2615 2721 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 2695 2721 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 5329 5408 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 5329 6386 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 5994 6154 . - . Parent=XLOC_001154.41
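The tie-breaking rule described above can be approximated with standard tools. The sketch below is not the Perl implementation and does not cover the --precise mode. It assumes tab-separated input, drops comment and directive lines for simplicity, and keeps tied lines in their input order, so its output may differ from the script's in the order of tied lines. The function name sort_gff3_sketch is hypothetical.

```shell
# Sort by chromosome, then start position, then by a 0/1 flag that is 1
# when the attribute column contains "Parent=" (case insensitive).
sort_gff3_sketch() {
  tab=$(printf '\t')
  awk -F'\t' -v OFS='\t' '
    /^#/ || NF < 9 { next }            # drop directives and incomplete lines
    {
      # Prepend the flag as a temporary first column.
      flag = (tolower($9) ~ /(^|;)parent=/) ? 1 : 0
      print flag, $0
    }' "$@" |
    # Columns are shifted by one: 2 = chromosome, 5 = start, 1 = flag.
    # -s keeps the input order for lines that tie on all three keys.
    LC_ALL=C sort -s -t "$tab" -k2,2 -k5,5n -k1,1n |
    cut -f2-                           # remove the temporary flag column
}

# Usage: sort_gff3_sketch myfile.gff > myfile.sorted.gff
```

Prepending the flag as a sort key keeps the whole operation in a single pass through sort, at the cost of adding and then removing a column.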