AGAT
URL Escape Characters Converted
Describe the bug
agat_convert_sp_gff2gtf.pl removes URL escape characters in the 9th column. In my testing, it removed the URL escape character that encodes semicolons, i.e. the ; character. After running agat_convert_sp_gff2gtf.pl, occurrences of %3B are converted to ;. As I understand it, these URL encodings are used to prevent issues with parsing the GTF file later.
Is this behavior expected? Here is some documentation from your team. Please see the row about gff3 format. I already have a gff3 file (which is why the URL escape characters exist), but I would think the same rules apply to the GTF3 format. Wouldn't you want to avoid inserting a reserved delimiter character (like ';') within the value of a tag? This just makes parsing the file more of a headache later. I am not sure if the gtf3 specification outlines how to handle such edge cases, but it seems like retaining the URL escape character would be better.
I am interested to hear your thoughts.
Before (Rickettsia_rickettsii_str_iowa_gca_000017445.ASM1744v3.49.gff3): contains %3B
Chromosome ena ncRNA_gene 286157 288917 . + . ID=gene:RrIowa_0339;biotype=rRNA;description=Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA;gene_id=RrIowa_0339;logic_name=ena_rna
After (Rickettsia_rickettsii_str_iowa_gca_000017445.ASM1744v3.49.gtf): converted %3B
Chromosome ena gene 286157 288917 . + . gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene";
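To illustrate the parsing concern, here is a minimal sketch of what happens when the converted attribute column is split on every semicolon. The `naive_split` helper is hypothetical and only stands in for a downstream parser that ignores quotes; it is not part of AGAT.

```python
# Attribute column from the converted GTF line above (trailing ';' dropped for brevity).
attributes = (
    'gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; '
    'description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; '
    'logic_name "ena_rna"; original_biotype "ncrna_gene"'
)

def naive_split(column9):
    """Split on every ';' without regard to quotes (hypothetical naive parser)."""
    return [field.strip() for field in column9.split(';') if field.strip()]

fields = naive_split(attributes)
# The description value is cut into three pieces, so there are
# 8 fields instead of the 6 key/value pairs actually present.
print(len(fields))
for field in fields:
    print(field)
```

With %3B retained in the description, the same split would yield the expected 6 fields.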
General (please complete the following information):
- AGAT version: 0.8.0
- Installed using singularity (from quay.io): see below
- OS: CentOS
To Reproduce
I would just insert that character in a gff3 file you have and then run the following:
# Steps for converting messy gff into properly formatted GTF file
# 1. Pull image from registry and create SIF
# module load singularity
SINGULARITY_CACHEDIR=$PWD singularity pull \
docker://quay.io/biocontainers/agat:0.8.0--pl5262hdfd78af_0
# 2. Run AGAT to do the heavy lifting of gtf conversion
singularity exec -B $PWD \
agat_0.8.0--pl5262hdfd78af_0.sif agat_convert_sp_gff2gtf.pl \
--gff input.gff \
-o converted.gtf
If you would like, I can provide you with the exact gff3 I am using. Please let me know what you think.
Expected behavior
I am not sure if this is expected behavior or not based on the specification of gtf3. Maybe there is no guidance, and we live in the wild, wild west.
Here is some code to convert semicolons within quotes back into URL escape characters:
tmp = 'gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene"'

# Assumes the quote character in the 9th column is a double quote or <"> character. This is the
# correct character to use based on the specification. More information can be found here:
# https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/gxf.md#main-points-and-differences-between-gtf-formats
def url_escape_inside_quotes(line, delimiter=';', url_encoding='%3B'):
    quote_count = 0
    inside_quotes = False
    fixed = ''
    for c in line:
        if c == '"':
            # Entered the opening or closing border of
            # a quote, increase the counter and
            # check where we are in the string
            quote_count += 1
            inside_quotes = True
            if quote_count > 1:
                # Reached end border of quote,
                # reset boolean flag and counters
                inside_quotes = False
                quote_count = 0
        if inside_quotes:
            # Replace reserved delimiter with
            # another character, let's use a
            # url encoding of the character
            if c == delimiter:
                c = url_encoding
        # Add the existing/converted character
        fixed += c
    return fixed

# gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene"
print(url_escape_inside_quotes(tmp))
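For whole files, the same idea can be written more compactly with a regular expression and applied only to the 9th column of each line. This is a sketch under assumptions: the function names are made up for illustration, columns are assumed to be tab-separated, and values are assumed to be double-quoted without embedded quote characters.

```python
import re

def escape_semicolons_in_quotes(column9, replacement='%3B'):
    """Replace ';' with `replacement`, but only inside double-quoted values."""
    return re.sub(
        r'"[^"]*"',
        lambda match: match.group(0).replace(';', replacement),
        column9,
    )

def escape_gtf_line(line, replacement='%3B'):
    """Apply the escaping to the 9th column of a tab-separated GTF line."""
    stripped = line.rstrip('\n')
    columns = stripped.split('\t')
    if stripped.startswith('#') or len(columns) < 9:
        # Comment lines and lines without 9 columns pass through unchanged
        return stripped
    columns[8] = escape_semicolons_in_quotes(columns[8], replacement)
    return '\t'.join(columns)

example = '\t'.join([
    'Chromosome', 'ena', 'gene', '286157', '288917', '.', '+', '.',
    'gene_id "RrIowa_0339"; description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA";',
])
print(escape_gtf_line(example))
```

Semicolons that separate attributes sit outside the quotes, so they are left alone.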
In GFF3, URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape.
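The rule quoted above could be sketched roughly as follows. The helper names are hypothetical, and escaping '%' itself is an addition of this sketch (so that decoding stays unambiguous); this is not AGAT's actual code.

```python
from urllib.parse import unquote

# Characters named in the quoted rule, plus '%' itself
# (the '%' entry is an assumption of this sketch).
RESERVED = {'%': '%25', ',': '%2C', '=': '%3D', ';': '%3B', '\t': '%09'}

def escape_gff3_value(value):
    """Percent-encode only the reserved characters; spaces are left as-is."""
    return ''.join(RESERVED.get(ch, ch) for ch in value)

def unescape_gff3_value(value):
    """Decode any %XX escape back to its character."""
    return unquote(value)

description = 'Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA'
escaped = escape_gff3_value(description)
print(escaped)                       # Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA
print(unescape_gff3_value(escaped))  # Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA
```

Decoding, as in `unescape_gff3_value`, is the step that turns %3B back into ; during conversion.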
The piece of code dealing with that in AGAT is the same for GFF and GTF, so I will try to fix that. GTF does not have any official rule about it. Since GTF quotes textual values, it is not a problem whether they are escaped or not.
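A parser that respects the quotes can read such values either way. Here is a minimal sketch, assuming every value is double-quoted and contains no embedded quote characters; unquoted values, which some GTF files use for numbers, are not handled. The pattern and function name are illustrative only.

```python
import re

# Matches: key "value"; where the value ends at the closing quote,
# so it may contain ';' without being split.
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"\s*;?')

def parse_gtf_attributes(column9):
    """Return a list of (key, value) pairs from a GTF attribute column."""
    return ATTRIBUTE_PATTERN.findall(column9)

column9 = (
    'gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; '
    'description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; '
    'logic_name "ena_rna"; original_biotype "ncrna_gene";'
)

pairs = parse_gtf_attributes(column9)
for key, value in pairs:
    print(key, '=>', value)
```

The description comes back as a single value with its semicolons intact. Tools that split on ';' first would still need the escaped form.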
Okay, that sounds good @Juke34.
Thank you for taking the time to look deeper into this issue. I appreciate it!