URL Escape Characters Converted

Open skchronicles opened this issue 2 years ago • 3 comments

Describe the bug: agat_convert_sp_gff2gtf.pl decodes URL escape characters in the 9th column. In my testing, it decoded the URL escape that encodes a semicolon (the ; character): after running agat_convert_sp_gff2gtf.pl, occurrences of %3B are converted to ;. As I understand it, these URL encodings are used to prevent issues with parsing the file later.

Is this behavior expected? Here is some documentation from your team. Please see the row about the gff3 format. I already have a gff3 file (which is why the URL escape characters exist), but I would think the same rules apply to the GTF3 format. Wouldn't you want to avoid inserting a reserved delimiter character (like ';') within the value of a tag? This just makes parsing the file more of a headache later. I am not sure if the gtf3 specification outlines how to handle such edge cases, but it seems like retaining the URL escape character would be better.

I am interested to hear your thoughts.

Before (Rickettsia_rickettsii_str_iowa_gca_000017445.ASM1744v3.49.gff3): contains %3B

Chromosome	ena	ncRNA_gene	286157	288917	.	+	.	ID=gene:RrIowa_0339;biotype=rRNA;description=Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA;gene_id=RrIowa_0339;logic_name=ena_rna

After (Rickettsia_rickettsii_str_iowa_gca_000017445.ASM1744v3.49.gtf): converted %3B

Chromosome	ena	gene	286157	288917	.	+	.	gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene";
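To illustrate the parsing concern, here is a minimal sketch of what happens when a downstream tool splits the converted attribute column on every semicolon. The `naive_split` helper is hypothetical and only stands in for a simple parser that ignores quotes; it is not taken from AGAT or any other tool.

```python
# Attribute column (9th column) from the converted GTF line above.
attributes = (
    'gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; '
    'description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; '
    'logic_name "ena_rna"; original_biotype "ncrna_gene";'
)


def naive_split(column9):
    """Split on every semicolon, ignoring quotes (hypothetical naive parser)."""
    return [field.strip() for field in column9.split(";") if field.strip()]


fields = naive_split(attributes)

# Six tags were written, but the description value is cut into three pieces,
# so the naive parser reports more fields than there are tags.
print(len(fields))
for field in fields:
    print(field)
```

With the %3B escapes retained, the same split would return one field per tag.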

General (please complete the following information):

  • AGAT version: 0.8.0
  • Installed using singularity (from quay.io): see below
  • OS: CentOS

To Reproduce: insert a %3B escape into the 9th column of any gff3 file you have, then run the following:

# Steps for converting messy gff into properly formatted GTF file
# 1. Pull image from registry and create SIF
# module load singularity 
SINGULARITY_CACHEDIR=$PWD singularity pull \
    docker://quay.io/biocontainers/agat:0.8.0--pl5262hdfd78af_0 

# 2. Run AGAT to do the heavy lifting of the GTF conversion
singularity exec -B $PWD \
    agat_0.8.0--pl5262hdfd78af_0.sif agat_convert_sp_gff2gtf.pl \
        --gff input.gff \
        -o converted.gtf

If you would like, I can provide you with the exact gff3 I am using. Please let me know what you think.

Expected behavior: I am not sure whether this is expected behavior under the gtf3 specification. Maybe there is no guidance, and we live in the wild, wild west.

skchronicles, May 06 '22 17:05

Here is some code to convert semicolons within quotes back into URL escape characters:

tmp = 'gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene"'

# Assumes the quote character in the 9th column is a double quote (") character. This is the
# correct character to use based on the specification. More information can be found here:
# https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/gxf.md#main-points-and-differences-between-gtf-formats
def url_escape_inside_quotes(line, delimiter=';', url_encoding = '%3B'):
    quote_count = 0
    inside_quotes = False
    fixed = ''
    for c in line:
        if c == '"':
            # Reached the opening or closing border
            # of a quote, increase the counter and
            # check where we are in the string
            quote_count += 1
            inside_quotes = True

            if quote_count > 1:
                # Reached end border of quote,
                # reset boolean flag and counters
                inside_quotes = False
                quote_count = 0

        if inside_quotes:
            # Replace reserved delimiter with
            # another character, let's use a 
            # url encoding of the character
            if c == delimiter:
                c = url_encoding

        # Add the existing/converted character 
        fixed += c
    
    return fixed 

# gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene"
print(url_escape_inside_quotes(tmp)) 
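The function above works on the attribute string alone. As a rough sketch of how the same idea could be applied to a whole GTF line, the variant below swaps the character loop for a regular expression and touches only the 9th column. The name `escape_column9` is made up for this example, and it assumes tab-separated lines with exactly nine columns and values that contain no embedded double quotes.

```python
import re

# A double-quoted value; assumes values never contain an embedded double quote.
QUOTED = re.compile(r'"[^"]*"')


def escape_column9(gtf_line, delimiter=";", url_encoding="%3B"):
    """Re-escape delimiters inside quoted values of the 9th column only."""
    body = gtf_line.rstrip("\n")
    ending = gtf_line[len(body):]  # keep the original line ending, if any
    columns = body.split("\t")
    # Leave comment lines and anything that is not a 9-column record untouched.
    if body.startswith("#") or len(columns) != 9:
        return gtf_line
    columns[8] = QUOTED.sub(
        lambda match: match.group(0).replace(delimiter, url_encoding),
        columns[8],
    )
    return "\t".join(columns) + ending


line = (
    "Chromosome\tena\tgene\t286157\t288917\t.\t+\t.\t"
    'gene_id "RrIowa_0339"; '
    'description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA";\n'
)
print(escape_column9(line), end="")
```

Semicolons that separate attributes sit outside the quotes, so they are left alone.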

skchronicles, May 06 '22 18:05

In GFF3, URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape.
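A minimal sketch of that rule in Python, for illustration only. The function names are hypothetical and this is not AGAT's code. The table covers just the characters named above plus the tab; escaping "%" itself is an added assumption so that escaping followed by unescaping returns the original text.

```python
from urllib.parse import unquote

# Characters named in the rule above, plus tab. "%" is included as an
# assumption of this sketch so that an escaped value can be decoded safely.
RESERVED = {"%": "%25", "\t": "%09", ",": "%2C", "=": "%3D", ";": "%3B"}


def escape_gff3_value(value):
    """Percent-encode reserved characters in a GFF3 tag or value."""
    return "".join(RESERVED.get(ch, ch) for ch in value)


def unescape_gff3_value(value):
    """Decode percent-escapes back to the literal characters."""
    return unquote(value)


description = "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"
escaped = escape_gff3_value(description)
print(escaped)
print(unescape_gff3_value(escaped) == description)
```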

The piece of code dealing with that in AGAT is the same for GFF and GTF, so I will try to fix that. GTF does not have any official rule about it. Because GTF quotes textual values, it is not a problem whether the character is escaped or not.
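As a sketch of why quoting makes the escape optional, a parser that respects the double quotes reads the unescaped semicolons as part of the value. The pattern and the `parse_gtf_attributes` name below are assumptions made for this example (not AGAT's parser), and the pattern assumes values contain no embedded double quotes.

```python
import re

# One attribute: a tag, whitespace, then a double-quoted value that may itself
# contain semicolons, followed by an optional terminating semicolon.
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;?')


def parse_gtf_attributes(column9):
    """Return (tag, value) pairs in order; a list is used since tags can repeat."""
    return ATTRIBUTE.findall(column9)


column9 = (
    'gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; '
    'description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; '
    'logic_name "ena_rna"; original_biotype "ncrna_gene";'
)
pairs = parse_gtf_attributes(column9)
for tag, value in pairs:
    print(tag, "=>", value)
```

Tools that split the column on every semicolon without tracking quotes would still break on such a value, which is the case the escape protects against.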

Juke34, May 17 '22 13:05

Okay, that sounds good @Juke34.

Thank you for taking the time to look deeper into this issue. I appreciate it!

skchronicles, May 17 '22 20:05