
over url-encoding in attribute fields

Open bobbyo opened this issue 11 years ago • 5 comments

While trying to add a Name=value field to my data and have GFFOutput.py write it, I found that the value is being fully URL-encoded, which differs from the GFF3 specification. In my case, an attribute like NAME=jgi.p|Schco3|1037802 ends up URL-encoded as NAME=jgi.p%7CSchco3%7C1037802, which causes problems with our downstream data use. I believe these characters should not be escaped according to the GFF3 standard. The GFF3 standard v1.21 says:

URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape.  -- http://www.sequenceontology.org/gff3.shtml 

So the rule seems to be:

  1. An attribute key or value should be fully URL-escaped when it contains any of ",=;".
  2. TAB characters in an attribute key or value should always be escaped (as %09), but the presence of a TAB does not trigger full URL encoding of that key or value.

The attribute key and value in NAME=jgi.p|Schco3|1037802 do not contain ",=;". Hence this should not be escaped.
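To make the two rules concrete, here is a minimal sketch of the escaping as I read it. It is not the code in GFFOutput.py: the helper name `escape_gff3_attribute` and the constant `GFF3_TRIGGER_CHARS` are hypothetical, and it uses Python 3's `urllib.parse.quote` as a stand-in for the `urllib.quote` call. Note that "full" URL escaping here also encodes spaces, which the spec says are allowed, so this is one interpretation rather than a definitive implementation.

```python
from urllib.parse import quote

# Characters that, per the reading of the spec above, trigger URL escaping.
GFF3_TRIGGER_CHARS = ",=;"


def escape_gff3_attribute(text):
    """Escape one attribute key or value (hypothetical helper, illustration only)."""
    text = str(text)
    if any(ch in text for ch in GFF3_TRIGGER_CHARS):
        # Rule 1: the text contains ',', '=' or ';', so URL-escape all of it.
        # safe="" means even '/' is escaped.
        return quote(text, safe="")
    # Rule 2: otherwise only TAB characters are replaced; '|' and spaces stay as-is.
    return text.replace("\t", "%09")


# Unconditional quoting, for comparison: this reproduces the reported output.
print(quote("jgi.p|Schco3|1037802"))                  # jgi.p%7CSchco3%7C1037802
print(escape_gff3_attribute("jgi.p|Schco3|1037802"))  # jgi.p|Schco3|1037802
print(escape_gff3_attribute("a=b"))                   # a%3Db
```

With this approach the NAME value from my data is written unchanged, while a value containing "=" is still escaped.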

Do you agree? Would you like a patch to GFFOutput.py that provides a routine following those rules for escaping values?

bobbyo avatar Apr 18 '14 22:04 bobbyo

Bobby; That would be great. I wish the spec had a more consistent and standard quoting approach instead of something custom, hence my use of urllib.quote/unquote. If it's causing issues with downstream tools, it would make sense to clean it up and I'd be happy to accept a patch. Sorry about the issues and thanks for looking at this.

chapmanb avatar Apr 21 '14 15:04 chapmanb

Here is a patch; feel free to tighten/modify as you wish.

The GFF3 standard seems to make the encoding a bit tough to use: how does one know when URL-encoding-like procedures have been applied? For example, I'm not clear on how you know for certain to use URL-decoding when reading the GFF3 data back in. But this patch does apply the *encoding* that the GFF3 standard seems to be requesting. I confess that in the case that caused me to write the patch, the standard suggests the data should not be encoded, and that is the use case I tested.
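As a small illustration of the decoding question, here is a sketch using Python 3's `urllib.parse.unquote` (an assumption on my part; the reader side of the library may do something different). Decoding is harmless for values that were never escaped and contain no `%`, but a value with a literal `%` sequence that was written without escaping cannot be told apart from an escaped one.

```python
from urllib.parse import unquote

# A value that was written unescaped (no ',', '=' or ';') decodes to itself.
plain = unquote("jgi.p|Schco3|1037802")

# A value that was fully escaped on output decodes back to the original.
escaped = unquote("a%3Db")

# Ambiguous case: the raw text "100%09" contains none of ",=;" and no TAB,
# so it would be written as-is, yet decoding turns it into "100" plus a TAB.
ambiguous = unquote("100%09")

print(repr(plain), repr(escaped), repr(ambiguous))
```

So always decoding on read works for the common cases, but it is not a strict inverse of the conditional encoding.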

Best, Bobby O


Robert P Otillar, PhD Bioinformatics Analyst Joint Genome Institute Genomic Annotation Division 2800 Mitchell Drive Walnut Creek, CA 94598 Tel: 925-296-5786 Fax: 925-296-5752

[email protected]

bobbyo avatar May 09 '14 22:05 bobbyo

Bobby; Thanks much for looking at this. I didn't see a patch in your reply. Could you send a pull request, or post the patch as a Gist? Thanks again.

chapmanb avatar May 10 '14 18:05 chapmanb

Sorry; oddly, I did see it attached to my earlier email. Here it is again; I definitely see it attached to this email, as attachment:

GFFOutput.col9_encoding_fix.patch (2k)

Let me know if it does not come through.

-B


bobbyo avatar May 15 '14 00:05 bobbyo

Bobby; These e-mails come in as GitHub issue comments, and it looks like they remove attachments so I'm not getting it. You can see them on the issue page:

https://github.com/chapmanb/bcbb/issues/86

A Gist (https://gist.github.com/) with the patch is probably the best approach. Thanks again.

chapmanb avatar May 15 '14 09:05 chapmanb