api icon indicating copy to clipboard operation
api copied to clipboard

URI encoding in Content State

Open azaroth42 opened this issue 1 year ago • 5 comments

The current 1.0 spec is non-deterministic about embedded URI decoding in the plain URI case.

The encoding requirement is:

When the content state is a plain URI, rather than a JSON object, it MUST NOT be content-state-encoded.

However it doesn't give an algorithm as how to interpret the URIs embedded in the query parameter in terms of URI encoding.

The previous decision for the proposed spec was to encode them (cf discovery#90) and then was changed to URI encode them after TRC feedback, but that URI encoding requirement never seems to have made it into the spec.

Currently I believe it is non-deterministic as to how a client should interpret the payload: iiif-content=https://example.org/manifest/%2550 -- is it %50 (1 encoding) or P (2 encodings) or %2550 (0 encodings) on the end?

azaroth42 avatar Apr 24 '24 20:04 azaroth42

I don't think there is any uncertainty about number of encoding/decoding actions. Per https://datatracker.ietf.org/doc/html/rfc3986#section-2.4 encoding SHOULD be done once when the URI is assembled from its component parts, and decoding MUST be done once when the URI is disassembled into its component parts (see also https://datatracker.ietf.org/doc/html/rfc3986#section-7.3). Attempts to do encoding/decoding separately from the URI assembly/disassembly steps will, in the general case, lead to errors.

Let's work though an example: If I build a URI from host1, path1 and query1 (leaving out fragment, assuming https scheme), I create a URI1 = https://ENC(host1)/ENC(path1)?ENC(query1) where ENC is my encoding function (omitting for now that ENC may be different for different components because they have different allowed characters).

Now say that I then build another URI that references URI1 in its query, with components host2, path2 and query iiif-content=URI1. I use the same procedure and get UR2 = https://ENC(host2)/ENC(path2)?iiif-content=ENC(https://ENC(host1)/ENC(path1)?ENC(query1)) -- there is double encoding of the embedded URI1 components and there must be double decoding when we reverse the two assembly processes.

Thus, to take Rob's example, if I am given a Content State URL:

https://example.com/viewer?iiif-content=https://example.org/manifest/%2550

this means (decode as we split) https://example.com/viewer (no change on decoding) should open https://example.org/manifest/%50 (%25 -> % on decode) which will then be interpreted as https//example.org/manifest/P (%50 -> P on decode) when split into its components.

(In this case we could just as well have used https://example.com/viewer?iiif-content=https://example.org/manifest/P or https://example.com/viewer?iiif-content=https://example.org/manifest/%50 because the P (like the / and . in the manifest URI) didn't need encoding)

zimeon avatar Apr 24 '24 22:04 zimeon

To put the above another way: We don't get a choice of specifying when something should or should not be URI encoded, it is already part of the URI spec (albeit with great latitude allowing much more than the bare minimum to be encoded if one chooses). We can add helpful reminders though.

zimeon avatar Apr 24 '24 23:04 zimeon

To work a more normal case (I chose %2550 as an easy one to read): a manifest and viewer that themselves have query params.

So, the manifest's URI is: https://example.org/builder?id=1&fmt=iiif The viewer's URI is https://example.com/viewer?fmt=iiif

We construct the parameter first: https://ENC(example.org)/ENC(builder)?ENC(id=1&fmt=iiif) --> https://example.org/builder?id=1%26fmt=iiif

(the & is part of the query part of the http URI syntax, so is encoded by the ? is part of the definition, and so is not, per https://datatracker.ietf.org/doc/html/rfc3986#section-3)

Then we add the query param iiif-content to the viewer URI to get https://example.com/viewer?fmt=iiif&iiif-content=

We then add the encoded param: https://example.com/viewer?fmt=iiif&iiif-content=https://example.org/builder?id=1%26fmt=iiif

And encode:

https://ENC(example.com)/ENC(viewer)?ENC(fmt=iiif&iiif-content=https://example.org/builder?id=1%26fmt=iiif)

with a result of

https://example.com/viewer?fmt=iiif%26iiif-content=https://example.org/builder%3Fid=1%2526fmt=iiif

Assuming that =, : and / are all safe, but ? and & are not.

If that's correct, and even if we're just following the rules, I think having that spelled out in no uncertain terms is necessary for folks to end up at the same conclusion.

azaroth42 avatar Apr 25 '24 01:04 azaroth42

@azaroth42 - I agree with the above except the encoding of & as %26 when building the manifest URI. I was sloppy when I suggested ENC(query) above, sorry!

This is not part of RFC3986 URI syntax requirements -- they generally would allow & to be unencoded and usually encoded would also be OK (I copied the relevant BNF syntax rules in https://github.com/IIIF/trc/issues/125#issuecomment-2075872243, & is in sub-delims).

However, the key1=value1&key2=value2 format for the query makes a distinction between & and = that might be in keys and values, and instances where they are part of the structure. I think this is defined by https://url.spec.whatwg.org/#application/x-www-form-urlencoded, that is what Python3's urllib.parse references). The decoding algorithm has: split on & then for each chunk split on first = and then decode/unquote the resulting key and value strings.

>>> import urllib
>>> urllib.parse.parse_qsl('id=1%26fmt=iiif')    # broken
[('id', '1&fmt=iiif')]
>>> urllib.parse.parse_qsl('id=1&fmt=iiif')      # good
[('id', '1'), ('fmt', 'iiif')]
>>> urllib.parse.parse_qsl('ke%3Dy1=val%3Due1&ke%26y2=val%26ue2&key3=second=unencoded') # complex!
[('ke=y1', 'val=ue1'), ('ke&y2', 'val&ue2'), ('key3', 'second=unencoded')]

To rerun your example:

  • Manifest URI encodes to: https://example.org/builder?id=1&fmt=iiif (no %)
  • Build viewer URI as https://example.com/viewer?fmt=iiif&iiif-content= + ENC(manifest-uri)
  • Where an almost minimal encoding would be https://example.com/viewer?fmt=iiif&iiif-content=https://example.com/builder?id%261%3Dfmt%26iiif
>>> import urllib
>>> p = urllib.parse.urlsplit('https://example.com/viewer?fmt=iiif&iiif-content=https://example.com/builder?id%261%3Dfmt%26iiif')
>>> p
SplitResult(scheme='https', netloc='example.com', path='/viewer', query='fmt=iiif&iiif-content=https://example.com/builder?id%261%3Dfmt%26iiif', fragment='')
>>> urllib.parse.parse_qsl(p.query)
[('fmt', 'iiif'), ('iiif-content', 'https://example.com/builder?id&1=fmt&iiif')]

(for fully minimal encoding, which I don't recommend, the additional = in the manifest URI don't need to be encoded because the split is on the first occurrence...

>>> urllib.parse.parse_qsl(urllib.parse.urlsplit('https://example.com/viewer?fmt=iiif&iiif-content=https://example.com/builder?id=1%3Dfmt=iiif').query)
[('fmt', 'iiif'), ('iiif-content', 'https://example.com/builder?id=1=fmt=iiif')]

)

zimeon avatar Apr 26 '24 00:04 zimeon

I'm convinced :) I think the above recipe should be published, preferably in the next version of the spec, and before then as a cookbook recipe.

azaroth42 avatar Apr 27 '24 13:04 azaroth42

Talk about URI encoding in section 3.1 where URIs contain URIs. (still agree that it's okay, just should be explicit)

azaroth42 avatar Mar 17 '25 15:03 azaroth42