spec icon indicating copy to clipboard operation
spec copied to clipboard

define URI_ENCODE

Open ghost opened this issue 10 years ago • 71 comments

URI_ENCODE is not defined in the specification. It probably refers to this: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURI but this needs to be made explicit.

Also: it should be made explicit whether or not the server needs to decode this, and how, and what should be shown in folder listings, the encoded or decoded variant.

ghost avatar Oct 30 '15 10:10 ghost

JS behavior:

encodeURI('http://www.example.org/foo/bar/baz foo.txt');

/*
http://www.example.org/foo/bar/baz%20foo.txt
*/

decodeURI('http://www.example.org/foo/bar/baz%20foo.txt');

/*
http://www.example.org/foo/bar/baz foo.txt
*/

decodeURI('http://www.example.org/foo/bar/baz%2ffoo.txt');
/*
http://www.example.org/foo/bar/baz%2ffoo.txt
*/

ghost avatar Oct 30 '15 15:10 ghost

That looks correct.

raucao avatar Oct 30 '15 16:10 raucao

So ideally we describe the (complete) mapping between what is supposed to go out of the client, and what is supposed to be in the directory listing and describe allowed/disallowed value.

For example, what happens when the client sends: https://storage.example/foo/bar%2fbaz.txt? Should this be represented as https://storage.example/foo/bar/baz.txt or rejected with a 400? What happens when the client send foo%2e.txt? Represented as foo.txt? Etc.

ghost avatar Nov 03 '15 10:11 ghost

None of those choices are really enforcable. Depending on which proxies you install (reverse-proxies or otherwise), you'll get different behavior that isn't controllable by the server.

IMO if you decide on one, it should come with a very strong recommendation for clients to not rely on this behavior. But if you specify behavior which can't be relied on, why specify it in the first place?

untitaker avatar Nov 03 '15 18:11 untitaker

All very true!

What was the rationale to have this URI_ENCODE requirements in clients in the first place? Also, the client should possibly be able to "understand" encoded entries in directory listings... I guess that would solve all issues?

Update: or maybe clients do not need to understand, they should just not fiddle with entries from the directory listing, but use them as-is without calling URI_ENCODE again...

ghost avatar Nov 03 '15 18:11 ghost

Also, the client should possibly be able to "understand" encoded entries in directory listings... I guess that would solve all issues?

You mean the client should be able to deal with both encoded and unencoded entries? Having to do that in practice would be a worst-case scenario, but I don't think we should require it...

Update: or maybe clients do not need to understand, they should just not fiddle with entries from the directory listing, but use them as-is without calling URI_ENCODE again...

The 5apps storage doesn't seem to urlencode unsafe characters in directory listings. I.e., it gives me this listing:

{
    "@context": "...",
    "items": {"äää": ...}
}

but on PUT requests I obviously have to encode äää (and do).

untitaker avatar Nov 04 '15 18:11 untitaker

var url = "http://localhost/a b c.jpg";
var url = encodeURI("http://localhost/a b c.jpg");

Using XMLHttpRequest() with those URLs results in the exact same request. The space gets encoded to %20.

Now,

var url = "http://localhost/a b%20c.jpg";
var url = encodeURI("http://localhost/a b%20c.jpg");

Results in two different request URIs: http://localhost/a%20b%20c.jpg and http://localhost/a%20b%2520c.jpg (double encoding in the second scenario).

To sum up: the specification requests that the client uses URI_ENCODE, which is of course necessary! But the browser already does that, so the library should not do that again...

If URI_ENCODE is mentioned in the spec then it needs to make explicit which encoding it is, talk about double encoding and define a mapping between request URL (from the client) to how the server should interpret it, e.g. how to show it in directory listing.

ghost avatar Nov 04 '15 18:11 ghost

But the browser already does that, so the library should not do that again...

The browser is part of the API client. Whether we call encodeURI or the browser doesn't matter as long as it results in a correctly formatted HTTP request.

If URI_ENCODE is mentioned in the spec then it needs to make explicit which encoding it is, talk about double encoding and define a mapping between request URL (from the client) to how the server should interpret it, e.g. how to show it in directory listing.

We might as well drop any mention of URI_ENCODE from the spec, and clearly specify that nodenames in folder listings should not be URL encoded. That nodenames have to be urlencoded in URLs is already required by other rfcs.

untitaker avatar Nov 04 '15 18:11 untitaker

The browser is part of the API client. Whether we call encodeURI or the browser doesn't matter as long as it results in a correctly formatted HTTP request.

No it is not. The client could also be cURL, which does not URL encode by itself, which is actually funny:

[fkooman@localhost ~]$ curl -v "http://localhost/foo bar baz.txt"
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /foo bar baz.txt HTTP/1.1
> User-Agent: curl/7.40.0
> Host: localhost
> Accept: */*

We might as well drop any mention of URI_ENCODE from the spec, and clearly specify that nodenames in folder listings should not be URL encoded.

That would be nice! But again, what is URL encoded? it has to be made explicit somewhere...

That nodenames have to be urlencoded in URLs is already required by other rfcs.

Which ones? I guess you are right, but referencing them would be nice :)

ghost avatar Nov 04 '15 19:11 ghost

Which ones? I guess you are right, but referencing them would be nice :)

https://tools.ietf.org/html/rfc3986

You just can't have spaces in paths.

URL-encoding is also specified in that RFC.

untitaker avatar Nov 04 '15 19:11 untitaker

No it is not. The client could also be cURL, which does not URL encode by itself

Right, such a request would violate the spec.

Why would the server care whether the browser or the JS code URL-encodes? Either result in valid requests => we have a conformant client.

untitaker avatar Nov 04 '15 19:11 untitaker

From that RFC:

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

rs.js and browser both encode, so it is done twice in the client.

ghost avatar Nov 04 '15 19:11 ghost

The browser only does that on broken URLs, as you can clearly see from the first example you posted. It's not a simple URI-encode, it's decode-and-encode. Or something more complicated.

In the second example the first URI is fixed up again by replacing the unsafe space with %20, the second URI is just broken.

untitaker avatar Nov 04 '15 19:11 untitaker

You lost me. The whole point is to send "non-broken" URLs in requests, why else would you encode? Just for the fun of it?

ghost avatar Nov 04 '15 19:11 ghost

var url = "http://localhost/a b c.jpg";
var url = "http://localhost/a%20b%20c.jpg";

They result in the same request, but for different reasons.

  1. The first URL is broken. The browser sees that spaces are not allowed in paths, and fixes it.
  2. The second URL is completely correct. The browser doesn't attempt to fix it up and sends it as-is.

This isn't a simple URL-encode done by the browser, otherwise the second URL would be doubly encoded. It isn't, otherwise you'd notice for sure.

It's much more closer to "url-encode all unsafe characters, except for %".

This isn't some algorithm that's specified in an RFC, it's something browsers came up with to work around defective websites.

untitaker avatar Nov 04 '15 19:11 untitaker

Great explanation. I just deleted my half-written worse one. :)

raucao avatar Nov 04 '15 19:11 raucao

Ah, so the browser leaves properly encoded URLs alone and does not modify them anymore, so there is only one encoding in that situation. Cool :)

Now we only need to define what URI_ENCODE does exactly (rs.js uses encodeURI) and how to reverse that encoding.

ghost avatar Nov 04 '15 20:11 ghost

@fkooman ok, can you take care of making this clearer in the spec?

michielbdejong avatar Nov 06 '15 08:11 michielbdejong

@michielbdejong sure, do you have any additional thoughts on this issue?

ghost avatar Nov 06 '15 08:11 ghost

(And maybe a hint where URI_ENCODE came from. ;))

raucao avatar Nov 06 '15 08:11 raucao

We had a long discussion about that in 2011, about how "a b c.jpg" and "a%20b%20c.jpg" are the same resource on the server, and especially also how a%2Fb.jpg and a%2fb.jpg are the same resource, but a%2Fb.jpg and a%2FB.jpg are not. It's all pretty obvious when you realize that the server should only think about the URL paths after decoding them.

Seems URI_ENCODE is defined in RFC3986? https://en.wikipedia.org/wiki/Percent-encoding#Current_standard

michielbdejong avatar Nov 06 '15 08:11 michielbdejong

how a%2Fb.jpg and a%2fb.jpg are the same resource, but a%2Fb.jpg and a%2FB.jpg are not.

That much is already obvious. They're the same hex value. Confusion was more regarding a/b%2f/c vs a%2fb%2fc, or a@b vs a%40b. Nginx treats those as the same value, Apache doesn't. I'm not really sure whether we need to clarify this kind of behavior in the remoteStorage RFC.

Seems URI_ENCODE is defined in RFC3986?

Not with that exact name at least.

untitaker avatar Nov 06 '15 16:11 untitaker

Slashes have a function in our URLs, ampersands don't. So I think a/b%2f/c and a%2fb%2fc are both allowed and different, a%40b is also allowed, and behavior for a@b is undefined, right?

michielbdejong avatar Nov 17 '15 13:11 michielbdejong

I think what would be really helpful is to have a mapping between the request URI and how it shows up in the folder listing.

This can be done by pointing to a definition of URI_ENCODE/DECODE and be done with it once and for all :)

ghost avatar Nov 17 '15 13:11 ghost

Item names MAY contain all characters except '/' and the null character

So we don't have to worry about how to deal with URLs that contain encoded slashes, it will always be clear where each item name starts and stops. Let's choose the decodeURIComponent JavaScript function as the official way to determine which item name is meant by a certain string between two slashes, or after the last slash.

michielbdejong avatar Nov 17 '15 14:11 michielbdejong

The server should probably respond with something like a 400 if the URL refers to item names that contain slashes or null characters, because those are not allowed in item names.

michielbdejong avatar Nov 17 '15 14:11 michielbdejong

Or we can just say the server's behavior is undefined in this case.

michielbdejong avatar Nov 17 '15 14:11 michielbdejong

Can we make this list explicit? %2F, %2f, /, %00?

ghost avatar Nov 17 '15 14:11 ghost

%2F, %2f, %00?

ghost avatar Nov 17 '15 14:11 ghost

check if request URI contains any of %2F, %2f, %00
    return 400;
split request uri (after storage root) on '/'.
foreach component -> decodeURIcomponent

ghost avatar Nov 17 '15 14:11 ghost