
Using () to delimit objects breaks auto-url-detectors

Open wmertens opened this issue 8 years ago • 40 comments

If you embed a jsurl object result in a URL as the last component, you get something like http://example.com/foo?q=~(a~'test), and if you paste that somewhere, there's a good chance that only the URL up to but not including the final ) is recognized.

One option is adding a final ~. Would that fix it?
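To illustrate the failure mode, here is a minimal sketch of a naive link detector. The function name detectUrl and its trimming rule are assumptions made for illustration only; real auto-linkers differ, but many apply a similar "drop trailing punctuation" heuristic.

```javascript
// Hypothetical linkifier: grab a run of non-space characters after the
// scheme, then trim trailing punctuation that usually belongs to the
// surrounding sentence rather than to the URL.
function detectUrl(text) {
  const match = text.match(/https?:\/\/\S+/);
  if (!match) return null;
  return match[0].replace(/[).,;:!?'"]+$/, '');
}

// The closing parenthesis is treated as sentence punctuation and lost.
const broken = detectUrl("see http://example.com/foo?q=~(a~'test)");

// With a trailing ~ the heuristic has nothing to trim.
const intact = detectUrl("see http://example.com/foo?q=~(a~'test)~");
```

Under this heuristic, broken ends in "test" without the ")", while intact keeps the whole URL.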

wmertens avatar Mar 28 '17 16:03 wmertens

I could implement this in a v2 but the problem is that a string produced by a v2 will fail to parse with a v1 parser. So far I have resisted making changes because I did not want to break protocols that use jsurl.

bjouhier avatar Mar 28 '17 19:03 bjouhier

Well, I respect that, but you can always call the encoding jsurl2 and make it clear there is no compatibility except in spirit…

My usage so far was to encode data for consumption by the same application, and I would guess that that is the major use case…

wmertens avatar Mar 28 '17 21:03 wmertens

Our situation is different because our app has several components that interact with jsurl and it is more difficult to move them all at once (especially as our components are deployed on-premise). So we need to preserve interop.

But I'm not opposed to fixing the issues with a v2. We should solve all the pending issues at once (encoded quote and trailing ~) so that we don't have to move again later.

bjouhier avatar Mar 29 '17 09:03 bjouhier

So, how about changing the initial character for jsurl2? That way, you can parse strings starting with ~ as v1 and strings starting with = (or whatever) as v2.

For the (), I realized that as you descend into a JS value, there are only a few possibilities, so if you drop some robustness, you can use any valid character to delimit blocks.

Furthermore, while parsing the inside of a block, you only need 2 characters: one to stop the block and one to go deeper. Normally these are ) and (, but they could also change on every level. So you could delimit the first block with / (will be part of the url even at end) and then alternating with | and / (for example): =/name~"John*20Doe~age~42~children~|~"Mary~"Bill|/

In fact, at each split point of the JSON structures at http://www.json.org/, you can use a different set of encoding characters. The example could also be e.g. =/name~John*_Doe~age~42~childrenMary~Bill~/, or even =/!0~John*_Doe~!1~42~!2Mary~!3~/ (with pre-shared dictionary):

  • /, | and * start objects/arrays depending on level (rotate the set on every level, note that * is not needed for escaping here)
  • " or any a-zA-Z start a string. " is only needed if a string does not start with alpha
  • -, 0-9 and . start a number, so a decimal can be .5
  • !/, !|, !* can be true, false and null. That leaves lots of address space in ! to refer to a pre-shared dictionary. Keys starting with ! could also refer to that dictionary.
  • inside properties and strings, *_ encodes a space. all of *x is available if *XX requires uppercase. E.g. *! *~ */ *| **

That should make for shorter encodings that still are fairly readable.

For robustness, a short fixed-size checksum could be added to the end, e.g. 2 characters taking the sum of all character values plus the string length, modulo 64^2, base64 (is that url safe?)
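A minimal sketch of that checksum idea, with assumptions stated: the function names are made up for illustration, and the alphabet is the URL-safe base64 variant (- and _ instead of + and /), since the standard base64 alphabet contains + and / which are not URL safe.

```javascript
// URL-safe base64 alphabet (base64url): 64 characters, none need escaping.
const ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_';

// Sum of all character codes plus the string length, modulo 64^2,
// written as two base-64 digits.
function checksum(s) {
  let sum = s.length;
  for (let i = 0; i < s.length; i++) sum += s.charCodeAt(i);
  sum %= 64 * 64;
  return ALPHABET[Math.floor(sum / 64)] + ALPHABET[sum % 64];
}

// Hypothetical helpers: append the checksum, and verify it on the way in.
function appendChecksum(s) {
  return s + checksum(s);
}

function verifyChecksum(s) {
  if (s.length < 2) return false;
  return checksum(s.slice(0, -2)) === s.slice(-2);
}
```

A plain sum like this catches most truncation (the length changes), but it cannot detect two characters being swapped.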

wmertens avatar Mar 29 '17 12:03 wmertens

I was thinking about less invasive changes. I would like to keep the parentheses. If we add a ~ at the end, do we still have a problem with parentheses?

I want the encoded string to be unaltered by encodeURIComponent (this was a strong requirement for v1). This limits the character set to ascii alpha + ascii digits + - _ . ! ~ * ' ( ) (uriUnescaped in https://www.ecma-international.org/ecma-262/5.1/#sec-15.1.3) and I would restrict even further, and eliminate '. This rules out characters like = / |.

So I'm proposing the following changes:

  • add a ~ at the end, to keep the auto-url-detectors happy. This trailing char can also be used to distinguish between v1 and v2
  • replace ' by !, to avoid browser encoding.
  • maybe a few special * escapes. I like *_ for space, maybe *- for $ (frequent in object keys because it is valid in js identifiers) but I would not go much further because gain is small and result quickly becomes cryptic.
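The encodeURIComponent requirement can be checked directly. This is a small sketch; the helper name survivesEncodeURIComponent is an assumption for illustration.

```javascript
// A string survives if encodeURIComponent returns it unchanged, i.e. it
// only contains uriUnescaped characters: A-Z a-z 0-9 - _ . ! ~ * ' ( )
function survivesEncodeURIComponent(s) {
  return encodeURIComponent(s) === s;
}

// A v1-style string passes through untouched.
const v1Ok = survivesEncodeURIComponent("~(a~'test)");

// The delimiters = / | are percent-encoded, so they fail the requirement.
const delimitersOk = survivesEncodeURIComponent('=/|');
const encodedDelimiters = encodeURIComponent('=/|');
```

Here v1Ok is true, delimitersOk is false, and encodedDelimiters is "%3D%2F%7C".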

bjouhier avatar Mar 29 '17 17:03 bjouhier

~ at the end is good, but then ~ at the beginning is no longer needed. I thought some more about it, and I think we can encode using only the unreserved characters of section 2.3 of https://www.ietf.org/rfc/rfc3986.txt, so ALPHA / DIGIT / "-" / "." / "_" / "~".

Here are the rules:

  • all values terminate with ~
  • true, false, null become -T~, -F~, -N~
  • numbers start with - (+ digit) or a digit and end with ~
  • strings start with alpha or * (the only extra non-unreserved character we use) and terminate with ~
    • strings internally get space replaced by _ (common and very readable), * by **, _ by *_, ~ by *-, % by *. and any others we like
    • I don't think we need *XX and *XXXX encoding, that will be done by uriencoding whenever actually needed. Lots of common characters can be replaced by *+single char
    • Empty string is *~
  • objects start with _, arrays start with ., both terminate with ~.
    • object keys are encoded as strings, so no starting * needed, only * escaping is done
    • [1, 2] becomes .1~2~~
    • {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes _a~fo*.o~*_test~**_hm**h*-m~5~.1~-T~~~

This way, the ending ~ doubles as the value terminator. Any value can be extracted by reading until the next ~. As a bonus, no value starts with ~ so that can distinguish v1

* is not actually 100% needed if we want to stay pure; . or - could serve as the escape characters with some adjustments.
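The rules listed in this comment can be sketched as a small encoder. This is an illustration under assumptions, not the jsurl2 implementation: the names encode, encodeString and escapeText are made up, only the five escapes named in the rules are handled, and empty keys and non-finite numbers are not handled.

```javascript
// Escapes inside strings and keys: space -> _, * -> **, _ -> *_,
// ~ -> *-, % -> *.
const ESCAPES = { ' ': '_', '*': '**', '_': '*_', '~': '*-', '%': '*.' };

function escapeText(s) {
  return s.replace(/[ *_~%]/g, (ch) => ESCAPES[ch]);
}

// String values need a leading * unless they start with a letter.
function encodeString(s) {
  const marker = /^[A-Za-z]/.test(s) ? '' : '*';
  return marker + escapeText(s) + '~';
}

function encode(value) {
  if (value === true) return '-T~';
  if (value === false) return '-F~';
  if (value === null) return '-N~';
  if (typeof value === 'number') return String(value) + '~';
  if (typeof value === 'string') return encodeString(value);
  if (Array.isArray(value)) return '.' + value.map(encode).join('') + '~';
  if (typeof value === 'object') {
    // Keys are always strings, so they get escaping but no * marker.
    const body = Object.entries(value)
      .map(([key, v]) => escapeText(key) + '~' + encode(v))
      .join('');
    return '_' + body + '~';
  }
  throw new TypeError('unsupported value: ' + typeof value);
}
```

One caveat when comparing against the example above: JavaScript enumerates integer-like keys such as "5" before other keys, so this sketch emits the 5 entry first rather than last.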

wmertens avatar Mar 31 '17 12:03 wmertens

One more optimization: collapse the repeated ~ at the end into a single ~, and to grab a value, search until the next ~ or the end of the string. Then the standard example becomes _name~John_Doe~age~42~children~.Mary~Bill~
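A sketch of that collapsing step, assuming the tilde-terminated encoding from the earlier comment. The function name is hypothetical. It relies on literal ~ inside strings being escaped as *-, so every raw ~ at the end is a terminator.

```javascript
// Collapse the run of terminators at the very end into a single ~.
// A decoder then has to treat end-of-input as closing every open
// array or object.
function collapseTrailingTildes(encoded) {
  return encoded.replace(/~+$/, '~');
}

// Full form: Bill~ ends the string, then ~ closes the array and
// another ~ closes the object.
const full = '_name~John_Doe~age~42~children~.Mary~Bill~~~';
const short = collapseTrailingTildes(full);
```

Here short is _name~John_Doe~age~42~children~.Mary~Bill~, two characters shorter than the full form.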

wmertens avatar Mar 31 '17 15:03 wmertens

Lots of good ideas here but I want to understand why you want to get rid of parentheses. Lots of URLs have parentheses, and parentheses are a good visual clue for nested substructures.

bjouhier avatar Apr 01 '17 14:04 bjouhier

They are not guaranteed to be left alone, and by making ~ the terminator for everything, parsing is faster…

wmertens avatar Apr 01 '17 14:04 wmertens

(plus auto-url detection works better with ~, and we save a few bytes at the end of the string by merging ~s)

wmertens avatar Apr 01 '17 14:04 wmertens

More detailed comments:

  • all values terminate with ~: OK
  • true, false, null become -T~, -F~, -N~: OK
  • numbers start with - (+ digit) or a digit and end with ~: OK
  • strings start with alpha or * (the only extra non-unreserved character we use) and terminate with ~: OK
    • strings internally get space replaced by _ (common and very readable), * by **, _ by *_, ~ by *-, % by *. and any others we like: OK for space, others need discussion
    • I don't think we need *XX and *XXXX encoding, that will be done by uriencoding whenever actually needed. Lots of common characters can be replaced by *+single char: KO, jsurl shouldn't rely on a uriencoding pass
    • Empty string is *~: OK, clever
  • objects start with _, arrays start with ., both terminate with ~: I'd like to keep parens, at least around objects
    • object keys are encoded as strings, so no starting * needed, only * escaping is done: OK
    • [1, 2] becomes .1~2~~
    • {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes _a~fo*.o~*_test~**_hm**h*-m~5~.1~-T~~~

bjouhier avatar Apr 01 '17 14:04 bjouhier

When would parentheses get escaped? They are uriUnescaped (but ' was too) and I have never seen them being escaped.

bjouhier avatar Apr 01 '17 14:04 bjouhier

There is a problem with strings starting with a number. How do you encode "0"?

bjouhier avatar Apr 01 '17 14:04 bjouhier

Well another reason for not using () is that you then need an extra char to start an array and I wanted to minimize byte length. Plus, they are part of the "reserved" set, and most of those get encoded anyway. (so is * but replacing that with - or _ would make things uglier)

"0" becomes *0~.

wmertens avatar Apr 01 '17 15:04 wmertens

We could keep ! too. Then I'd rather do the following:

  • true, false, null become T~, F~, N~ (shorter, and leading - felt strange).
  • strings start with !.

*0~ feels like a hack. What about "20"? It cannot be *20~ as this would be a space. Is it *2*0~? That would be bad for us because we are passing decimal values as strings to avoid precision problems with JS numbers.

bjouhier avatar Apr 01 '17 15:04 bjouhier

Parentheses are not uriReserved, they are uriUnescaped.

bjouhier avatar Apr 01 '17 15:04 bjouhier

The code relies on the fact that only a limited set of characters can appear at the beginning of a value. All cases are in the if clauses at https://github.com/wmertens/jsurl/blob/4ffcdea624eb29070bd6c44510e438b46799e986/lib/jsurl2.js#L71 - I tried to optimize for stringified length. So strings only start with * (or ! if they are not unambiguously strings).

Parentheses are in section 2.2 "Reserved Characters" https://tools.ietf.org/html/rfc3986#section-2.2 - although wikipedia says that means they can be used. I must say, if I paste ! $ & ' ( ) * + , ; = in the URL bar in Chrome, only ' gets escaped, and behind a # none get escaped.

How about starting objects with ( but still terminating with ~?

wmertens avatar Apr 01 '17 15:04 wmertens

I must say, I really like the _ for space, it makes embedded spaces easy to read.

As for the URI encoding, I was reasoning thusly:

  • you have no control over URI encoding, and if it happens anyway, why not let the fast native functions do it? It can recover from it in any case.
  • If you let native handle it, then embedded unicode is readable in the address bar
  • It frees up escaped address space for other purposes; I'd rather escape common encoded chars in 2 chars instead of 3.

wmertens avatar Apr 01 '17 15:04 wmertens

Oh and *20~ is "20". If we do our own encoding still it would be **20~. * is only escape inside string values.

wmertens avatar Apr 01 '17 15:04 wmertens

And we could omit the leading ! for object keys if the key starts with alpha.

bjouhier avatar Apr 01 '17 15:04 bjouhier

That already happens, object keys are string context so they don't need a string marker…

wmertens avatar Apr 01 '17 15:04 wmertens

Point taken about the generic URL RFC. I was referring to the specs for the JS URL handling functions: https://www.ecma-international.org/ecma-262/5.1/#sec-15.1.3. I care most about the JS functions because that's what JS developers use to encode/decode.

I like _ for embedded space too.

OK for leaving non-ASCII chars as is instead of encoding with **. More compact and more readable.

I'd like to have the closing parenthesis at the end of objects too. The whole point is to trade a bit of compactness (one extra char at the end - wtf) for readability. Without it, it is very difficult to see where the object ends.

I had misunderstood the leading * in strings. I thought that it was the start of an escape sequence.

What about prefixing T, F and N by ! instead of -? (I find the "- followed by digit" vs. "- followed by letter" rule a bit too hacky.)

bjouhier avatar Apr 01 '17 15:04 bjouhier

Note: with this, a non-empty object looks like (<...>~)~ and a non-empty array like .<...>~~. So we have an unambiguous end marker for objects (")~") and for arrays ("~~").

And then we could use _T, _F and _N because _ is not reserved for object start any more.

bjouhier avatar Apr 01 '17 16:04 bjouhier

Right, and actually you can drop ~ before ), if strings cannot contain ). Then ) is unambiguous and the initial parse split can split on ~ or ). So then there is no byte cost, and the string end can replace all ) and ~ with a single ~ still.

Actually I like !T etc, it doesn't read a

wmertens avatar Apr 01 '17 16:04 wmertens

Summary of revised proposal:

  • all values terminate with ~
  • true, false, null become _T~, _F~, _N~
  • numbers start with - (+ digit) or a digit and end with ~
  • strings start with alpha or * (the only extra non-unreserved character we use) and terminate with ~
    • strings internally get space replaced by _ (common and very readable), * by **, _ by *_, ~ by *-, % by *..
    • I don't think we need *XX and *XXXX encoding, that will be done by uriencoding whenever actually needed.
    • Empty string is *~
  • objects start with ( and end with )~
  • arrays start with ., and end with ~
  • object keys are encoded as strings, so no starting * needed, only * escaping is done
    • [1, 2] becomes .1~2~~
    • {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes (a~fo*.o~*_test~**_hm**h*-m~5~.1~_T~~)~
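To check that the summary is parseable, here is a minimal decoder sketch that dispatches on the first character of each value. The names decode and unescapeText are assumptions for illustration, and the sketch skips edge cases such as keys that begin with ).

```javascript
// Reverse of the string escapes: bare _ is a space, ** is *, *_ is _,
// *- is ~, *. is %.
const UNESCAPES = { '*': '*', '_': '_', '-': '~', '.': '%' };

function unescapeText(raw) {
  return raw.replace(/\*(.)|_/g, (m, c) => (m === '_' ? ' ' : UNESCAPES[c] ?? c));
}

function decode(input) {
  let pos = 0;

  // Read everything up to the next ~ and step past it.
  function token() {
    const end = input.indexOf('~', pos);
    if (end < 0) throw new SyntaxError('unterminated value at ' + pos);
    const raw = input.slice(pos, end);
    pos = end + 1;
    return raw;
  }

  function value() {
    if (pos >= input.length) throw new SyntaxError('unexpected end of input');
    const ch = input[pos];
    if (ch === '(') {
      pos++;
      const obj = {};
      while (input[pos] !== ')') {
        if (pos >= input.length) throw new SyntaxError('unterminated object');
        const key = unescapeText(token());
        obj[key] = value();
      }
      pos += 2; // skip the closing ")~"
      return obj;
    }
    if (ch === '.') {
      pos++;
      const arr = [];
      while (input[pos] !== '~') arr.push(value());
      pos++; // skip the closing "~"
      return arr;
    }
    const raw = token();
    if (raw === '_T') return true;
    if (raw === '_F') return false;
    if (raw === '_N') return null;
    if (raw[0] === '*') return unescapeText(raw.slice(1));
    if (/^[A-Za-z]/.test(raw)) return unescapeText(raw);
    const n = Number(raw);
    if (raw === '' || Number.isNaN(n)) throw new SyntaxError('bad value: ' + raw);
    return n;
  }

  return value();
}
```

Feeding it the object example from the summary gives back a: "fo%o", _test: "_hm*h~m" and 5: [1, true].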

bjouhier avatar Apr 01 '17 16:04 bjouhier

...as a string.

wmertens avatar Apr 01 '17 16:04 wmertens

What about having arrays start with ~ rather than . and end with ~. As they usually follow another value, it gives them a nice ~~<...>~~ symmetry.

bjouhier avatar Apr 01 '17 16:04 bjouhier

Also, the "force string start" char could be _. Then the final example becomes (a~fo*.o~test~_hm**h*-m~5~.1~!T~

(sorry on mobile)

wmertens avatar Apr 01 '17 16:04 wmertens

That can work, it would take the ~ special case for true but that's no biggie

wmertens avatar Apr 01 '17 16:04 wmertens

I too was thinking of dropping the ~ after ). Only gotcha is the url-auto-detector issue that started this whole thing 😄.

bjouhier avatar Apr 01 '17 16:04 bjouhier