Using () to delimit objects breaks auto-url-detectors
If you embed a jsurl object result in a URL as the last component, you get something like http://example.com/foo?q=~(a~'test), and if you paste that somewhere, there's a good chance that only the URL up to but not including the final ) is recognized.
One option is adding a final ~, that fixes it?
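To make the failure mode concrete, here is a minimal sketch of the kind of heuristic an auto-linker might apply. The function name `naiveAutolink` and the set of stripped characters are assumptions for illustration only; real detectors differ in which trailing characters they drop.

```javascript
// Hypothetical heuristic: some auto-linkers drop trailing punctuation from a
// detected URL. The character set below is an assumption, not a real detector's.
function naiveAutolink(text) {
  const match = text.match(/https?:\/\/\S+/);
  if (!match) return null;
  // Strip trailing characters that usually belong to the surrounding sentence.
  return match[0].replace(/[.,;:!?'")\]]+$/, "");
}

const v1 = "see http://example.com/foo?q=~(a~'test)";
const withTilde = "see http://example.com/foo?q=~(a~'test)~";

console.log(naiveAutolink(v1));        // the final ")" is lost
console.log(naiveAutolink(withTilde)); // the URL survives intact
```

Under this assumed heuristic, a trailing ~ protects the closing parenthesis because ~ is not treated as sentence punctuation.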
I could implement this in a v2 but the problem is that a string produced by a v2 will fail to parse with a v1 parser. So far I have resisted making changes because I did not want to break protocols that use jsurl.
Well, I respect that, but you can always call the encoding jsurl2 and make it clear there is no compatibility except in spirit…
My usage so far was to encode data for consumption by the same application, and I would guess that that is the major use case…
Our situation is different because our app has several components that interact with jsurl and it is more difficult to move them all at once (especially as our components are deployed on-premise). So we need to preserve interop.
But I'm not opposed to fixing the issues with a v2. We should solve all the pending issues at once (encoded quote and trailing ~) so that we don't have to move again later.
So, how about changing the initial character for jsurl2? That way, you can parse ~ starting strings as v1 and = (or whatever) as v2
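A rough sketch of that dispatch, assuming = as the v2 marker (it is only the example floated here, and `detectVersion` is a hypothetical helper name):

```javascript
// Sketch: pick a parser based on the first character of the encoded string.
// "=" as the v2 marker is only the example suggested above, not a decision.
function detectVersion(encoded) {
  if (encoded.startsWith("~")) return 1; // v1 strings start with "~"
  if (encoded.startsWith("=")) return 2; // hypothetical v2 marker
  throw new Error("unrecognized jsurl prefix: " + encoded.charAt(0));
}

console.log(detectVersion("~(a~'test)")); // 1
console.log(detectVersion("=/name~John/")); // 2
```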
For the (), I realized that as you descend into a JS value, there are only a few possibilities, so if you drop some robustness, you can use any valid character to delimit blocks.
Furthermore, while parsing the inside of a block, you only need 2 characters: one to stop the block and one to go deeper. Normally these are ) and (, but they could also change on every level. So you could delimit the first block with / (will be part of the url even at end) and then alternating with | and / (for example): =/name~"John*20Doe~age~42~children~|~"Mary~"Bill|/
In fact, at each split point of the JSON structures at http://www.json.org/, you can use a different set of encoding characters. The example could also be e.g. =/name~John*_Doe~age~42~childrenMary~Bill~/, or even =/!0~John*_Doe~!1~42~!2Mary~!3~/ (with pre-shared dictionary):
- /, | and * start objects/arrays depending on level (rotate the set on every level, note that * is not needed for escaping here)
- " or any a-zA-Z start a string. " is only needed if a string does not start with alpha
- -, 0-9 and . start a number, so a decimal can be .5
- !/, !|, !* can be true, false and null. That leaves lots of address space in ! to refer to a pre-shared dictionary. Keys starting with ! could also refer to that dictionary.
- inside properties and strings, *_ encodes a space. all of *x is available if *XX requires uppercase. E.g. *! *~ */ *| **
That should make for shorter encodings that still are fairly readable.
For robustness, a short fixed-size checksum could be added to the end, e.g. 2 characters taking the sum of all character values plus the string length, modulo 64^2, base64 (is that url safe?)
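On the URL-safety question: the standard base64 alphabet contains + and /, which are not URL-safe, but the base64url variant from RFC 4648 swaps them for - and _. A minimal sketch of the checksum under that assumption follows; the helper names are made up for illustration, and "character values" is taken to mean UTF-16 code units.

```javascript
// URL-safe base64 alphabet (RFC 4648 "base64url"): "-" and "_" replace "+" and "/".
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Sum of all UTF-16 code units plus the string length, modulo 64^2,
// written as two base64url digits (most significant first).
function checksum(encoded) {
  let sum = encoded.length;
  for (let i = 0; i < encoded.length; i++) {
    sum = (sum + encoded.charCodeAt(i)) % 4096;
  }
  return ALPHABET[sum >> 6] + ALPHABET[sum & 63];
}

function appendChecksum(encoded) {
  return encoded + checksum(encoded);
}

// Returns true when the last two characters match the checksum of the rest.
function verifyChecksum(withChecksum) {
  if (withChecksum.length < 2) return false;
  return checksum(withChecksum.slice(0, -2)) === withChecksum.slice(-2);
}
```

A plain sum like this catches truncation and most single-character changes, but not transposed characters, since the sum is order-independent.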
I was thinking about less invasive changes. I would like to keep the parentheses. If we add a ~ at the end, do we still have a problem with parentheses?
I want the encoded string to be unaltered by encodeURIComponent (this was a strong requirement for v1). This limits the character set to ascii alpha + ascii digits + - _ . ! ~ * ' ( ) (uriUnescaped in https://www.ecma-international.org/ecma-262/5.1/#sec-15.1.3) and I would restrict even further, and eliminate '. This rules out characters like = / |.
So I'm proposing the following changes:
- add a ~ at the end, to keep the auto-url-detectors happy. This trailing char can also be used to distinguish between v1 and v2
- replace ' by !, to avoid browser encoding.
- maybe a few special * escapes. I like *_ for space, maybe *- for $ (frequent in object keys because it is valid in js identifiers) but I would not go much further because gain is small and result quickly becomes cryptic.
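The encodeURIComponent requirement mentioned above is easy to check directly. A small sketch (the helper name `survivesEncoding` is made up for illustration):

```javascript
// Characters that encodeURIComponent leaves untouched (uriUnescaped in ES5.1 15.1.3).
const unescaped =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.!~*'()";

// True when a string passes through encodeURIComponent unchanged.
function survivesEncoding(s) {
  return encodeURIComponent(s) === s;
}

console.log(survivesEncoding(unescaped));     // true
console.log(survivesEncoding("~(a~'test)~")); // true
console.log(encodeURIComponent("=/|"));       // "%3D%2F%7C"
```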
~ at the end is good, but then ~ at the beginning is no longer needed. I thought some more about it, and I think we can encode using only the unreserved characters of section https://www.ietf.org/rfc/rfc3986.txt, so ALPHA / DIGIT / "-" / "." / "_" / "~".
Here are the rules:
- all values terminate with ~
- true, false, null become -T~, -F~, -N~
- numbers start with - (+ digit) or a digit and end with ~
- strings start with alpha or * (the only extra non-unreserved character we use) and terminate with ~
- strings internally get space replaced by _ (common and very readable), * by **, _ by *_, ~ by *-, % by *. and any others we like
- I don't think we need *XX and *XXXX encoding, that will be done by uriencoding whenever actually needed. Lots of common characters can be replaced by *+single char
- Empty string is *~
- objects start with _, arrays start with ., both terminate with ~.
- object keys are encoded as strings, so no starting * needed, only * escaping is done
- [1, 2] becomes .1~2~~
- {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes _a~fo*.o~*_test~**_hm**h*-m~5~.1~-T~~~
This way, the ending ~ doubles as the value terminator. Any value can be extracted by reading until the next ~. As a bonus, no value starts with ~ so that can distinguish v1
The * is not actually 100% needed if we want to stay pure; . or - could serve as the escape character with some adjustments.
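The rules above can be sketched as a small encoder. This is an illustration written from the rule list, not the actual jsurl2 code; the names `encode`, `escapeText` and `ESCAPES` are assumptions, and numbers are assumed to be finite and to print in plain decimal form.

```javascript
// Sketch of an encoder for the rules listed above (not the jsurl2 implementation).
const ESCAPES = { " ": "_", "*": "**", "_": "*_", "~": "*-", "%": "*." };

function escapeText(s) {
  return s.replace(/[ *_~%]/g, (c) => ESCAPES[c]);
}

function encode(value) {
  if (value === true) return "-T~";
  if (value === false) return "-F~";
  if (value === null) return "-N~";
  // Assumes a finite number whose String() form is plain decimal.
  if (typeof value === "number") return String(value) + "~";
  if (typeof value === "string") {
    // A leading "*" marks a string that does not start with a letter (including "").
    const marker = /^[A-Za-z]/.test(value) ? "" : "*";
    return marker + escapeText(value) + "~";
  }
  if (Array.isArray(value)) {
    return "." + value.map(encode).join("") + "~";
  }
  if (typeof value === "object") {
    // Keys are in string context: escaped, but never prefixed with the "*" marker.
    const body = Object.entries(value)
      .map(([key, v]) => escapeText(key) + "~" + encode(v))
      .join("");
    return "_" + body + "~";
  }
  throw new TypeError("unsupported value: " + typeof value);
}

console.log(encode([1, 2]));    // ".1~2~~"
console.log(encode("_hm*h~m")); // "**_hm**h*-m~"
```

Note that JavaScript enumerates integer-like keys such as "5" before other keys, so encoding the object from the example yields the same pieces in a different order.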
One more optimization: collapse the repeated final ~ into a single ~, and to grab a value, search until ~ or the end of the string. Then the standard example becomes _name~John_Doe~age~42~children~.Mary~Bill~
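A sketch of a decoder that tolerates the collapsed ending, by treating end of input as closing every open array and object. It follows the rule list from earlier in the thread (-T/-F/-N, _ for objects, . for arrays) and is an illustration, not the jsurl2 parser; all names are assumptions, and it does not validate malformed input.

```javascript
// Reverse of the escape table: "*x" pairs map back to single characters.
const UNESCAPES = { "*": "*", "_": "_", "-": "~", ".": "%" };

function unescapeText(token) {
  // "*x" is an escape pair; a bare "_" is a space.
  return token.replace(/\*(.)|_/g, (m, c) =>
    m === "_" ? " " : (c in UNESCAPES ? UNESCAPES[c] : c)
  );
}

function decode(input) {
  let pos = 0;

  // Read up to the next "~" (or the end of input) and step past the terminator.
  function readToken() {
    let end = input.indexOf("~", pos);
    if (end === -1) end = input.length;
    const token = input.slice(pos, end);
    pos = end + 1;
    return token;
  }

  // True when the current container is closed, by "~" or by end of input.
  function atContainerEnd() {
    if (pos >= input.length) return true;
    if (input[pos] === "~") { pos++; return true; }
    return false;
  }

  function parseValue() {
    const c = input[pos];
    if (c === ".") {
      pos++;
      const items = [];
      while (!atContainerEnd()) items.push(parseValue());
      return items;
    }
    if (c === "_") {
      pos++;
      const obj = {};
      while (!atContainerEnd()) {
        const key = unescapeText(readToken());
        obj[key] = parseValue();
      }
      return obj;
    }
    if (c === "*") { pos++; return unescapeText(readToken()); }
    const token = readToken();
    if (token === "-T") return true;
    if (token === "-F") return false;
    if (token === "-N") return null;
    if (/^[-0-9]/.test(token)) return Number(token);
    return unescapeText(token); // starts with a letter: plain string
  }

  return parseValue();
}

console.log(decode("_name~John_Doe~age~42~children~.Mary~Bill~"));
```

The same function also accepts the uncollapsed form ending in ~~~, so the collapse only has to happen on the encoding side.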
Lots of good ideas here but I want to understand why you want to get rid of parentheses. Lots of URLs have parentheses, and parentheses are a good visual clue for nested substructures.
They are not guaranteed to be left alone, and by making ~ the terminator for everything, parsing is faster…
(plus auto-url detection works better with ~, and we save a few bytes at the end of the string by merging ~s)
More detailed comments:
- all values terminate with ~ (OK)
- true, false, null become -T~, -F~, -N~ (OK)
- numbers start with - (+ digit) or a digit and end with ~ (OK)
- strings start with alpha or * (the only extra non-unreserved character we use) and terminate with ~ (OK)
  - strings internally get space replaced by _ (common and very readable), * by **, _ by *_, ~ by *-, % by *. and any others we like (OK for space, others need discussion)
  - I don't think we need *XX and *XXXX encoding, that will be done by uriencoding whenever actually needed. Lots of common characters can be replaced by *+single char (KO: jsurl shouldn't rely on a uriencoding pass)
  - Empty string is *~ (OK, clever)
- objects start with _, arrays start with ., both terminate with ~. (I'd like to keep parens, at least around objects)
  - object keys are encoded as strings, so no starting * needed, only * escaping is done (OK)
  - [1, 2] becomes .1~2~~
  - {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes _a~fo*.o~*_test~**_hm**h*-m~5~.1~-T~~~
When would parentheses get escaped? They are uriUnescaped (but ' was too) and I have never seen them being escaped.
There is a problem with strings starting with a number. How do you encode "0"?
Well another reason for not using () is that you then need an extra char to start an array and I wanted to minimize byte length. Plus, they are part of the "reserved" set, and most of those get encoded anyway. (so is * but replacing that with - or _ would make things uglier)
"0" becomes *0~.
We could keep ! too. Then I'd rather do the following:
- true, false, null become T~, F~, N~ (shorter, and the leading - felt strange).
- strings start with !.
*0~ feels like a hack. What about "20"? It cannot be *20~ as this would be space. Is it *2*0~? That would be bad for us because we pass decimal values as strings to avoid precision problems with JS numbers.
Parentheses are not uriReserved, they are uriUnescaped.
So the code works by the fact that at the beginning of a value there are only a number of possible characters. All cases are in the if clauses at https://github.com/wmertens/jsurl/blob/4ffcdea624eb29070bd6c44510e438b46799e986/lib/jsurl2.js#L71 - I tried to optimize for stringified length. So strings only start with * (or !) if they are not unambiguously strings.
Parentheses are in section 2.2 "Reserved Characters" https://tools.ietf.org/html/rfc3986#section-2.2 - although wikipedia says that means they can be used. I must say, if I paste ! $ & ' ( ) * + , ; = in the URL bar in Chrome, only ' gets escaped, and behind a # none get escaped.
How about starting objects with ( but still terminating with ~?
I must say, I really like the _ for space, it makes embedded spaces easy to read.
As for the URI encoding, I was reasoning thusly:
- you have no control over URI encoding, and if it happens anyway, why not let the fast native functions do it? It can recover from it in any case.
- If you let native handle it, then embedded unicode is readable in the address bar
- It frees up escaped address space for other purposes; I'd rather escape common encoded chars in 2 chars instead of 3.
Oh and *20~ is "20". If we do our own encoding still it would be **20~. * is only escape inside string values.
And we could omit the leading ! for object keys if the key starts with alpha.
That already happens, object keys are string context so they don't need a string marker…
Point taken about generic URL RFC. I was referring to the specs for JS URL handling functions: https://www.ecma-international.org/ecma-262/5.1/#sec-15.1.3. I care most about the JS functions because that's what JS guys use to encode/decode.
I like _ for embedded space too.
OK for leaving non-ASCII chars as is instead of encoding with **. More compact and more readable.
I'd like to have the closing parenthesis at the end of objects too. The whole point is to trade a bit of compactness (one extra char at the end - wtf) for readability. Without it, it is very difficult to see where the object ends.
I had misunderstood the leading * in strings. I thought that it was the start of an escape sequence.
What about prefixing T, F and N by ! instead of -? I find the "- followed by digit" vs. "- followed by letter" rule a bit too hacky.
Note: with this, a non-empty object looks like (<...>~)~ and a non-empty array like .<...>~~. So we have unambiguous end markers: )~ for objects and ~~ for arrays.
And then we could use _T, _F and _N because _ is not reserved for object start any more.
Right, and actually you can drop ~ before ), if strings cannot contain ). Then ) is unambiguous and the initial parse split can split on ~ or ). So then there is no byte cost, and the string end can replace all ) and ~ with a single ~ still.
Actually I like !T etc, it doesn't read a
Summary of revised proposal:
- all values terminate with ~
- true, false, null become _T~, _F~, _N~
- numbers start with - (+ digit) or a digit and end with ~
- strings start with alpha or * (the only extra non-unreserved character we use) and terminate with ~
  - strings internally get space replaced by _ (common and very readable), * by **, _ by *_, ~ by *-, % by *..
  - I don't think we need *XX and *XXXX encoding, that will be done by uriencoding whenever actually needed.
  - Empty string is *~
- objects start with ( and end with )~
- arrays start with ., and end with ~
- object keys are encoded as strings, so no starting * needed, only * escaping is done
  - [1, 2] becomes .1~2~~
  - {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes (a~fo*.o~*_test~**_hm**h*-m~5~.1~_T~~)~
...as a string.
What about having arrays start with ~ rather than . and end with ~? As they usually follow another value, it gives them a nice ~~<...>~~ symmetry.
Also, the "force string start" char could be _. Then the final example becomes (a~fo*.o~test~_hm**h*-m~5~.1~!T~
(sorry on mobile)
That can work, it would take the ~ special case for true but that's no biggie
I too was thinking of dropping the ~ after ). Only gotcha is the url-auto-detector issue that started this whole thing 😄.