JScrewIt icon indicating copy to clipboard operation
JScrewIt copied to clipboard

Shorter encoding using toString

Open 837951602 opened this issue 8 years ago • 10 comments

Let's say we encoding 'hhhhh'. I directly type 'hhhhh' and get 16679 chars. Even if I select 'All new browsers', 6014 chars are generated. However 1889567..toString(18) also be 'hhhhh' and it's length is 3532 chars(All env.) and 1611 chars(new browser),

837951602 avatar Dec 30 '16 04:12 837951602

Thanks for reporting this issue. I'm sure JScrewIt could do much better in merging adjacent characters like in the case you describe.

The reason why I've always shunned this optimization is that it's hard to get it right without trying a potentially very large number of functions, parameters, and most of all start and end points. To summarize:

  • atob, btoa, toString, toUpperCase and unescape can all handle multiple characters at the same time. When in doubt what to choose, try all available options.
  • toString accepts a radix parameter that can make a difference: 16223185..toString(31) evaluates to "hhhhh" and is shorter than 1889567..toString(18).
  • In longer strings, it is not evident what adjacent characters work well when merged together: is "hasty" best written as a toString expression? What about "hash"? Note that the complexity grows quadratically with the length of the string.

As a first step, maybe trying to encode the whole string as a toString expression would be fine. I'll be thinking about this.

fasttime avatar Dec 30 '16 08:12 fasttime

'toString' costs 3241 and numbers are cheap that 999999999999999 costs 705. If there's two high-cost letters near, merging them will always give positive effect. 'g' can be treated as half an high-cost letter. unescape('%aa')+'bb'+unescape('cc') equals to unescape('%aa'+'bb'.replace('%','%25')+'cc'). As numbera are cheap, the first be better only if there're too many '%'s. Low-cost characters may be left away. Won't differ too much.

837951602 avatar Dec 30 '16 13:12 837951602

Thanks a lot for your ideas. It is important to me that new optimizations make encoding not just better, but as good as possible, while preserving linear complexity. JScrewIt can calculate the length of each single character quite efficiently, so there is no need for approximate cost estimations.

Since you are the first person to ask for this long-due optimization, I've decided I'm going to start working on this after the next release, so don't expect anything before version 2.6.0. I'll be posting updates here as the work progresses.

fasttime avatar Dec 30 '16 21:12 fasttime

I've started to work on this issue. The next release branch already contains the logic to encode multiple characters in a toString group. This looks all very promising but there is still a lot to be done.

fasttime avatar Feb 05 '17 15:02 fasttime

11e-13 is +(+!![]+(+(+!![]+[+!![]]+(!![]+[])[!![]+!![]+!![]]+(!![]+!![])+(+[]))+[])[+!![]]+(+!![])+(!![]+[])[!![]+!![]+!![]]+(+((+(+!![]+[+!![]]+(!![]+[])[!![]+!![]+!![]]+(!![]+!![])+(+[]))+[])[+!![]]+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+!![]))+[])[!![]+!![]]+(+!![])+(!![]+!![])) while +'11e-13' is +(+!![]+[+!![]]+(!![]+[])[!![]+!![]+!![]]+(+((+(+!![]+[+!![]]+(!![]+[])[!![]+!![]+!![]]+(!![]+!![])+(+[]))+[])[+!![]]+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+!![]))+[])[!![]+!![]]+(+!![])+(!![]+!![]+!![])) and +'0.0000000000011' is +(+[]+(+(+!![]+[+!![]]+(!![]+[])[!![]+!![]+!![]]+(!![]+!![])+(+[]))+[])[+!![]]+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+!![])+(+!![]))

837951602 avatar Feb 06 '17 14:02 837951602

This happens because 11e-13 is being encoded as +"1.1e-12", in line with its canonical representation. There will be some improvements to numeric encoding in the next patch release, but the case you note still needs some research, particularly because I'm not sure that all engines would handle all edge cases correctly.

fasttime avatar Feb 06 '17 21:02 fasttime

If 1.1e21 can be written into 11e20, why not 1.1e-12 into 11e-13? Maybe a new issue is a better choice if the part of number need deeper talk?

837951602 avatar Feb 07 '17 14:02 837951602

@837951602 Sure, feel free to open a new issue.

fasttime avatar Feb 07 '17 15:02 fasttime

toString optimization is now complete in branch 2.6. Not perfect, but complete. I'll be surely working to improve this and add support for more optimizations like atob clusters in future versions. Also, numeric encoding has become better, and all tests are green, phew.

fasttime avatar Mar 10 '17 12:03 fasttime

I've just completed a new optimizer: the complex optimizer uses the same principle as the toString optimizer, and has now replaced the older complex encoding mechanism. The optimizations are still the same, for example having "Number" expressed as 0..constructor.name when it is convenient.

Although this change is mostly a streamlined new implementation of existing functionality, there are a few exceptional cases where the output length would actually improve (e.g. with input "Stringpool"), and that's because the complex optimizer does not take precedence over other optimizations, unlike the old implementation.

fasttime avatar Apr 15 '17 08:04 fasttime