
A cornucopia of optimisations

Open JobLeonard opened this issue 7 years ago • 89 comments

This combines various suggestions from @Rycochet and @lishid and myself. See #88 and #97.

  • replace string-based keyStr.charAt(i) with array-based keyStr[i]
  • add compressToArray and decompressFromArray functions, and use them throughout the code to avoid intermediate string creation:
    • compressToBase64 avoids appending to string
    • compressToUint8Array avoids an intermediate string
    • decompressFromUint8Array avoids an intermediate string
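The pattern behind those "avoids appending to string"/"avoids an intermediate string" items is, roughly, to collect output characters in an array and join once at the end instead of growing a string. A minimal illustrative sketch of the difference (made-up helper names, not the PR's actual functions):

// appending to a string allocates a new string on every iteration...
function codesToStringNaive(codes) {
  var res = "";
  for (var i = 0; i < codes.length; i++) {
    res += String.fromCharCode(codes[i]);
  }
  return res;
}

// ...whereas pushing to an array and joining once avoids the intermediate
// strings, which is what the compressToArray-based variants build on.
function codesToStringViaArray(codes) {
  var res = [];
  for (var i = 0; i < codes.length; i++) {
    res.push(String.fromCharCode(codes[i]));
  }
  return res.join('');
}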

Similar optimisations have been applied to base64-string.js, but it doesn't have tests so I can't say for sure if it's bug-free.

JobLeonard avatar Jun 26 '17 19:06 JobLeonard

Some benchmarks:

plain compress/decompress: http://jsbench.github.io/#2a07fd14b55d44291da4b06d3ba6e5c3

base64: http://jsbench.github.io/#da32b5d1100c24c7744bb58ca3fff440

UTF16: http://jsbench.github.io/#4029cae03b1e2fc06ad44e35f5bfca6b

Uri: http://jsbench.github.io/#54c40822dabdbd46a93fb0b7ff6832d9

Uint8: http://jsbench.github.io/#14395aeb1f452cefa52b5f86b0f644c5

I'd say performance is probably a bit better, but the difference is almost negligible and within the margin of error.

Given the changes to the code, there must be less allocation going on, so that must have improved a bit at least.

I had them all in one big benchmark, but then I tried to profile them using DevTools, which suggested that compress and decompress were being deoptimised! The reason, I think, is that I used all of the functions, so a different closure got passed on each call, de-opting the compress function so often that V8 was like "screw this!". So I split it into different benchmarks hoping to mitigate that, since real-life usage doesn't usually touch all the functions anyway, and since it's unclear when the de-opt hits it made the single big benchmark less reliable.

Please check if the input data is relevant - I basically took the test string (the one about tattoos) and tripled it, causing a lot of redundancy.

JobLeonard avatar Jun 27 '17 14:06 JobLeonard

I was recently optimising some startup code in an app, and found that the GC can actually have a significant impact (up to 200ms for 16MB) - but as it's not something easily forced, it's hard to get consistent profiling that includes it, unless you run a significantly longer test with setImmediate()-style callbacks and known, consistent memory allocation and freeing.

TL;DR: Reducing the need for GC is always good ;-)

Rycochet avatar Jun 27 '17 16:06 Rycochet

The main issue is that it's not easy to see if it's significant in the larger scheme of things - perhaps all computation and memory allocation happens in the core algorithm that I didn't touch.

I think I should set up a plain website with this code, run it through both functions (let's just use the plain compress/decompress as a baseline) with a yuuuge string repeatedly, while running the browser performance tools to keep track of allocations and such. That might give a better overview.

JobLeonard avatar Jun 27 '17 16:06 JobLeonard

Deoptimization is triggered by various patterns (some links: Deoptimization in V8, Optimization-killers). In your tests, compress invocations supposedly use the same parameter signature/types so the actual problem is within the function itself. In my experience, even if there are no obvious deopt triggers in the code, it may be caused by the sheer size of the function and splitting solved that for me. Preventing deoptimization is very important as its impact often/usually negates any perf gains, but unfortunately I'm not an expert either.

tophf avatar Jun 27 '17 16:06 tophf

Hmm, time to dive into this a little deeper and see if we can fix that. Still on sick leave anyway so this is a fun side-track :)

JobLeonard avatar Jun 27 '17 17:06 JobLeonard

The thing I was referring to is that compress is passed a function, which itself is usually a closure. I suspect that level of dynamism prevents some deeper optimisations.

This is pretty old, so probably somewhat outdated, but I'll look into it later:

http://mrale.ph/blog/2012/09/23/grokking-v8-closures-for-fun.html

JobLeonard avatar Jun 27 '17 17:06 JobLeonard

Ah, indeed, no deopt on any of the functions if I test in devtools manually via copypaste, a primitive 100-rep loop, and console.time. The difference is negligible just the same. FWIW when switching to ES6 Map and Set in lz-string I clearly see the advertised 2x gain.

tophf avatar Jun 27 '17 17:06 tophf

Related to that, I just realised there's another optimisation possible for Base64 and URI-safe compressors: don't use charAt, use charCodeAt. The reason is that looking up integers in hashmaps is much faster than strings. I have some test code below:


// SETUP
var base64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";

var charDict = (function (){
    var dict = {};
    for (var i = 0; i < base64.length; i++){
        dict[base64.charAt(i)] = i;
    }
    return dict;
})();

var charCodeDict = (function () {
    var dict = {};
    for (var i = 0; i < base64.length; i++){
        dict[base64.charCodeAt(i)] = i;
    }
    return dict;
})();

var randomString = (function (){
    var i = 100000, strArray = new Array(i);
    while(i--){
        strArray[i] = base64.charAt((Math.random() * base64.length) | 0); // random index into base64
    }
    return strArray.join('');
})();

var converted = 0;

//BENCHMARKS

// charDict
for (var i = 0; i < randomString.length; i++){
    converted = charDict[randomString.charAt(i)];
}

// charCodeDict
for (var i = 0; i < randomString.length; i++){
    converted = charCodeDict[randomString.charCodeAt(i)];
}

http://jsbench.github.io/#a5c234621b81cd41b26e31d7f92f62d4

On my machines integer key lookups are almost 3x faster.

For Base64 and URI this is a drop-in replacement because charAt is only used as an intermediate dictionary lookup.
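For the Base64 decompressor, for example, the drop-in change could look roughly like this (a sketch with an illustrative getBaseValue helper, not necessarily the library's exact API):

var keyStrBase64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";
var baseReverseDict = {};
for (var i = 0; i < keyStrBase64.length; i++) {
  // integer keys (char codes) instead of single-character string keys
  baseReverseDict[keyStrBase64.charCodeAt(i)] = i;
}

function getBaseValue(compressed, index) {
  return baseReverseDict[compressed.charCodeAt(index)]; // integer lookup
}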

I'll look up the LZW algorithm, read the current implementation of it, and see if there are more places where we can use charCodeAt instead of charAt in this way.

JobLeonard avatar Jun 28 '17 16:06 JobLeonard

@tophf: can you show me what the faster Map and Set code looks like? When I added Map to my previous benchmarks it turned out much slower than even plain object lookup there.

Anyway, small correction: for Base64 and URI decompression this is a drop-in replacement.

So looking into this a bit further, there are two different ways we can apply this change:

Returning an array of charCodes instead of chars

This might be an optimisation, it might not be. Essentially, all we need to know is whether String.fromCharCode.apply(null, charCodeArray) is significantly slower than charArray.join('').

If not, we can apply this optimisation safely, essentially replacing my compressToArray code with compressToCharCodeArray. Also, this would simplify compressToUint8Array even further:

  compressToUint8Array: function (uncompressed) {
    //var compressed = LZString.compressToArray(uncompressed);
    var compressed = LZString.compressToCharCodeArray(uncompressed);
    var buf=new Uint8Array(compressed.length*2); // 2 bytes per character

    for (var i=0, TotalLen=compressed.length; i<TotalLen; i++) {
      //var current_value = compressed[i].charCodeAt(0);
      var current_value = compressed[i];
      buf[i*2] = current_value >>> 8;
      buf[i*2+1] = current_value % 256;
    }
    return buf;
  },

I'll write a benchmark to test this later, and if it does not immediately make me think it's worse I'll implement it, and benchmark it again.
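A quick way to sanity-check that question (crude, but enough for an order-of-magnitude answer; the two arrays below just hold the same data in both forms):

var charCodeArray = [], charArray = [];
for (var i = 0; i < 10000; i++) {
  var code = 65 + (i % 26);
  charCodeArray.push(code);
  charArray.push(String.fromCharCode(code));
}

console.time('fromCharCode.apply');
var viaApply = String.fromCharCode.apply(null, charCodeArray);
console.timeEnd('fromCharCode.apply');

console.time('join');
var viaJoin = charArray.join('');
console.timeEnd('join');

console.log(viaApply === viaJoin); // true - same output either way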

Replace the LZW dictionary with a trie

Oh goodness, if this works like I think it works, this could be big!

This applies to _compress, or _compressToArray in my branch.

So this requires a rudimentary understanding of the actual LZW algorithm. I'm using this pseudocode explanation as a guide:

https://www.cs.duke.edu/csed/curious/compression/lzw.html

Basically, LZW is about building a symbol table of prefixes (a JS object), and replacing those prefixes with smaller values in the output stream (an array of integers that then is converted into chars). In a "skipping the crucial-but-currently-not-relevant details" way:

  • scan through the input string
  • keep track of the current prefix through context_w, which starts as an empty string
  • store the current character as context_c
  • Object.prototype.hasOwnProperty.call(context_dictionary,context_w + context_c)?
    • true
      1. context_w = context_w + context_c
      2. continue
    • false
      1. write the value associated with key context_w to the output array
      2. add context_w + context_c as a new prefix to the symbol table
      3. context_w = context_c
      4. continue

I'm skipping a lot of stuff - there's a reason the _compress function is over 200 lines of code! But the parts I'm skipping aren't affected by this optimisation.

So charAt comes into play at the context_c level. Can we replace it with charCodeAt? Yes, I think we can! To do so, we replace the flat dictionary object with a trie which has nodes that use charCodes instead of chars as keys:

  • scan through the input string
  • keep track of the current prefix through context_node, which starts at the root of the dictionary, initialised as {}
  • store the current charCode as context_c
  • context_node[context_c] !== undefined?
    • true
      1. context_node = context_node[context_c]
      2. continue
    • false
      1. write context_node[-1] to the output array
      2. context_node[context_c] = { val: <new prefix val> }
      3. context_node = root
      4. continue

So each node will just consist of numerical charCode keys, plus one -1 key¹ to store the actual integer value of the prefix.
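To make that node layout concrete, a small sketch (illustrative only, not the final code):

// Each trie node maps charCodes to child nodes; key -1 holds the token value
// of the prefix the node represents. Because all keys are numbers, the
// hasOwnProperty guard can become a plain !== undefined check.
function insertPrefix(parent, charCode, tokenValue) {
  var child = {};
  child[-1] = tokenValue; // token emitted when this prefix is the longest match
  parent[charCode] = child;
  return child;
}

function step(node, charCode) {
  var next = node[charCode];
  return next !== undefined ? next : null; // null means "prefix + charCode" is new
}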

The gain here consists of two things:

  • no string appending, so we should have no unnecessary object allocation.
  • integer keys instead of string keys, which as the benchmarks I linked above demonstrate are much faster.

And it gets better! Right now there are a lot of calls to Object.prototype.hasOwnProperty.call(). I presume this is to avoid the risk of accidentally matching one of the keys of Object.prototype:

[screenshot]

In the new version we'll never use strings for our keys, so there will never be any such collisions! Which means we can safely remove all of these calls and just use object[charCode] !== undefined instead, which, as you can see in the second benchmark I linked at the top, blows everything else out of the water!

¹ If we use a string key we lose all benefits of using only numerical lookups - I tested this in benchmarks. Since charCodeAt is guaranteed to return a number between 0 and 65535, that means we can use either -1 or 65536. Using 65536 makes the hashing very slow, for whatever reason, but -1 has no negative effect.

Anyway, just writing this down took quite a bit of energy, but now I have an idea of what to do. I'll start working on this tomorrow and see if it turns out as good as I hope it will!

JobLeonard avatar Jun 28 '17 20:06 JobLeonard

can you show me what the faster Map and Set code looks like?

https://gist.github.com/tophf/e8962e43efe35233212cf04d8d7cd317

2x speedup compared to the nonminified original version as tested on compressToUTF16/decompressFromUTF16 in modern versions of Chrome. Didn't test other functions nor other browsers. The measurements were performed in real code that compressed a megabyte or so of HTML.

Essentially, all we need to know is whether String.fromCharCode.apply(null, charCodeArray) is significantly slower than charArray.join('').

The stack is still used for the arguments passed via .apply, so anything larger than 32k is a risk depending on the browser. And even that is a risk, so I usually do it in 1k chunks, then join them.
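A common way to do that chunking (chunk size illustrative):

function codesToString(charCodeArray) {
  var CHUNK = 1024, parts = [];
  for (var i = 0; i < charCodeArray.length; i += CHUNK) {
    // .apply never sees more than CHUNK arguments, so the stack stays small
    parts.push(String.fromCharCode.apply(null, charCodeArray.slice(i, i + CHUNK)));
  }
  return parts.join('');
}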

tophf avatar Jun 29 '17 04:06 tophf

Thanks, pretty straightforward.

I guess the reason they're not faster in my benchmarks is because I'm limiting myself to 64 single-character keys. Still, the charCodeAt() optimisations are kind of orthogonal to Map and Set so we could even apply both if necessary!

(I wonder if @pieroxy is on holiday or something, and will come back thinking "I look away for one second and then this happens?")

JobLeonard avatar Jun 29 '17 10:06 JobLeonard

The stack is still used for the arguments passed via .apply so anything larger than 32k is a risk depending on the browser. And even that is a risk so I usually do in 1k chunks, then join them.

Ooof... that sounds like performance would fluctuate wildly among browsers and hardware. I'll focus on the other possible optimisations first.

JobLeonard avatar Jun 29 '17 12:06 JobLeonard

Some more thoughts on that de-opting:

let factory = function(){
    return function(){
        return 0;
    }
};
let a = factory();
let b = factory();
let c = a;

a === b; //false
a === c; //true

Right now we pass new functions on every call to any compressor/decompressor. If we hoist that we can maybe make things easier for the JIT compilers. However, the current set-up requires it, because getCharFromInt and getNextValue close over the passed input or compressed strings:

  //decompress from uint8array (UCS-2 big endian format)
  decompressFromUint8Array:function (compressed) {
    if (compressed===null || compressed===undefined){
        return LZString.decompressFromArray(compressed);
    } else if (compressed.length == 0){
      return null;
    }
    return LZString._decompress(compressed.length, 128, function (index) { return compressed[index]; });
  },

That would require a bit of rewriting too.
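One possible direction (a sketch, not the PR's code): hoist the accessor and pass the data itself through _decompress, so every call site shares one function identity instead of creating a fresh closure per call. The changed signature here is hypothetical:

// works for arrays and typed arrays; a string-based variant would use charCodeAt
function getNextFromIndexable(data, index) {
  return data[index];
}

// hypothetical signature: LZString._decompress(length, resetValue, data, getNextValue)
// return LZString._decompress(compressed.length, 128, compressed, getNextFromIndexable);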

JobLeonard avatar Jun 29 '17 16:06 JobLeonard

First of all thanks for all the work here. I am eager to see the end result :-)

Just as a side note, be careful when you advertise a 2x increase. It may be so in your browser (let's say Chrome) but then the same code might actually be slower in IE or Firefox. When I did a pass at perf optimisations back in the day I probably created 25 jsperf benchmarks and had many surprises on that front.

That said, all this looks promising.

pieroxy avatar Jun 30 '17 06:06 pieroxy

Just as a side note, be careful when you advertise a x2 increase.

Right, I should have been clearer that when I talked about 2x/3x speed-ups I was only referring to the micro-benchmarks of (effectively) string vs integer lookup, which is known to be a relatively recent optimisation in browser engines. It doesn't reflect on the whole algorithm or performance across all browsers.

OTOH, the expected lower memory overhead of removing string concatenation (context_wc = context_w + context_c) from the core algorithm is probably a bigger deal for older browsers, since browsers didn't really optimise their internal string representation until a few years back. So the reduction in memory overhead should matter more there.

At the moment I'm still trying to decipher what you do in the algorithm - specifically lines 137-204. It's barely commented, so I only have your high-level description from the main page to go by:

  • I initialize the dictionary with three tokens:
    • An entry that produces a 16-bit token.
    • An entry that produces an 8-bit token, because most of what I will store is in the iso-latin-1 space, meaning tokens below 256.
    • An entry that marks the end of the stream.
  • The output is processed by a bit stream that stores effectively 16 bits per character in the output string.
  • Each token is stored with just as many bits that are needed according to the size of the dictionary. Hence, the first token takes 2 bits, the second to 7th three bits, etc....

(once it clicks for me I'm going to add a modified version of that description as documentation comments in the code for later maintainers)

JobLeonard avatar Jun 30 '17 10:06 JobLeonard

At least the bugs are funny:

[screenshot]

JobLeonard avatar Jun 30 '17 20:06 JobLeonard

So I figured out a simple way to do in-browser line-by-line performance testing: inline lz-string.js into SpecRunner.html, open the latter, then use the browser dev tools to measure performance over x refreshes. Important: the numbers in the screenshots should only be compared relative to the other numbers in the same screenshot, because I'm not refreshing exactly the same number of times!

First bottleneck I found: -1 and -2 indexing is slow after all:

[screenshot]

Then I realised "Hey, why not just use a +2 offset on charCode for the lookup?"

[screenshot]

Again: I didn't refresh the same number of times, so you can't compare screenshots directly. Now, new_node[0] = dictSize++ might look like it's still pretty high, but I think that's the creation of the bucket for the hashtable, which explains why the line after is so fast in comparison.
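For clarity, the +2 offset idea in miniature (illustrative values): slots 0 and 1 hold node metadata, and children live at charCode + 2, so no key ever collides with them and no slow negative keys are needed.

var node = {};
node[0] = 42;                     // the prefix's token value (e.g. dictSize++)
node[1] = 97;                     // the charCode this node was created for
var charCode = "a".charCodeAt(0); // 97
node[charCode + 2] = {};          // child node for 'a', clear of slots 0 and 1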

So then I tried the latest version vs the old version on JSBench, with the profiler on to measure memory use, and using a tripled 'During tattooing...' test string from the tests. Can you spot the point in the graph where it goes from old to new?

[screenshot]

Anyway, moving on to the _decompressor next, and then once that is done, check if we can simplify the code (I think going from a flat dictionary to a trie might lead to mismatches in "code quirkiness")

JobLeonard avatar Jul 01 '17 10:07 JobLeonard

So just to add to pieroxy's remark that this is complicated, I did some in-between benchmarking on Chrome and Firefox. I'm using all the test strings except Hello World, so:

  • tattoo description (real world text)
  • 1000 random floating point nrs (long string with some repetition)
  • 'aaaaabaaaaacaaaaadaaaaaeaaaaa' (short string with lots of repetition)
  • 'aaaaabaaaaacaaaaadaaaaaeaaaaa' times 1024 (a long string with lots of repetition)
  • all printable UTF16 (represents a worst case: long string, no repetition)

http://jsbench.github.io/#e25c445d987295b0114407e457dde9ad

EDIT: Split out the short string case, because it was distorting the graph by being one to two orders of magnitude faster than the rest http://jsbench.github.io/#4a2ca3e9e3e9ee544f5b76acc7699ef8

On Chrome, the picture is complicated: the new code is slower in quite a few situations, but for long strings with some repetition (1000 random floats) or lots of repetition ('aaa..') it's faster.

On Firefox the picture is a lot rosier, not to mention generally a lot more performant: it's always better, and gets better as strings have more repetition and get longer.

The UTF16 example is actually really revealing, since it is mostly about overhead of filling up a dictionary.

I really hope I can optimise it further so it has either zero regressions in Chrome, or only in insignificant contexts (like really short strings).

[screenshots]

EDIT: For mobile, the improvement is a lot clearer:

Chrome Mobile for Android, everything is faster: [screenshots]

(I'm using an autostitcher to put these screenshots together, hence some of the weirdness)

On Firefox Mobile for Android, short strings are slower, everything else is faster: [screenshot]

JobLeonard avatar Jul 01 '17 11:07 JobLeonard

I hope my long comments don't feel like spam - I code better if I write out my thought processes, plus I think it helps clarify what I'm changing and why. Apologies for turning this into a micro-blog :P

TL;DR: No performance improvements, but I shaved off over half a KB from the minified file

So the previously discussed optimisations don't apply to decompress. In fact, I don't really see any obvious ways to make it more efficient: it already uses arrays and everything (see profiling below).

However, I managed to get rid of a whole lot of switch statements and even a bunch of variables altogether by applying bit twiddling magic. Frustratingly, the benchmarks all have an error margin of 25% to 50% so it's really impossible to tell what effect it has on performance. It shouldn't be any slower - shifts instead of multiplies and such (again, see below).

However, there is quite a bit of prefix/postfix increment/decrement abuse going on in the new code, so I'll add explanations as comments later. It does allow for better minification: comparing the output from JSCompress, the function body goes down from 1304 to 578 characters (in fact, compress is a bit shorter too - if I throw the whole thing through a minifier the code is 4023 characters, down from 4823).

An example of the changes:

    // BEFORE
    switch (next = bits) {
      case 0:
          bits = 0;
          maxpower = Math.pow(2,8);
          power=1;
          while (power!=maxpower) {
            resb = data.val & data.position;
            data.position >>= 1;
            if (data.position == 0) {
              data.position = resetValue;
              data.val = getNextValue(data.index++);
            }
            bits |= (resb>0 ? 1 : 0) * power;
            power <<= 1;
          }
        c = f(bits);
        break;
      case 1:
          bits = 0;
          maxpower = Math.pow(2,16);
          power=1;
          while (power!=maxpower) {
            resb = data.val & data.position;
            data.position >>= 1;
            if (data.position == 0) {
              data.position = resetValue;
              data.val = getNextValue(data.index++);
            }
            bits |= (resb>0 ? 1 : 0) * power;
            power <<= 1;
          }
        c = f(bits);
        break;
      case 2:
        return "";
    }
    dictionary[3] = c;
    w = c;
    result.push(c);
      // AFTER

      // done once at the top of decompress
      resetValue = Math.log2(resetValue)

      // 0 or 1 implies new character token
      if ((bits&1) == bits){
        maxpower = (8+8*bits);
        bits = power=0;
        while (power!=maxpower) {
          //TODO: Annotate this bit of black magic
          //TL;DR: read out reversed bit-stream, chopped up into chunks of
          // resetvalue-size, but which actually has adaptive-sized tokens.
          // The combination of the two is why reading out is so complicated.
          bits += ((data_val >> --data_position)&1) << power++;
          if (data_position == 0) {
            data_position = resetValue;
            data_val = getNextValue(data_index++);
          }
        }
        dictionary[dictSize] = f(bits);
        bits = dictSize++;
        if (--enlargeIn == 0) {
          enlargeIn = Math.pow(2, numBits++);
        }
      } else if (bits == 2){
        // end of stream token
        return result.join('');
      }

Profiling bottlenecks

I did line level profiling in Chromium, using the aforementioned trick of using the spec (Firefox sadly does not have line level profiling yet AFAIK). I used the raw short repetitive string test and the raw long list of random numbers test, to examine two extremes:

aaaaabaaaaacaaaaadaaaaaeaaaaa:

short string decompress bottleneck

1000 random floating point nrs:

long random nr string bottleneck

As you can see, for short strings creating new chars and adding new tokens dominates. For long strings this goes away. In either case, what the profiling suggests is that our main bottlenecks at this point are:

  • additions to the dictionary (unavoidable)
  • for strings with many unique chars compared to length: bits-to-char conversion (unavoidable)
  • branch mispredictions (maybe we can do a little bit about that, but I doubt it)

My thoughts on that last point: the only way to improve this is loop unrolling. However, that is really complicated here, because we're reading in bits from a bit-stream, in chunks of resetValue size, one call to getNextValue() at a time, and then processing those bits by numBits at a time (which grows as we scan in more characters and prefixes).

        while (power!=maxpower) {
          bits += ((data_val >> --data_position)&1) << power++;
          if (data_position == 0) {
            data_position = resetValue;
            data_val = getNextValue(data_index++);
          }
        }

One thing we could try is calling getNextValue() multiple times and shifting it into data_val (at most, getNextValue returns 16 bits worth of data, so we can always shift two values in). But because of these out-of-sync overlapping boundaries of resetValue and numBits this would probably get very complicated (which is probably what messes up the branch predictor in the first place).

JobLeonard avatar Jul 02 '17 15:07 JobLeonard

So I've looked at compress again, also Line Profiling it, using aaaaabaaaaacaaaaadaaaaaeaaaaa as the test string. I'd like to zoom in on one particular part (for completeness, here is the full output)

[screenshots]

Dictionary assignment is the biggest bottleneck (not very surprisingly). Some of that is unavoidable, but we might be able to tweak it. Here's what doesn't work: packing node[0] and node[1] into an array or object:

            // pack it in an array
            new_node[0] = [dictSize++, c];

            // pack it in an object
            new_node[0] = {v: dictSize++, c: c};

I've already tried it (thinking "maybe the problem is that I'm assigning both numbers and objects to the dictionary"), and it's just as slow or slower. So I think we're stuck with that.

That leaves the other dictionary: dictionaryToCreate. I think we can actually get rid of that one altogether and replace it with a boolean if we re-organise the order of the algorithm. Right now, what happens is:

// IN LOOP

  • is dictionary[charCode] defined?
    • false: add it and set dictionaryToCreate[charCode] to true
  • does node[charCode] exist?
    • true: node = node[charCode]
    • false:
      • Is dictionaryToCreate[node[1]] (the charcode in node) defined?
        • yes? write out charCode token to output stream
        • no? write out node[0] to output stream
      • add new node to trie at current node
      • set current node to dictionary[charCode]

// AFTER LOOP, clean up last node:

  • does node[charCode] exist?
    • true: node = node[charCode]
    • false:
      • Is dictionaryToCreate[node[1]] (the charcode in node) defined?
        • true: write out charCode token to output stream
        • false: write out node[0] to output stream
      • add new node to trie at current node
      • set current node to dictionary[charCode]

But I think that should be reorganisable to:

// BEFORE LOOP

  • inputString.length > 0?
    • true:
      • write out new charCode token to output stream
      • add charCode node to dictionary
      • rootNode = true

// IN LOOP, START AT INDEX 1

  • does node[charCode] exist?
    • true: node = node[charCode]
    • false:
      • rootNode?
        • true: rootNode = false (don't insert token - already added charCode token)
        • false: write out node[0] to output stream
      • is dictionary[charCode] defined?
        • false:
          • write out new charCode token to output stream
          • add charCode node to dictionary
          • set rootNode to true
      • add new node to trie at current node, that is: node[charCode] = { 0: dictionary++ }
      • set current node to dictionary[charCode]

Apologies for the weird pseudo-code, I'm trying to make sense of a pretty tricky algorithm.

Aside from replacing dictionaryToCreate with rootNode, we also no longer have to store the charCode in any nodes (I don't really see this have any significant effect IRL, but still).

I'm probably missing some subtleties, but I'll figure it out as I go.
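As a sanity check on the pseudo-code above, here is a skeleton of the reorganised loop (a sketch only: emitNewChar/emitToken stand in for the bit-stream writer, the numBits/enlargeIn bookkeeping and end-of-stream token are left out, and the real code offsets charCode keys so they can't collide with the value slot 0):

function compressSketch(input, emitNewChar, emitToken) {
  if (input.length === 0) return;
  var dictionary = {}, dictSize = 3, rootNode = true;
  var c = input.charCodeAt(0);
  emitNewChar(c);                          // first character is always a new-char token
  var node = dictionary[c] = { 0: dictSize++ };
  for (var i = 1; i < input.length; i++) {
    c = input.charCodeAt(i);
    if (node[c] !== undefined) {
      node = node[c];                      // longest-match walk down the trie
    } else {
      if (rootNode) rootNode = false;      // charCode token was already emitted
      else emitToken(node[0]);             // emit the matched prefix's token
      if (dictionary[c] === undefined) {   // first occurrence of this character
        emitNewChar(c);
        dictionary[c] = { 0: dictSize++ };
        rootNode = true;
      }
      node[c] = { 0: dictSize++ };         // register "prefix + c"
      node = dictionary[c];                // restart matching at this character
    }
  }
  if (!rootNode) emitToken(node[0]);       // flush the final prefix
}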

JobLeonard avatar Jul 02 '17 21:07 JobLeonard

Goodness, that was hard... I mean, about as hard as I expected, but still. In the end the trickiest part was expanding the number of bits at the right pace; I had to write some gnarly console.log code keeping track of the bitstream to decipher that:

[screenshot]

Anyway, here are the new stats, running on my Lenovo T440s, Kubuntu 17.04.

Chrome:

[screenshots]

No matter how hard I try, I can't beat the old short-string performance :/. Same with the all-UTF16 test. However, I hope you'll agree it's worth losing a bit of performance on those extreme ends for the benefits gained elsewhere. The tattooing text (1100 characters) is already faster now, and when we start to talk about long, very repetitive strings, the performance increases get bigger (aaaaabaaaaacaaaaadaaaaaeaaaaa*1024).

Firefox:

[screenshots]

Was already performing better, and the last code-change improved things even further.

MOBILE:

[screenshots]

In all honesty, I don't remember which of these was Firefox and which was Chrome, but given the improvements in performance I don't think it really matters (but note that 81.07 / 7.16 > 11, so for the long repetitive strings, one of them had a speed multiplication that went up to eleven).

Other fun stuff: post-JSCompress the new code is 3895 characters, or 3.8 KiB vs the old 4.6 KiB, so that's about 17% smaller. I don't know if JSCompress is backwards compatible like LZ-string aims to be though.

@pieroxy, I think I've done all that I can realistically do. Could you test if this works (and performs better) in all relevant contexts? The code has gone through very drastic changes, so I'm sincerely worried about having broken it in some edge-cases.

EDIT: For the record, the gets-faster-for-repetitive-strings behaviour is actually super-useful for me, because I'm using this library on sparse data arrays. That is: arrays of (tens of) thousands of numbers, of which 90% are usually zero. So this is going to be a big performance win for me.

JobLeonard avatar Jul 04 '17 13:07 JobLeonard

The benchmarks I used, on JSBench:

http://jsbench.github.io/#e25c445d987295b0114407e457dde9ad

http://jsbench.github.io/#4a2ca3e9e3e9ee544f5b76acc7699ef8

If people could please test this out in their own browsers, that would be very useful.

JobLeonard avatar Jul 04 '17 14:07 JobLeonard

In the infamous words of Columbo: just one more thing...

I think that the loop unrolling I mentioned earlier is not quite as hard as anticipated (although still hard, and it's still dubious whether it gives significant performance benefits). Also, I highly doubt it would help for compression, but for decompression it might.

First, add a variable chunks = (32 / resetValue) | 0 (calculated after resetValue = Math.log2(resetValue)+1). Since bit-shifting works on 32-bit values, this represents the number of chars we can read out in bulk and store in data_val at once. For standard compression (16-bit UTF16) this isn't much: just two characters at once, but for Base64 or URI-safe (six bits), that's five characters at once. That might improve branch prediction.

The only issue is that reading beyond the string length returns NaN. That's easy enough to fix though: NaN | 0 == 0, and any properly encoded string has an end-of-stream character, so those trailing zeros shouldn't matter (this is the part where my code breaks, so my assumptions are probably wrong :P).

Then, whenever we need to read in more charCodes for data_val, we do that in chunks:

//OLD:
while (power != maxpower) {
  // shifting has precedence over bitmasking
  bits += (data_val >> --data_position & 1) << power++;
  if (data_position == 0) {
    data_position = resetValue;
    data_val = getNextValue(data_index++);
  }
}

//NEW
while (power!=maxpower) {
   bits += ((data_val >> --data_position)&1) << power++;
   if (data_position == 0) {
     data_val = 0; // reset, then OR each chunk in (a plain `=` would overwrite the previous chunks)
     for(i = chunks; i--;){
        data_val |= (getNextValue(data_index++) | 0) << (i*resetValue);
     }
     data_position = resetValue * chunks;
   }
}

So we added a for-loop inside the if-statement. Is that an improvement? I don't know. The thing is that with this code, if (data_position == 0) will be false more often, potentially leading to better branch prediction (especially for 6-bit encodings). OTOH, that loop...

However, we can get even nastier with our hack: our encodings are never more than 16 bits, and never less than 6. So chunks will be a value between 2 and 5, and we can use Duff's Device and unroll this loop. Also, we can modify getNextValue to guarantee that out of bounds indices always return 0:

while (power != maxpower) {
  // shifting has precedence over bitmasking
  bits += (data_val >> --data_position & 1) << power++;
  if (data_position == 0) {
    data_val = 0;
    switch(chunks){
      case 5:
        data_val |= getNextValue(data_index++) << (4 * resetValue);
      case 4:
        data_val |= getNextValue(data_index++) << (3 * resetValue);
      case 3:
        data_val |= getNextValue(data_index++) << (2 * resetValue);
      case 2:
      default:
        data_val |= getNextValue(data_index++) << (1 * resetValue);
        data_val |= getNextValue(data_index++);
    }
    data_position = resetValue * chunks;
  }
}

Would this be faster? Again, I don't know. Needs to be tested.

JobLeonard avatar Jul 05 '17 11:07 JobLeonard

Answer after testing: nope, both are slower. So, sticking to the previous version of reading in characters one at a time.

Also, I realised my benchmarks were flawed: I measured both compression and decompression. However, compression is much slower than decompression, so I wasn't properly comparing decompression performance.

TL;DR: the new code is faster for decompression in all of my tests

Note that I'm only testing Linux and Android, so it's not exhaustive! Try benchmarks for yourself here:

JSBenchmarks here: Plain Decompression

Base64

Uint8

Explanation + screenshots from runs (includes error margins)

Chrome: consistently slightly faster except for the "all UTF16 characters" test. Gets better for very repetitive strings, being up to 1.5x faster.

Firefox: consistently 1.5x - 2x faster for every benchmark, and also faster than Chrome.

Mobile

I'm sorry, trying to take screenshots and stitching them together is really turning into a pain... The graphs above are quite telling though.

With Firefox for Android the new code is running laps around the old version.

In Chrome for Android it's a bit more complicated. For plain compression the new code seems a little bit faster, but the error margins are more than 100% so no idea what that's supposed to mean (negative ops per second?). For Base64 and Uint8 however the new code is consistently a lot faster, in some cases five times faster!


Ok, I think I'm now truly done trying to optimise this... I hope there are no nasty edge-cases in my modifications that make it impossible to add this to the main branch.

There is one more thing that I can think of that could be done, but it would break backwards compatibility: don't reverse the bitstream. Right now compress and decompress write/read bitstream tokens one bit at a time, resulting in reversed bits. However, because we know ahead of time how long these tokens will be, we can also just bitmask and shift the whole token at once. This however would remove the reversing (and thus break backwards compatibility, because the output string would be different).

Instead of a for-loop, we'd have a single if-statement that checks whether the token crosses a character boundary in the bit stream (actually, this is not entirely true, because in theory the number of bits in a token can exceed the number of bits per stream character (especially likely for the Base64 case - at six bits per character, the bitstring of a 16-bit UTF16 charCode spans at least three characters, potentially four), so we'd still end up with a kind of while-loop. However, it would shift in the data in chunks instead of 1 bit at a time).
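For the record, reading a whole token at once from such a (hypothetical, backwards-incompatible) non-reversed stream could look something like this sketch, where state is { buffer: 0, bitCount: 0, index: 0 } and resetValue is the number of bits each stream character carries (6 or 16):

function readToken(state, numBits, resetValue, getNextValue) {
  while (state.bitCount < numBits) {       // top up the bit buffer
    state.buffer = (state.buffer << resetValue) | getNextValue(state.index++);
    state.bitCount += resetValue;
  }
  state.bitCount -= numBits;               // consume numBits bits
  return (state.buffer >>> state.bitCount) & ((1 << numBits) - 1);
}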

This could end up being a lot faster, especially with decompression where reading in the bits is one of the bigger bottlenecks, if the above line profiling is to be believed.

But my sick leave ended yesterday so I should finish up this side-project :P

JobLeonard avatar Jul 05 '17 15:07 JobLeonard

Last bit of optimising, I swear! I can quit when I want to!

Anyway, the aaaaabaaaaacaaaaadaaaaaeaaaaa case is still slower (edit: for Chrome), but the gap has been reduced from 20% down to 5%-10%. I also used browserling to test IE10, which also seems very happy with the new version: performance improvements are comparable to Firefox (2x for short strings up to 6x for long repetitive strings).

JobLeonard avatar Jul 06 '17 09:07 JobLeonard

So remember how I said I can't think of any more ways to improve performance? Well, I came up with one possible improvement, but it doesn't seem to work out.

I'll just write it down in case JS engines change their optimisation patterns in the future (who knows, maybe for LZ-string 2.0 somewhere five years down the line, or if someone rewrites the whole function in a compiles-to-WASM language).

The idea is as follows:

  • Generally, arrays are faster than dictionary lookups (even with integer keys)
  • On the other hand, making an array that can fit all UTF16 characters would require a length of 65k, so using an array as a character node would create huge memory overhead that would slow things down.
  • Solution: split up value into upper- and lower- eight bits, and have a mini-tree of nested arrays, each 256 elements large
    • assumption: if other parts of UTF16 are used, these will mostly be similar values, since alphabets are grouped together, meaning we won't have many extra sub-nodes
    • likely downside: we'll be creating two arrays each time we insert a new charCode or prefix, instead of one object. Insertion might be much slower.
  • Additional optimisation: the vast majority of charCodes will be ASCII, meaning < 256. We can inline the first array, putting it in the lower 256 indices, and shift the remaining indices by 256, saving us an array lookup:
    if (charCode < 256){
        node = inlinedArray[charCode];
    } else {
        node = inlinedArray[255 + (charCode>>>8)][charCode&0xFF];
    }

Whether or not this is faster depends on a lot of things, like the time saved on the second array not created most of the time, array lookup speed, the overhead of an extra if/else branch, and of course how much input data will actually be ASCII values.

So after some first testing, this does not seem to produce consistent performance improvements. I basically tested two scenarios for look-up performance: trying all UTF16 charCode indices equally, and one in which I assume ASCII charCodes outnumber other charCodes 9:1 (but each charCode was still accessed at some point). Note that neither scenario is very realistic, because most of the time these trees will be relatively empty and only a few indices will be used. Also note that this doesn't test node creation performance, only look-up performance. For that I have a separate test:

[lookup benchmark screenshot]

[insertion benchmark screenshot] (I stopped testing early when it was clear this was much slower)

In short: array creation is universally slower, and insertion is slower than look-up (so probably the bottleneck to optimise for). Array look-up is faster except in IE, where it's twice as slow. Combined with the fact that insertion is the major bottleneck (as demonstrated in the line-by-line performance testing in Chrome), that makes this optimisation a no-go.

I can of course still implement it and test it, since real-world scenarios are different from these micro-benchmarks, but I don't have much hope.


Slightly different thing: I realised that the only way we can explain the sometimes massive speed-ups in compression speed for repetitive long strings is if the new code somehow improved Big O performance. After thinking about it some more, this is actually pretty obvious: the old data structure created ever-longer string keys representing prefixes. This means performance is at the very least linearly dependent on the length of those keys, either for hashing or for string concatenation. Since the prefix keys only get very long for data that compresses very well, that explains why the new data structure sees the biggest speed-up in that scenario. By comparison, the new data structure only uses integers, meaning it has the same time overhead for every value.

JobLeonard avatar Jul 11 '17 18:07 JobLeonard

So I've been checking whether the trie approach might work for js-search as well. At first it seemed to be slower, until I tried an array-based approach:

  • typically, we'll be dealing with a small alphabet (Latin alphabet is 26 letters, times two for capitalisation, plus punctuation and some special characters)
  • of said alphabet, we are probably only using a subset in our words
  • of said subset, as soon as we're beyond the root of a prefix, we'll have an even smaller subset

With that in mind, a linear search through an array is probably going to be faster than any fancy algorithm, the same way that insertion sort beats everything else for arrays less than twenty items or so, through sheer simplicity and cache friendliness.
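Roughly, the array-based node idea boils down to something like this sketch (illustrative, not the actual branch code):

function makeNode(tokenValue) {
  return { value: tokenValue, codes: [], children: [] }; // parallel arrays
}

function findChild(node, charCode) {
  var i = node.codes.indexOf(charCode); // linear scan - fast for small alphabets
  return i === -1 ? null : node.children[i];
}

function addChild(node, charCode, tokenValue) {
  var child = makeNode(tokenValue);
  node.codes.push(charCode);
  node.children.push(child);
  return child;
}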

I just tried whether the same structure would perform well with compression. ~~It doesn't~~ edit: of course it doesn't if the code has a bug... New results: it does, with a caveat:

http://jsbench.github.io/#a868adf796802a51a4e9add318572c70

I'm not even testing "all UTF16", because from the Jasmine test I already know it basically freezes the tab - it's worst-case behaviour after all, being at least quadratic. Another reason not to use this.

JobLeonard avatar Jul 18 '17 10:07 JobLeonard

Ignore my previous claim: jsbench seems to have lots of overhead messing up the benchmarks due to GC pauses or something, because I just tried again on jsperf and the array approach is much better for all "normal" data:

https://jsperf.com/lz-string-improvements-short-strings

[screenshots]

... however, that worst-case performance (i.e. all UTF16) is bothering me. I'm not really worried about real data: if random numbers still compress faster, I'm sure average performance is better for all normal data people will use LZ-string for. However, that assumes people won't send in bad data on purpose to ruin your day. And I don't know how fast performance degrades - what about compressing Chinese?

So my suggestion is to put this in a separate file, say lz-string-unsafe.js, enforcing that using this version is a conscious choice where the programmer looked at the potential issues. Because there are scenarios where you can guarantee it's safe to use and the benefits will be significant (like in my case, I only compress data we generate ourselves - and it's sparse data so the speed improvements will be significant).

JobLeonard avatar Jul 18 '17 11:07 JobLeonard

@JobLeonard Wouldn't using Map with a polyfill fallback further improve performance?

ronag avatar Jul 26 '17 10:07 ronag

Well, it definitely might, but perhaps it makes more sense to have a separate Map version and let people use polyfills in their build systems. Otherwise you get code duplication.
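For reference, a Map-backed node might look something like this sketch (not actual lz-string code):

function makeNode(tokenValue) {
  return { value: tokenValue, children: new Map() }; // charCode -> child node
}

function getOrCreateChild(node, charCode, nextToken) {
  var child = node.children.get(charCode);
  if (child === undefined) {
    child = makeNode(nextToken);
    node.children.set(charCode, child);
  }
  return child;
}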

JobLeonard avatar Jul 26 '17 10:07 JobLeonard