lz-string icon indicating copy to clipboard operation
lz-string copied to clipboard

Proposal: define an LZString spec + version scheme

Open JobLeonard opened this issue 5 years ago • 3 comments

So there is no official spec at the moment - the JS implementation is the reference implementation.

Perhaps it is worth making a clear spec instead. Here is my proposal for what to consider when making such a specification (not for a spec itself just yet - after all, let's first agree whether we want one or not).

High-Level Overview

An LZString spec 1.0 (LZ1) sould unambiguously describe the current formats for each compression variant. My high-level proposal is that we structure it as:

  1. The compression/decompression algorithm itself, up until the part where tokens are sent to a bitstream, or decompressed from a bitstream.
  2. The different bitstream-to-string and back algorithms, that gives us the many variants that we currently have.
  3. Easily understandable, highly commented reference implementations (as in, not my fast implementation :P)
  4. Unit Tests that fit the spec and edge cases?

Points 1 and 2 basically reflect the decomposition we currently use in the library already. 3 and 4 are good to help others implement their own ports, and probably a good exercise to find any bugs.

The versioning is of the compression format, not the library; changing the library version has no effect on the specification of the format as long as it compresses/decompresses the same.

When To Make A New Spec

The LZString spec 2.0 and beyond (LZ2+) would be for compression formats that have yet to be written, and should also be used for the situation where changes to either core compression algorithm or bitstream-to-string output would break backwards compatibility.

Now to be clear: just like how I think it's cool that the library is pre-ES5 compatible and would love to maintain that just as a kind of badge of honor despite diminishing needs, supporting LZ1 for as long as possible would be a Very Good Thing in this world of constant backwards-compatibility upgrades. (also, that is part of the point of making a spec: now, it is kind of implied that a new version of LZString will still work. But it is not officially guaranteed anywhere that it will with a new library update. With a spec we can explicitly say what is and isn't supported)

There must be really compelling benefits to create an LZ2+ spec, and it should not happen lightly (I might have some ideas, but that's a story for another topic).

I also think that the parts that break should break in a reliable way, even with old libraries. Which leads us to:

The Backwards Compatible Header Hack

If we go so far as to make a new compression scheme that breaks with LZ1, it would be cool if there were some kind of way for LZString libraries to detect which compression version is being used. Even better would be if we would not have to worry about LZ2+ compressed strings being passed to old LZString code. Also, we might want to be able to add more metadata (although I really don't see what kind of metadata that would be TBH).

Well, I think I came up with some possible tricks for all of that! First, note that LZ1 has the following default values for tokens:

0: new 8-bit char code 1: new 16-bit char code 2: termination token

Elsewhere, I suggested that for a hypothetical LZ2+ we re-order that to:

0: termination token 1: 8-bit 2: 16-bit 3+ any other new special tokens we might add

So the two of these can be combined to give us our solution. What I propose is this:

  1. With LZ2+ formats, we re-order the token values as proposed.
  2. With LZ2+ formats, the bitstream will always start with 10, the equivalent to an LZ1 termination token.

The consequence of this is that LZ1 libraries trying to decompress LZ2+ strings just return... an empty string. Not only that, we can even update the current library to thrown an error if (for example) we have a string that is longer than the expected padding, yet starts with a termination token. Even without any LZ2+ formats this is likely desirable behavior, since LZString would never produce such a string - if such a string were passed to the decompressor we would clearly be dealing with invalid input.

The cost would be two bits extra with every LZ2+ bitstream. I think that's acceptable.

Fine-Grained Versioning

The previous trick would be enough to distinguish current compression from any future compression schemes.

If we want more fine-grained versioning in LZ2+ formats, we could add a version bitstring. But just reserving N bytes for that is is wasteful, so I came up with the following posit-inspired scheme for storing an integer of flexible bit-length:

  1. zero or more length bits set as 1, terminated by a zero-bit (similar to how the regime bits work in posits).
  2. said length x 3 number of integer bits

A simple version of this scheme would basically encode chunks of three bits to encode an integer. However, if we just interpreted the integer bits like that, then 0000, 10000000, 110000000000, etc. would all encode zero. That's a waste of bits. A smarter encoding would be:

Bit string Number Conversion
0000 0
0001 1
0010 2
0011 3
0100 4
... ...
0111 7
10000000 8 + 0
10000001 8 + 1 = 9
... ...
10111111 8 + 63 = 71
110000000000 64 + 8 = 72
110000000001 64 + 8 = 73
110111111111 64 + 8 + 511 = 73
1110000000000000 512 + 64 + 8 = 584
... ...
1110111111111111 512 + 64 + 8 + 4095 = 4679

Basically, the length bits can do double duty by encoding an integer offset as the sum of the power of 8:

let offset = 0;
for (let i = 0, power = 8; i < totalLengthBits; i++){
  offset += power
  power *= 8;
}

This is then added to the integer value encoded by the integer bitstream.

This would add a minimum four bits to each bit-stream. Well, until we hit spec version eight, then it's eight. But I doubt we'll increase versionis that much anyway. Similarly, we could even use semantic versioning by using three of these, but I doubt that is necessary.

So what's the use of this? Well, for starters, we currently have the following compression variants:

  • compress
  • compressToUTF16
  • compressToBase64
  • compressToEncodedURIComponent
  • compressToUint8Array

As noted, these basically wrap around the core compression algorithm and give different conversions from bitstream to string.

We could assign each of these a version (0 to 4). Then a single decompress function could detect which of these is used and call the appropriate decompression function, making the library easier to use. Another use would be, again, to signal incompatible changes, or add new compression schemes without breaking old ones.

The cost of this would be a minimum of 4 bits, and possibly more if we have a lot of version changes (which I doubt). It is very unlikely we ever need more than 8 or 12 extra bits though, unless we go version-crazy:

Bits Available Max Value
4 8 - 1 = 7
8 8² + 8 - 1 = 71
12 8³ + 8² + 8 - 1 = 583
16 8⁴ + 8³ + 8² + 8 - 1 = 4679
20 8⁵ + 8⁴ + 8³ + 8² + 8 - 1 = 37447

Remember, this is the versioning of the specification for the compression format, not the version of the library. It shouldn't change very often.

Metadata

So like I said, I don't really know what kind of metadata we might want to add beyond versioning, but here is a trick I came up with: make the metadata a compressed string. "Say wha?"

So far, my proposal for a hypothetical LZ2+ spec "template" is that the bitstream has the following characteristic:

10|####+|########+
A  B     C
  • A: Backwards Compatible Header Hack
  • B: Version Bitstring
  • C: Bitstream of compressed data

Now, note that C always should end with the termination token 0.

What I suggest for compression schemes that need metadata is basically:

10|####+|######+|#####+
A  B     C       D
  • A: Backwards Compatible Header Hack
  • B: Version Bitstring
  • C: Bitstream of compressed metadata
  • D: Bitstream of compressed data

If there is no metadata, this would simply add a single termination token, which would be however many bits we would need for the built-in tokens (currently 2).

I don't really see the point of adding metadata, so this would be more of a neat trick for whatever compression scheme does require metadata. Could be useful for #121 maybe.

Anyway, long post, lots of thoughts and probably some unwanted complexity. But again: the point is to keep LZ1 simple and working while adding possibilities to create new versions without breaking it.

Thoughts?

JobLeonard avatar Oct 19 '18 14:10 JobLeonard

I like it - I feel that there could be 2-3 potential encode nibbles - one for the version, one for the encoding type, and potentially one for the encoding type itself to store settings for itself to allow better decoding (though not sure that it's needed for many / any - maybe a high order bit to say "+ data"?

Rycochet avatar Oct 19 '18 17:10 Rycochet

We could also stick to one version equalling one format, then per format decide if and how to add more metadata, or settings like you said, or it's own sub-versioning scheme.

JobLeonard avatar Oct 19 '18 18:10 JobLeonard

Very good proposal. I suggest to use it as alpha version of the specification.

gloryknight avatar Dec 21 '18 21:12 gloryknight