streamvbyte icon indicating copy to clipboard operation
streamvbyte copied to clipboard

Support for Signed Types Integrated in the CODECs

Open iiSeymour opened this issue 5 years ago • 39 comments

I have an interest in using streamvbyte with signed types (specifically int16) and have created a Python wrapper which supports this however the overhead is quite large as you would imagine https://github.com/iiSeymour/pystreamvbyte/issues/1.

Native support for an efficient zigzag encoding/decoding for int16 and int32 would be great. I imagine this is something that would vectorise well?

iiSeymour avatar May 06 '19 16:05 iiSeymour

Yes. I expect it would vectorize very well.

lemire avatar May 06 '19 16:05 lemire

Zigzag support for int32 is implemented in this branch: here

specifically: zigzag ops decoder encoder

aqrit avatar May 07 '19 20:05 aqrit

@aqrit Is there are clean way to bring this into the main streamvbyte repo? Want to issue a PR? Or do we do something dirtier?

lemire avatar May 07 '19 20:05 lemire

Is there are clean way to bring this into the main streamvbyte repo?

Not really, I touched everything. It currently lacks the ARM routines. My original plan was to implement a AVX2 table-less version then package it up... but I never got around to it.

aqrit avatar May 07 '19 20:05 aqrit

@iiSeymour

What about this then... As a stopgap measure, we add zigzag_decode and zigzag_encode functions to the library with definitions such as these...

https://gist.github.com/lemire/b6437fbd193395d8e4ccac1a5b2e50cc

This is such simple code that the autovectorizer should do a fine job (with GCC using -O3, with clang use -O2).

Then you can call these functions as needed, to convert your signed integers into unsigned integers and back.

It is not going to be as fast as having zigzag encoding integrating in the main functions, but it should be a lot faster than doing the conversion in Python or whatever else.

We would leave the issue open, to be resolved with better code... but you'd have an 'ok' fix meanwhile.

Thoughts?

lemire avatar May 07 '19 21:05 lemire

Sounds like a plan 👍

iiSeymour avatar May 07 '19 21:05 iiSeymour

Implemented in the latest release. Make sure to compile with aggressive optimization flags and it should be quite fast.

lemire avatar May 07 '19 21:05 lemire

Testing some implementations we have internally I've found the following code to perform quite well:

https://gist.github.com/jorj1988/a2f2f14719d46074900dd563dc31a1c7

It does a delta >> zigzag >> expand to 32 bit >> streambyte encode

0x55555555 avatar May 08 '19 10:05 0x55555555

Thanks. That is a useful reference.

lemire avatar May 08 '19 12:05 lemire

With delta zigzag as per #29 I see about a 40% improvement over my Python implementation with int32 arrays and 20% with int16 (the extra cast is expensive).

iiSeymour avatar May 08 '19 13:05 iiSeymour

@jorj1988 thanks, I'll have a go at benchmarking your implementation and see what kind of performance improvements it yields.

@aqrit I would certainly benefit from an AVX2 implementation if you're still up for it.

iiSeymour avatar May 08 '19 16:05 iiSeymour

A look at the performance across types https://github.com/iiSeymour/pystreamvbyte/issues/1#issuecomment-492728720

I had a quick go at modifying the tests/perf utility to reproduce in C.

streamvbyte_encoding streamvbyte_decoding

iiSeymour avatar May 15 '19 18:05 iiSeymour

I understood the 'int32' perf as in "you need to apply zigzag and then compress". However, I am puzzled regarding 'int16'. What do you do exactly?

lemire avatar May 15 '19 19:05 lemire

My primary use case is working with int16 data and that means an expensive copy to cast up in Python before every encode/decode.

iiSeymour avatar May 15 '19 20:05 iiSeymour

For encoding 16-bit words a shuffle table wouldn't even be needed. A simple 8-byte prefix sum could generate the shuffle mask. This is because (at best) only every other byte gets dropped (using the uint32_t "1234" format).

aqrit avatar May 15 '19 23:05 aqrit

I've experimented with a prefix sum for shuffle and didnt see much of a performance increase for my test set. I'd be interested in seeing a more competent implementation of this?

0x55555555 avatar May 16 '19 08:05 0x55555555

I was unable to find any shortcuts for encoding using prefix-sums. My apologies.

Zigzag is overkill for int16, assuming these are the only two states: A.) Two byte literal. B.) One byte sign extend.

aqrit avatar Jun 03 '19 18:06 aqrit

Hmm, I sort of see what you're saying, but I'm having trouble putting together code to implement it - do you have a psuedo code example of what youre suggesting?

0x55555555 avatar Jun 05 '19 19:06 0x55555555

It it is easier to check bit_7 than it is to check bit_0.

m = pcmpgtb 0, v
m = psllw m, 8
v = pxor v, m

I don't think anything is worth doing while they are being piped through the int32 functions. It'd be pretty easy to add real support for int16... but it is too boring.

I've been thinking about doing some avx2 code for shorts... I'd repurpose my "despacer" for the encoder. my current thoughts on a decoder

aqrit avatar Jun 05 '19 20:06 aqrit

thanks - ill take a look

0x55555555 avatar Jun 05 '19 20:06 0x55555555

attempt at uint16 support using SSSE3 https://github.com/aqrit/streamvbyte/blob/master/contrib/short_varint.c (strategies chosen on a whim)

aqrit avatar Jul 21 '19 12:07 aqrit

@aqrit here are some performance results against using uint16 with pystreamvbyte (looking good!).

$ seq 3 8 | xargs -I% ./tests/perf --dtype uint16 --size 1e%
                   streamvbyte_encode                   
encoding time 0.000878 s,     113,941,280 uint16s/sec ( 0.228 GB/s)
decoding time 0.000846 s,     118,175,512 uint16s/sec ( 0.236 GB/s)
Compressed 2,000 bytes to 1,250 bytes (62.50%)
                       encode_16                        
encoding time 0.000617 s,     162,136,568 uint16s/sec ( 0.324 GB/s)
decoding time 0.000710 s,     140,923,275 uint16s/sec ( 0.282 GB/s)
Compressed 2,000 bytes to 1,125 bytes (56.25%)
                   streamvbyte_encode                   
encoding time 0.001599 s,     625,567,691 uint16s/sec ( 1.251 GB/s)
decoding time 0.001213 s,     824,655,188 uint16s/sec ( 1.649 GB/s)
Compressed 20,000 bytes to 12,500 bytes (62.50%)
                       encode_16                        
encoding time 0.000885 s,   1,130,241,092 uint16s/sec ( 2.260 GB/s)
decoding time 0.000759 s,   1,316,818,644 uint16s/sec ( 2.634 GB/s)
Compressed 20,000 bytes to 11,250 bytes (56.25%)
                   streamvbyte_encode                   
encoding time 0.015245 s,     655,947,521 uint16s/sec ( 1.312 GB/s)
decoding time 0.007395 s,   1,352,308,383 uint16s/sec ( 2.705 GB/s)
Compressed 200,000 bytes to 125,000 bytes (62.50%)
                       encode_16                        
encoding time 0.004682 s,   2,135,859,446 uint16s/sec ( 4.272 GB/s)
decoding time 0.002929 s,   3,414,694,145 uint16s/sec ( 6.829 GB/s)
Compressed 200,000 bytes to 112,500 bytes (56.25%)
                   streamvbyte_encode                   
encoding time 0.071367 s,   1,401,202,893 uint16s/sec ( 2.802 GB/s)
decoding time 0.049594 s,   2,016,379,047 uint16s/sec ( 4.033 GB/s)
Compressed 2,000,000 bytes to 1,250,000 bytes (62.50%)
                       encode_16                        
encoding time 0.029260 s,   3,417,658,937 uint16s/sec ( 6.835 GB/s)
decoding time 0.017225 s,   5,805,457,615 uint16s/sec (11.611 GB/s)
Compressed 2,000,000 bytes to 1,125,000 bytes (56.25%)
                   streamvbyte_encode                   
encoding time 3.109820 s,     321,562,067 uint16s/sec ( 0.643 GB/s)
decoding time 2.542501 s,     393,313,440 uint16s/sec ( 0.787 GB/s)
Compressed 20,000,000 bytes to 12,500,000 bytes (62.50%)
                       encode_16                        
encoding time 0.680088 s,   1,470,398,713 uint16s/sec ( 2.941 GB/s)
decoding time 0.281700 s,   3,549,875,880 uint16s/sec ( 7.100 GB/s)
Compressed 20,000,000 bytes to 11,250,000 bytes (56.25%)
                   streamvbyte_encode                   
encoding time 31.072751 s,     321,825,382 uint16s/sec ( 0.644 GB/s)
decoding time 32.966249 s,     303,340,543 uint16s/sec ( 0.607 GB/s)
Compressed 200,000,000 bytes to 125,000,000 bytes (62.50%)
                       encode_16                        
encoding time 8.543753 s,   1,170,445,798 uint16s/sec ( 2.341 GB/s)
decoding time 11.135855 s,     898,000,222 uint16s/sec ( 1.796 GB/s)
Compressed 200,000,000 bytes to 112,500,000 bytes (56.25%)

Array size vs throughput

short_encode

Any further thoughts on int16?

iiSeymour avatar Jul 22 '19 13:07 iiSeymour

Someone want to prepare a PR based on @aqrit 's work?

lemire avatar Jul 22 '19 13:07 lemire

Any further thoughts on int16?

Not sure what you’re looking for...

Here a simple RLE of the HIBYTEs was used. (the extra subtract is used to turn it into RLE0)

uint16 could be the same speed as uint32 if one wanted to use a 4 KiB dictionary. But I like a 64-bit multiply better... For AVX2, I still like shift/add of nibbles to do the prefix sum.

The decoder’s inner loop could be fully unrolled. The decoder’s scalar tail loop should probably be using cmov.

There is a lot of "room for improvement" in the encoder.

Signed integer support is just a couple extra instructions.

ramblings: If one wanted to try for a little better compression ratio... One could move back towards a traditional “VByte”/sql-lite style. psubusb would let us choose any particular partition point (0x80...0xFE) So say, if we knew that all values in a block used less than 13 bits, we could choose a partition of 240. Where values above 240 use 2 bytes and values below use 1 byte. If all 16 bits were needed in a block then we’d put the partition at 128 and pick up all the sign bits as extra 1-bit per word overhead. This strat might deal better with noise than a bit-packer?

aqrit avatar Jul 23 '19 01:07 aqrit

There is a lot of "room for improvement" in the encoder.

Encoding speed is more important than decoding for my use-case. Do you think AVX2 would result in a significant improvement?

Signed integer support is just a couple extra instructions.

Could you push a version? I'm more comfortable in Python and C with intrinsics scare me :smile:

If one wanted to try for a little better compression ratio...

I've found a pass of zstd (level 1) after svb is very effective but quite costly in comparison. For the data I'm primarily interested in the range is actually only 11 bits and with delta encoding ~6.

iiSeymour avatar Jul 23 '19 13:07 iiSeymour

I'll post something this weekend. If optimizing for encoding speed, the pseudo zigzag operation only requires 1 instruction (to encode).

AVX2 The "despacer" step for AVX2 is pretty much the same speed. Maybe gather would do something... on a few of the new(ish) machines?

zstd Using FSE directly may be better?

aqrit avatar Jul 25 '19 13:07 aqrit

If a 16-bit value in the range 0...255 is encoded as 1 byte and everything else is encoded as 2 bytes.

How does one instead encode the the range -128...127 as 1 byte? Answer: just add/subtract 128 to/from each value.

I can't decide how to optimize the encoder... So instead here is an non-optimized version w/delta coding and signed integer support (that may or may not be correct) https://gist.github.com/aqrit/ef84d284cebe861d9e57db4129bcafc3

aqrit avatar Jul 28 '19 23:07 aqrit

@aqrit I think I am missing something. I can see the delta coding but I'm confused on the signed integer support as the functions still expect uint16_t. Is the idea that calling function will shift the values by whatever the range calls for?

Anyway, tests seem to pass and here are some performance numbers -

$ seq 3 8 | xargs -I% ./tests/perf --dtype uint16 --size 1e%
                   streamvbyte_encode                   
encoding time 0.000732 s,     136,590,931 uint16s/sec ( 0.273 GB/s)
decoding time 0.000711 s,     140,561,579 uint16s/sec ( 0.281 GB/s)
Compressed 2,000 bytes to 1,250 bytes (62.50%)
                       encode_16_delta                        
encoding time 0.000477 s,     209,727,181 uint16s/sec ( 0.419 GB/s)
decoding time 0.000554 s,     180,504,078 uint16s/sec ( 0.361 GB/s)
Compressed 2,000 bytes to 1,343 bytes (67.15%)
                   streamvbyte_encode                   
encoding time 0.001404 s,     712,167,519 uint16s/sec ( 1.424 GB/s)
decoding time 0.001162 s,     860,649,607 uint16s/sec ( 1.721 GB/s)
Compressed 20,000 bytes to 12,500 bytes (62.50%)
                       encode_16_delta                        
encoding time 0.000861 s,   1,161,541,277 uint16s/sec ( 2.323 GB/s)
decoding time 0.000940 s,   1,064,149,052 uint16s/sec ( 2.128 GB/s)
Compressed 20,000 bytes to 13,396 bytes (66.98%)
                   streamvbyte_encode                   
encoding time 0.014632 s,     683,416,284 uint16s/sec ( 1.367 GB/s)
decoding time 0.006711 s,   1,489,996,555 uint16s/sec ( 2.980 GB/s)
Compressed 200,000 bytes to 125,000 bytes (62.50%)
                       encode_16_delta                        
encoding time 0.004137 s,   2,417,190,092 uint16s/sec ( 4.834 GB/s)
decoding time 0.004507 s,   2,218,526,123 uint16s/sec ( 4.437 GB/s)
Compressed 200,000 bytes to 134,093 bytes (67.05%)
                   streamvbyte_encode                   
encoding time 0.073112 s,   1,367,762,691 uint16s/sec ( 2.736 GB/s)
decoding time 0.050156 s,   1,993,780,085 uint16s/sec ( 3.988 GB/s)
Compressed 2,000,000 bytes to 1,250,000 bytes (62.50%)
                       encode_16_delta                        
encoding time 0.032616 s,   3,065,940,498 uint16s/sec ( 6.132 GB/s)
decoding time 0.034485 s,   2,899,804,526 uint16s/sec ( 5.800 GB/s)
Compressed 2,000,000 bytes to 1,343,839 bytes (67.19%)
                   streamvbyte_encode                   
encoding time 3.970338 s,     251,867,737 uint16s/sec ( 0.504 GB/s)
decoding time 3.336252 s,     299,737,591 uint16s/sec ( 0.599 GB/s)
Compressed 20,000,000 bytes to 12,500,000 bytes (62.50%)
                       encode_16_delta                        
encoding time 1.036077 s,     965,179,624 uint16s/sec ( 1.930 GB/s)
decoding time 0.392249 s,   2,549,402,805 uint16s/sec ( 5.099 GB/s)
Compressed 20,000,000 bytes to 13,427,945 bytes (67.14%)
                   streamvbyte_encode                   
encoding time 34.064612 s,     293,559,780 uint16s/sec ( 0.587 GB/s)
decoding time 36.469576 s,     274,201,159 uint16s/sec ( 0.548 GB/s)
Compressed 200,000,000 bytes to 125,000,000 bytes (62.50%)
                       encode_16_delta                        
encoding time 10.684904 s,     935,899,845 uint16s/sec ( 1.872 GB/s)
decoding time 13.823414 s,     723,410,296 uint16s/sec ( 1.447 GB/s)
Compressed 200,000,000 bytes to 134,275,973 bytes (67.14%)

iiSeymour avatar Jul 29 '19 14:07 iiSeymour

It is the exact same operation for signed or unsigned numbers (AFAIK).

If the function signature was changed to int16_t.... Then I'd probably cast it back to uint16_t inside the function. (because bit-twiddling is done on unsigned numbers in C)

Do not try to opcode the sign. That is impossible. Instead, only try to realize the truth: there is no sign. Then you'll see it is not the sign-type that differentiates, it is only yourself.

aqrit avatar Jul 29 '19 20:07 aqrit

Using FSE directly may be better?

I spent some time looking at fse and it seems streamvbyte + zstd is hard to beat (encoding).

Time (s) Compression (%)
zstd 1.002 63.98
fse 0.690 64.81
svb + fse 0.632 41.53
svb + zstd 0.431 37.00
svb 0.241 63.53

Using your latest short_enc yields impressive results.

Time (s) Compression (%)
svb16 + zstd 0.242 37.13
svb16 0.048 56.28

iiSeymour avatar Aug 01 '19 15:08 iiSeymour