Sparse array as compressor dictionary

Open andrew-aladev opened this issue 6 years ago • 3 comments

Hello. In 2019 we have a great amount of RAM available, so we can reach the maximum possible performance for the LZW (with clear) algorithm.

The idea is simple: we can use a sparse array instead of a double-hashing array as the dictionary. Imagine a big array mapping (code << 8) | symbol => next_code, where symbol is between 0 and 255, code is between 0 and (2 ** 16) - 1, and next_code is between 257 and (2 ** 16) - 1. Such an array requires about 33.5 MB of RAM.
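A minimal sketch of that layout (illustrative names, not the actual implementation): one 16-bit slot per (code, symbol) pair, so 2^16 * 2^8 * 2 bytes, which is roughly the 33.5 MB mentioned above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sparse-array dictionary sketch: next[(code << 8) | symbol] = next_code.
 * 2^16 codes * 2^8 symbols * 2 bytes per slot = ~33.5 MB. */
#define DICT_SLOTS ((1u << 16) * (1u << 8))
#define NO_CODE 0 /* 0 never collides with next_code, which starts at 257 */

typedef struct {
    uint16_t *next;
} sparse_dict;

static int sparse_dict_init(sparse_dict *d) {
    d->next = calloc(DICT_SLOTS, sizeof(uint16_t));
    return d->next != NULL ? 0 : -1;
}

/* O(1) find: returns next_code, or NO_CODE if the pair is absent. */
static uint16_t sparse_dict_find(const sparse_dict *d, uint16_t code, uint8_t symbol) {
    return d->next[((uint32_t)code << 8) | symbol];
}

/* O(1) insert. */
static void sparse_dict_insert(sparse_dict *d, uint16_t code, uint8_t symbol, uint16_t next_code) {
    d->next[((uint32_t)code << 8) | symbol] = next_code;
}
```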

The problem is that we have to clear the sparse array. For example, we have to clear the dictionary 2040 times while compressing the 850 MB linux-4.20.3.tar; 33.5 MB * 2040 is roughly 68 GB of memory traffic. I've solved this issue by collecting the used sparse-array indexes in a separate array and clearing just those indexes.
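A sketch of that selective clearing, built on the illustrative structure above: every insert records its index, and a clear resets only the recorded slots instead of the whole 33.5 MB array.

```c
/* Selective clearing sketch: remember which slots were written so a
 * dictionary clear only touches those, not all 2^24 entries. */
typedef struct {
    uint16_t *next;       /* the sparse array from the sketch above */
    uint32_t *used;       /* indexes written since the last clear;
                             at most 2^16 - 257 inserts can happen between clears */
    uint32_t  used_count;
} tracked_dict;

static void tracked_insert(tracked_dict *d, uint16_t code, uint8_t symbol, uint16_t next_code) {
    uint32_t index = ((uint32_t)code << 8) | symbol;
    d->next[index] = next_code;
    d->used[d->used_count++] = index;
}

static void tracked_clear(tracked_dict *d) {
    for (uint32_t i = 0; i < d->used_count; i++)
        d->next[d->used[i]] = 0;  /* reset only the slots actually used */
    d->used_count = 0;
}
```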

The complexity of insert and find is still O(1). You can find the docs here; the implementation is here.

I am fairly sure that ncompress won't accept such a huge memory eater. I just want to inform people. Thank you.

andrew-aladev • Jan 17 '19

stepping back, the compress format is pretty uncommon nowadays. it shows up in a lot of older files, but new files are compressed using something like XZ when people care about speed/size. this is 2019 after all, and there are many free/open source compression algorithms out there to choose from that are way faster & smaller than LZW :).

so i have a hard time justifying significant changes to the ncompress package when, in practice, people simply aren't using compress anymore, nor should they.

that isn't to say that collecting thoughts and feedback from people such as yourself isn't useful. thanks for posting your findings/research and code for others to utilize.

vapier • Jan 19 '19

LZW compress is old. The problem is that it was used all over the world for creating smth.tar.Z archives and for content served by ancient web servers. Today nobody will try to use LZW, and that is fine, but we still want to support ancient applications. The problem is that application/x-compress has undocumented pitfalls. We will reverse engineer it and restore the ability to support ancient software.

andrew-aladev • Jan 20 '19

I've experimented a bit with this approach, but it doesn't seem to be any faster than the hash table. Using ch << 16 | code as the index I get around 42 ms compressing a 1080p binary PPM file. With the hash table I get 43 ms.

Using perf I can see that the vast majority of time is spent stalled waiting for memory to arrive from the sparse array, since the array is too big to fit into the L1 or L2 cache.

Also interestingly, when I use code << 8 | ch to index into the array, performance gets MUCH worse: 110 ms. I think this is because when ch is in the high bits of the index, the index tends to stay closer to recently accessed indices and thus gets better cache behavior, as in the sketch below.
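For reference, a sketch of the two index layouts being compared (names are illustrative; ch is the current input byte, code the current prefix code, as in the comments above):

```c
#include <stdint.h>

/* ch in the high bits: incrementing code moves to an adjacent slot, so lookups
 * with recently assigned codes stay within a few cache lines (~42 ms above). */
static inline uint32_t index_ch_high(uint8_t ch, uint16_t code) {
    return ((uint32_t)ch << 16) | code;
}

/* code in the high bits: each new code jumps 256 slots away, spreading
 * accesses across far more cache lines (~110 ms above). */
static inline uint32_t index_code_high(uint8_t ch, uint16_t code) {
    return ((uint32_t)code << 8) | ch;
}
```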

TL;DR: sparse array doesn't seem worthwhile to use over a hash-table.

N-R-K • Apr 22 '24