squashfs-tools-ng
Performance improvements
A lot of code was written with a "keep it simple, make it work, worry about performance/efficiency later (but keep it in mind and design the data structures to be flexible & opaque enough)" mentality.
As a result, some areas could stand improvement:
- The block writer also takes care of deduplication, using a linear search over an append-only array of block sizes & hashes. This could be improved by organizing the on-disk block list as a trie/prefix tree.
- ~The xattr writer still uses the simplistic, separate-chaining hash table I wrote for a university assignment when I was 20. This has already become a bottleneck and has been replaced in the fragment code path (see #40).~
- ~The xattr writer does a linear search over all xattr blocks for deduplication. This is already a problem (see #68) and could use either a hash table or a tree. As a workaround, a manual flag to disable deduplication in known problematic cases may also do.~
- ~The xattr writer currently only does full matches (plus `qsort` to catch arbitrary permutations), but could also do subset matches. E.g. if we have a lot of xattr kv-pairs in the shape ABC, ABD, ABE, ABF, ... which causes the problems noted above, they are currently stored as "ABCABDABEABF..." with references from the ID table into the list, but could be compacted to e.g. "DABCEABF..." with an ID table that points to overlapping blocks. Note however, that~ this is essentially the NP-complete set packing problem.
- Multi threaded decompression (#53). Some initial investigation shows that an LRU fragment block cache in the data reader can shave off some ~10% from unpacking an image (a minimal sketch follows this list). A random-access benchmark would be reasonable.
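To make the last point a bit more concrete, here is a minimal sketch of what an LRU fragment block cache in the data reader could look like. All names (`frag_cache_t`, `CACHE_SLOTS`, ...) are made up for illustration and are not actual squashfs-tools-ng structures; eviction uses a logical timestamp rather than a linked list, which is plenty for a handful of slots.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_SLOTS 16

typedef struct {
	uint32_t frag_index;   /* which fragment block is cached */
	uint64_t last_used;    /* logical timestamp for LRU eviction */
	uint8_t *data;         /* uncompressed fragment block */
	size_t size;
	int valid;
} frag_slot_t;

typedef struct {
	frag_slot_t slots[CACHE_SLOTS];
	uint64_t tick;
} frag_cache_t;                /* zero-initialize before use */

/* Return the cached block or NULL; bumps the LRU timestamp on a hit. */
static const uint8_t *cache_lookup(frag_cache_t *c, uint32_t frag_index,
                                   size_t *size_out)
{
	for (size_t i = 0; i < CACHE_SLOTS; ++i) {
		if (c->slots[i].valid && c->slots[i].frag_index == frag_index) {
			c->slots[i].last_used = ++c->tick;
			*size_out = c->slots[i].size;
			return c->slots[i].data;
		}
	}
	return NULL;
}

/* Insert an uncompressed block, evicting the least recently used slot.
 * Takes ownership of the heap-allocated data buffer. */
static void cache_insert(frag_cache_t *c, uint32_t frag_index,
                         uint8_t *data, size_t size)
{
	size_t victim = 0;

	for (size_t i = 1; i < CACHE_SLOTS; ++i) {
		if (!c->slots[i].valid) {
			victim = i;
			break;
		}
		if (c->slots[i].last_used < c->slots[victim].last_used)
			victim = i;
	}

	if (c->slots[victim].valid)
		free(c->slots[victim].data);

	c->slots[victim] = (frag_slot_t){
		.frag_index = frag_index,
		.last_used = ++c->tick,
		.data = data,
		.size = size,
		.valid = 1,
	};
}
```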
> […] using a linear search over an append-only array of block sizes & hashes. This could be improved by organizing the on-disk block list as a trie/prefix tree.
This one is particularly bad for a use case I'm investigating right now, with large archives. The `gensquashfs` run gets slower and slower as the archive proceeds, until it's almost at a crawl towards the end. A trie would be ideal of course, but even a hash map from hash to the first block with that hash, plus a linked list of the blocks sharing that hash, could work.
In fairness, I now realise it could be other areas of the code; just from reading the source, I figured this could explain the apparently O(n²) feel of the slowdown.
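To illustrate the suggestion, here is a hedged sketch of such a chained index over the existing append-only array: a bucket array maps a block hash to the first array index in that bucket, and a parallel `next` array links further blocks sharing the bucket, so a dedup lookup only walks a short chain instead of the whole array. All names (`blk_rec_t`, `blk_index_t`, ...) are hypothetical, not the actual squashfs-tools-ng data structures.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative stand-in for the on-disk block records: size + content hash. */
typedef struct {
	uint32_t size;
	uint64_t hash;
} blk_rec_t;

#define NUM_BUCKETS 4096u   /* power of two, so we can mask instead of mod */
#define NIL UINT32_MAX

typedef struct {
	blk_rec_t *blocks;              /* append-only block array */
	uint32_t *next;                 /* next[i] = next block in the same bucket */
	uint32_t used;
	uint32_t head[NUM_BUCKETS];     /* first block index per bucket, or NIL */
} blk_index_t;

/* Error checking and growing the arrays are omitted for brevity. */
static void blk_index_init(blk_index_t *idx, uint32_t capacity)
{
	idx->blocks = calloc(capacity, sizeof(*idx->blocks));
	idx->next = calloc(capacity, sizeof(*idx->next));
	idx->used = 0;
	for (uint32_t b = 0; b < NUM_BUCKETS; ++b)
		idx->head[b] = NIL;
}

/* Find an existing block with the same size & hash, or NIL if none.
 * Only the short per-bucket chain is walked, not the whole array. */
static uint32_t blk_index_find(const blk_index_t *idx, uint32_t size,
                               uint64_t hash)
{
	uint32_t i = idx->head[hash & (NUM_BUCKETS - 1)];

	while (i != NIL) {
		if (idx->blocks[i].size == size && idx->blocks[i].hash == hash)
			return i;
		i = idx->next[i];
	}
	return NIL;
}

/* Append a new block record and hook it into its bucket chain. Chains end up
 * newest-first; walk the whole chain and keep the smallest match if the
 * earliest block is preferred. */
static void blk_index_append(blk_index_t *idx, uint32_t size, uint64_t hash)
{
	uint32_t i = idx->used++;
	uint32_t bucket = hash & (NUM_BUCKETS - 1);

	idx->blocks[i] = (blk_rec_t){ .size = size, .hash = hash };
	idx->next[i] = idx->head[bucket];   /* prepend to the chain */
	idx->head[bucket] = i;
}
```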
I've started looking into the block processor performance over the weekend.
The approach I'm currently looking into is based on a hash table that maps the block checksums to entries in the `blk_info_t` array. Then, during de-duplication, the key-equals function is swapped out to compare the entire cluster and its associated data on a checksum match, remembering the candidate with the lowest index.
I'm still ironing out some bugs at the moment, as well as cobbling together a benchmark. Whether and how much it improves performance remains to be seen.
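Roughly how I read that approach, as a hedged sketch rather than the actual code (the real `blk_info_t` and hash table look different): the table is built with a cheap checksum/size comparator, and the dedup path swaps in a comparator that also verifies the full cluster data and tracks the lowest matching index via a small context struct. A lookup that wants the lowest index keeps walking the bucket chain instead of stopping at the first match.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only. */
typedef struct {
	uint32_t checksum;
	uint32_t size;
	const uint8_t *data;    /* pointer to the block/cluster payload */
} blk_info_t;

typedef struct {
	const blk_info_t *blocks;   /* the blk_info_t array the table indexes into */
	size_t best;                /* lowest matching index seen so far,
	                               initialised to SIZE_MAX by the caller */
} dedup_ctx_t;

/* Key-equals callback type: decides whether table entry `idx` matches the
 * candidate block; a user pointer carries extra state. */
typedef bool (*key_equals_fn)(void *user, size_t idx, const blk_info_t *key);

/* Cheap comparator used while building the table: checksum + size only. */
static bool key_equals_checksum(void *user, size_t idx, const blk_info_t *key)
{
	const blk_info_t *blocks = user;

	return blocks[idx].checksum == key->checksum &&
	       blocks[idx].size == key->size;
}

/* Dedup comparator: on a checksum match, verify the entire cluster data
 * and remember the candidate with the lowest index. */
static bool key_equals_full(void *user, size_t idx, const blk_info_t *key)
{
	dedup_ctx_t *ctx = user;
	const blk_info_t *cand = &ctx->blocks[idx];

	if (cand->checksum != key->checksum || cand->size != key->size)
		return false;

	if (memcmp(cand->data, key->data, key->size) != 0)
		return false;

	if (idx < ctx->best)
		ctx->best = idx;
	return true;
}
```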
FWIW, the next release of xz (5.4.0, currently in alpha and planned for early December or so) supports parallel decompression:
- https://git.tukaani.org/?p=xz.git;a=commit;h=4cce3e27f529af33e0e7749a8cbcec59954946b5
- https://git.tukaani.org/?p=xz.git;a=commit;h=6c6da57ae2aa962aabde6892442227063d87e88c
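For reference, a minimal sketch of initialising liblzma's threaded decoder (added in 5.4.0) via `lzma_stream_decoder_mt()`. This is not how squashfs-tools-ng would necessarily wire it up; the thread count and memory limits are arbitrary example values, and error handling is trimmed.

```c
#include <lzma.h>
#include <stdint.h>
#include <stdio.h>

/* Set up a multi-threaded .xz stream decoder (requires liblzma >= 5.4.0). */
static int init_threaded_xz_decoder(lzma_stream *strm)
{
	lzma_mt mt = {
		.flags = LZMA_CONCATENATED,      /* handle concatenated .xz streams */
		.threads = 4,                    /* number of worker threads */
		.timeout = 0,                    /* no flush timeout */
		.memlimit_threading = 256 << 20, /* soft limit: fall back to 1 thread above this */
		.memlimit_stop = UINT64_MAX,     /* hard limit: never refuse to decode */
	};
	lzma_ret ret;

	*strm = (lzma_stream)LZMA_STREAM_INIT;
	ret = lzma_stream_decoder_mt(strm, &mt);
	if (ret != LZMA_OK) {
		fprintf(stderr, "lzma_stream_decoder_mt() failed: %d\n", (int)ret);
		return -1;
	}
	return 0;   /* feed data via strm->next_in/next_out and lzma_code() */
}
```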