
Performance improvements

Open AgentD opened this issue 4 years ago • 4 comments

A lot of the code was written with a "keep it simple, make it work, worry about performance/efficiency later (but keep it in mind and design the data structures to be flexible & opaque enough)" mentality.

As a result some areas could stand improvement:

  • The block writer also handles deduplication, using a linear search in an append-only array of block size & hash. This could be improved by organizing the on-disk block list as a trie/prefix tree (a rough sketch of a hash-table based alternative follows this list).
  • ~The xattr writer still uses the simplistic, separate-chaining hash table I wrote for a university assignment when I was 20. This has already become a bottleneck and has been replaced in the fragment code path (see #40).~
  • ~The xattr writer does a linear search over all xattr blocks for deduplication. This is already a problem (see #68). It could use either a hash table or a tree. As a workaround, a manual flag to disable deduplication in known problematic cases may also help.~
  • ~The xattr writer currently only does full matches (plus qsort to catch arbitrary permutations), but could also do subset matches. E.g. if we have a lot of xattr kv-pairs in the shape ABC, ABD, ABE, ABF, ... which causes the problems noted above, they are currently stored as "ABCABDABEABF..." with references from the ID table into the list, but could be compacted to e.g. "DABCEABF..." with an ID table that points to overlapping blocks. Note, however, that~ this is essentially the NP-complete set packing problem.
  • Multi threaded decompression (#53).
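To make the deduplication point concrete, here is a minimal sketch of a hash-table based lookup replacing the linear scan, along the lines of what the comments below suggest. All names (`dedup_index_t`, `blk_ref_t`, `dedup_lookup`) are hypothetical and do not correspond to the actual block writer internals; a real implementation would still byte-compare candidate blocks to rule out checksum collisions.

```c
/* Hypothetical sketch only: a deduplication index keyed by block checksum,
 * replacing the linear scan over (size, hash) pairs. Names do not match
 * the squashfs-tools-ng internals. */
#include <stdint.h>
#include <stdlib.h>

#define DEDUP_BUCKETS 4096u

typedef struct blk_ref_t {
	struct blk_ref_t *next;   /* chain of blocks with the same checksum */
	uint64_t index;           /* position in the on-disk block list */
	uint32_t size;
	uint32_t checksum;
} blk_ref_t;

typedef struct {
	blk_ref_t *buckets[DEDUP_BUCKETS];
} dedup_index_t;

/* Return a previously seen block with the same size and checksum, or NULL.
 * The caller would then byte-compare the data to rule out collisions. */
static const blk_ref_t *dedup_lookup(const dedup_index_t *idx,
				     uint32_t size, uint32_t checksum)
{
	const blk_ref_t *it = idx->buckets[checksum % DEDUP_BUCKETS];

	while (it != NULL) {
		if (it->size == size && it->checksum == checksum)
			return it;
		it = it->next;
	}
	return NULL;
}

/* Record a newly written block so later blocks can deduplicate against it. */
static int dedup_insert(dedup_index_t *idx, uint64_t index,
			uint32_t size, uint32_t checksum)
{
	blk_ref_t *ref = calloc(1, sizeof(*ref));

	if (ref == NULL)
		return -1;

	ref->index = index;
	ref->size = size;
	ref->checksum = checksum;
	ref->next = idx->buckets[checksum % DEDUP_BUCKETS];
	idx->buckets[checksum % DEDUP_BUCKETS] = ref;
	return 0;
}
```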

AgentD avatar Oct 16 '20 16:10 AgentD

Some initial investigation shows that an LRU fragment block cache in the data reader can shave roughly 10% off the time needed to unpack an image. A random-access benchmark would be reasonable.
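As an illustration of the idea, a minimal LRU cache sketch with a fixed handful of slots; the names (`frag_cache_t`, `frag_cache_get`, `frag_cache_put`) are hypothetical, and the real data reader would key on the fragment block's position in the image:

```c
/* Hypothetical sketch of a tiny LRU cache for decompressed fragment blocks.
 * Names and sizes are made up for illustration. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FRAG_CACHE_SLOTS 8

typedef struct {
	uint64_t location;   /* on-disk start of the fragment block */
	uint64_t last_used;  /* logical timestamp for LRU eviction */
	uint8_t *data;       /* decompressed fragment block, NULL = empty slot */
	size_t size;
} frag_slot_t;

typedef struct {
	frag_slot_t slots[FRAG_CACHE_SLOTS];
	uint64_t tick;
} frag_cache_t;

/* Return cached data for a fragment block, or NULL on a miss. */
static const uint8_t *frag_cache_get(frag_cache_t *c, uint64_t location,
				     size_t *size_out)
{
	for (int i = 0; i < FRAG_CACHE_SLOTS; ++i) {
		if (c->slots[i].data != NULL && c->slots[i].location == location) {
			c->slots[i].last_used = ++c->tick;
			*size_out = c->slots[i].size;
			return c->slots[i].data;
		}
	}
	return NULL;
}

/* Store a decompressed fragment block, evicting the least recently used slot. */
static void frag_cache_put(frag_cache_t *c, uint64_t location,
			   const uint8_t *data, size_t size)
{
	int victim = 0;
	uint8_t *copy;

	for (int i = 1; i < FRAG_CACHE_SLOTS; ++i) {
		if (c->slots[i].last_used < c->slots[victim].last_used)
			victim = i;
	}

	copy = malloc(size);
	if (copy == NULL)
		return;

	memcpy(copy, data, size);
	free(c->slots[victim].data);
	c->slots[victim].data = copy;
	c->slots[victim].location = location;
	c->slots[victim].size = size;
	c->slots[victim].last_used = ++c->tick;
}
```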

AgentD avatar Jul 19 '21 09:07 AgentD

> using a linear search in an append-only array of block size & hash. This could be improved by organizing the on-disk block list as a trie/prefix tree.

This one is particularly bad for a use case I'm investigating right now, with large archives. gensquashfs gets slower and slower as the archive proceeds, until it is almost at a crawl towards the end. A trie would be ideal, of course, but even a hash map from hash to the first block with that hash, plus a linked list of blocks sharing the same hash, could work.

mattgodbolt avatar Sep 14 '22 22:09 mattgodbolt

In fairness, I now realise it could be other areas of the code. Just from reading the source, I figured this could explain the apparent O(n²) feel of the slowdown.

mattgodbolt avatar Sep 14 '22 22:09 mattgodbolt

I started looking into block processor performance over the weekend.

The approach I'm currently looking at uses a hash table that maps block checksums to entries in the blk_info_t array. During de-duplication, the key-equals function is swapped out for one that compares the entire cluster and its associated data on a checksum match, remembering the candidate with the lowest index.

I'm still ironing out some bugs at the moment, as well as cobbling together a benchmark. How/if it improves performance remains to be seen.
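For illustration, a rough sketch of the lookup side of that approach (hypothetical names, not the actual block processor code): the table maps a block's checksum to candidate indices into a `blk_info_t` array, and a swappable equality callback performs the full comparison while the search remembers the lowest matching index.

```c
/* Rough illustration only: checksum -> index hash table with a pluggable
 * "key equals" callback that confirms a full match. Names are made up. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
	uint32_t checksum;
	uint32_t size;
	uint64_t offset;
} blk_info_t;

typedef struct chain_t {
	struct chain_t *next;
	size_t index;            /* index into the blk_info array */
} chain_t;

/* Callback that compares the whole cluster starting at "candidate" against
 * the one starting at "current" (count blocks plus associated data). */
typedef bool (*key_equals_fn)(const blk_info_t *blocks,
			      size_t candidate, size_t current,
			      size_t count);

/* Walk all candidates whose first block has a matching checksum and return
 * the lowest index for which the callback confirms a full match,
 * or SIZE_MAX if there is none. */
static size_t find_duplicate(chain_t *const *buckets, size_t bucket_count,
			     const blk_info_t *blocks, size_t current,
			     size_t count, key_equals_fn equals)
{
	uint32_t chk = blocks[current].checksum;
	size_t best = SIZE_MAX;

	for (const chain_t *it = buckets[chk % bucket_count];
	     it != NULL; it = it->next) {
		if (blocks[it->index].checksum != chk)
			continue;
		if (equals(blocks, it->index, current, count) && it->index < best)
			best = it->index;
	}
	return best;
}
```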

AgentD avatar Sep 20 '22 10:09 AgentD

FWIW, the next release of xz (5.4.0, it's in alpha right now, planned for early December or so) supports parallel decompression:

  • https://git.tukaani.org/?p=xz.git;a=commit;h=4cce3e27f529af33e0e7749a8cbcec59954946b5
  • https://git.tukaani.org/?p=xz.git;a=commit;h=6c6da57ae2aa962aabde6892442227063d87e88c
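For reference, a hedged sketch of how the new threaded decoder is initialized via liblzma's `lzma_stream_decoder_mt()`; the memory-limit and thread-count choices below are illustrative defaults picked for this example, not recommendations from the xz documentation:

```c
/* Hedged sketch: initialize liblzma's multi-threaded .xz decoder
 * (lzma_stream_decoder_mt(), added in xz 5.4.0). Illustration only,
 * not squashfs-tools-ng code; the limits chosen here are arbitrary. */
#include <lzma.h>
#include <stdint.h>
#include <stdio.h>

static int init_threaded_decoder(lzma_stream *strm, uint32_t threads)
{
	uint64_t phys = lzma_physmem();

	lzma_mt mt = {
		.flags = LZMA_CONCATENATED,
		.threads = threads,
		.timeout = 0,
		/* soft limit: scale down the thread count instead of failing */
		.memlimit_threading = phys ? phys / 4 : (UINT64_C(1) << 30),
		/* hard limit that is never exceeded; UINT64_MAX = effectively none */
		.memlimit_stop = UINT64_MAX,
	};

	lzma_ret ret = lzma_stream_decoder_mt(strm, &mt);

	if (ret != LZMA_OK) {
		fprintf(stderr, "lzma_stream_decoder_mt() failed: %d\n", (int)ret);
		return -1;
	}
	return 0;
}
```

The decode loop itself stays the usual `lzma_code()` loop as with the single-threaded decoder; `lzma_cputhreads()` can be used to pick a thread count, with a fallback since it returns 0 when detection fails.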

thesamesam avatar Nov 16 '22 19:11 thesamesam