urbit
u3: optimize bit slices
This PR optimizes the bitstream bytes read/write implementations (used for some jam/cue implementations), and rewrites u3r_chop() (used for other jam/cue implementations, and many jets). In all cases, the main optimization is expressing the inner loop in terms that the compiler can vectorize. The previous implementation of u3r_chop() worked a bit at a time for bloq sizes less than 5, and the bitstream implementations had loop-carried dependencies.
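To illustrate the kind of rewrite described here, the following is a minimal sketch and not the code in this PR. The function names `read_bits_slow` and `read_bytes_fast` are hypothetical, and an LSB-first bit order is assumed. The first function walks the input one bit at a time; the second computes each output byte from two adjacent input bytes only, so no iteration depends on the result of a previous one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference: copy len_bits bits, starting at bit offset
 * `off` in src, into dst, one bit at a time (LSB-first).
 * dst must be zero-initialized and hold (len_bits + 7) / 8 bytes. */
static void
read_bits_slow(uint8_t *dst, const uint8_t *src, size_t off, size_t len_bits)
{
  assert(dst && src);
  for (size_t i = 0; i < len_bits; i++) {
    size_t  s   = off + i;
    uint8_t bit = (uint8_t)((src[s >> 3] >> (s & 7)) & 1);
    dst[i >> 3] |= (uint8_t)(bit << (i & 7));
  }
}

/* Hypothetical fast path: read n whole bytes starting at bit offset `off`.
 * Assumption: when off is not byte-aligned, src has one extra readable
 * byte after the last byte touched, i.e. (off >> 3) + n + 1 bytes. */
static void
read_bytes_fast(uint8_t *dst, const uint8_t *src, size_t off, size_t n)
{
  assert(dst && src);
  const uint8_t *p  = src + (off >> 3);
  unsigned       sh = (unsigned)(off & 7);

  if (0 == sh) {
    /* byte-aligned: a plain copy */
    memcpy(dst, p, n);
    return;
  }

  /* Each dst[i] depends only on p[i] and p[i + 1]; there is no value
   * carried from one iteration to the next. */
  for (size_t i = 0; i < n; i++) {
    dst[i] = (uint8_t)((p[i] >> sh) | (p[i + 1] << (8 - sh)));
  }
}
```

Because the loop body in `read_bytes_fast` has no dependency between iterations, a compiler is free to process several bytes per instruction; whether it actually does so depends on the compiler, flags, and target.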
Compilation comparisons (note the longer instruction names!):
- bitstream write bytes: https://godbolt.org/z/36851bqxd
- bitstream read bytes: https://godbolt.org/z/bEE56E6xd
- chop: https://godbolt.org/z/KEYW9Gesn
The bitstream implementation already has extensive permutation tests, using a bit-at-a-time implementation as the "golden master". The PR adds similar tests for u3r_chop(), using the old implementation as the "golden master".
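The shape of such a test can be sketched as follows. This is an illustration under stated assumptions, not the test code in this PR: `chop_ref`, `chop_fast`, `fetch32`, and `chop_golden_mismatches` are hypothetical names, and their signatures and semantics (copy `len` bits between 32-bit word arrays, LSB-first, leaving other destination bits untouched) are not those of u3r_chop(). A slow reference and a candidate are run over every permutation of source offset, destination offset, and length, and their outputs are compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "golden master": copy len bits from src (bit offset
 * `from`) to dst (bit offset `to`), one bit at a time.
 * Source bits past src_len words read as zero. */
static void
chop_ref(uint32_t *dst, size_t to,
         const uint32_t *src, size_t src_len, size_t from, size_t len)
{
  for (size_t i = 0; i < len; i++) {
    size_t   s   = from + i, d = to + i;
    uint32_t bit = ((s >> 5) < src_len) ? (src[s >> 5] >> (s & 31)) & 1u : 0u;
    dst[d >> 5]  = (dst[d >> 5] & ~(1u << (d & 31))) | (bit << (d & 31));
  }
}

/* Fetch 32 bits starting at an arbitrary bit position. */
static uint32_t
fetch32(const uint32_t *src, size_t src_len, size_t pos)
{
  size_t   w  = pos >> 5;
  uint64_t lo = (w < src_len) ? src[w] : 0;
  uint64_t hi = (w + 1 < src_len) ? src[w + 1] : 0;
  return (uint32_t)(((hi << 32) | lo) >> (pos & 31));
}

/* Hypothetical candidate: same contract, up to one word per step. */
static void
chop_fast(uint32_t *dst, size_t to,
          const uint32_t *src, size_t src_len, size_t from, size_t len)
{
  size_t done = 0;
  while (done < len) {
    size_t   d   = to + done;
    unsigned dsh = (unsigned)(d & 31);
    size_t   n   = 32 - dsh;               /* room left in this dst word */
    if (n > len - done) n = len - done;
    uint32_t mask = (32 == n) ? 0xffffffffu : ((1u << n) - 1u);
    uint32_t bits = fetch32(src, src_len, from + done) & mask;
    dst[d >> 5]   = (dst[d >> 5] & ~(mask << dsh)) | (bits << dsh);
    done += n;
  }
}

/* Sweep all (from, to, len) permutations; return the mismatch count. */
static int
chop_golden_mismatches(void)
{
  const uint32_t src[3] = {0xdeadbeefu, 0x01234567u, 0x89abcdefu};
  int bad = 0;
  for (size_t from = 0; from < 64; from++) {
    for (size_t to = 0; to < 64; to++) {
      for (size_t len = 0; len <= 64; len++) {
        uint32_t a[4], b[4];
        /* non-zero fill, so stray writes outside the slice are caught */
        memset(a, 0xa5, sizeof a);
        memset(b, 0xa5, sizeof b);
        chop_ref(a, to, src, 3, from, len);
        chop_fast(b, to, src, 3, from, len);
        if (0 != memcmp(a, b, sizeof a)) bad++;
      }
    }
  }
  return bad;
}
```

Sweeping offsets across at least two word boundaries and lengths past one word exercises the aligned, unaligned, and word-straddling cases with the same harness.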
I haven't had a chance to extensively profile these changes. Early results (per make bench) show over 100% improvement for the jam jet (on small inputs), and minor improvements for the others. I expect significant improvements for large inputs wherever these are called.
This PR is the prerequisite to a better version of #5676.