urbit u3: optimize bit slices

Open joemfb opened this issue 2 years ago • 0 comments

This PR optimizes the bitstream byte read/write implementations (used by some jam/cue implementations) and rewrites u3r_chop() (used by other jam/cue implementations and by many jets). In all cases, the main optimization is expressing the inner loop in a form that the compiler can vectorize. The previous implementation of u3r_chop() worked one bit at a time for bloq sizes less than 5, and the bitstream implementations had loop-carried dependencies.
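To illustrate the loop-carried dependency mentioned above, here is a minimal sketch, not the code in this PR. It copies bytes into a buffer at a non-byte-aligned bit offset in two ways. The function names (shift_copy_carried, shift_copy_independent) and the calling convention are hypothetical. Whether a given compiler vectorizes either loop depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only (not the vere API).
 * Both copy len bytes from src into dst, shifted left by sft bits
 * (0 < sft < 8), least-significant bit first. dst must hold len + 1 bytes.
 */

/* Carried form: every iteration consumes the carry produced by the
 * previous one, so the iterations form a serial chain.
 */
void
shift_copy_carried(uint8_t* dst, const uint8_t* src, size_t len, unsigned sft)
{
  assert( (sft > 0) && (sft < 8) );

  uint8_t car = 0;

  for ( size_t i = 0; i < len; i++ ) {
    dst[i] = (uint8_t)((src[i] << sft) | car);
    car    = (uint8_t)(src[i] >> (8 - sft));
  }

  dst[len] = car;
}

/* Independent form: each output byte is computed from two adjacent
 * input bytes only, so no value flows from one iteration to the next.
 * restrict promises the compiler that dst and src do not overlap.
 */
void
shift_copy_independent(uint8_t* restrict       dst,
                       const uint8_t* restrict src,
                       size_t                  len,
                       unsigned                sft)
{
  assert( (sft > 0) && (sft < 8) );

  if ( 0 == len ) {
    dst[0] = 0;
    return;
  }

  dst[0] = (uint8_t)(src[0] << sft);

  for ( size_t i = 1; i < len; i++ ) {
    dst[i] = (uint8_t)((src[i] << sft) | (src[i - 1] >> (8 - sft)));
  }

  dst[len] = (uint8_t)(src[len - 1] >> (8 - sft));
}
```

Both functions produce the same bytes. The second re-reads src[i - 1] instead of remembering a carry, which trades one extra load per byte for iterations that can be evaluated in any order.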

Compilation comparisons (note the longer instruction names!):

bitstream write bytes: https://godbolt.org/z/36851bqxd
bitstream read bytes: https://godbolt.org/z/bEE56E6xd
chop: https://godbolt.org/z/KEYW9Gesn

The bitstream implementation already has extensive permutation tests, using a bit-at-a-time implementation as the "golden master". The PR adds similar tests for u3r_chop(), using the old implementation as the "golden master".
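The golden-master approach described above can be sketched roughly as follows. This is an illustrative example, not the test code in this PR. slice_ref, slice_fast, and check_slices are hypothetical names, and the slice operation is simpler than u3r_chop(): it only extracts len bits starting at bit offset off into a zeroed destination.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference ("golden master"): copy len bits from src,
 * starting at bit offset off, into dst one bit at a time.
 * dst must be zeroed and hold (len + 7) / 8 bytes.
 */
void
slice_ref(uint8_t* dst, const uint8_t* src, size_t off, size_t len)
{
  for ( size_t i = 0; i < len; i++ ) {
    size_t  j   = off + i;
    uint8_t bit = (uint8_t)((src[j >> 3] >> (j & 7)) & 1);
    dst[i >> 3] |= (uint8_t)(bit << (i & 7));
  }
}

/* Hypothetical byte-at-a-time version under test. Assumes
 * off + len <= 8 * src_len.
 */
void
slice_fast(uint8_t*       dst,
           const uint8_t* src,
           size_t         src_len,
           size_t         off,
           size_t         len)
{
  size_t   byt = off >> 3;        /* first source byte          */
  unsigned sft = off & 7;         /* bit offset within the byte */
  size_t   out = (len + 7) >> 3;  /* number of output bytes     */

  for ( size_t i = 0; i < out; i++ ) {
    uint8_t  lo  = src[byt + i];
    uint8_t  hi  = ( (byt + i + 1) < src_len ) ? src[byt + i + 1] : 0;
    uint16_t two = (uint16_t)(lo | (hi << 8));
    dst[i] = (uint8_t)(two >> sft);
  }

  /* clear any bits past len in the last output byte */
  if ( len & 7 ) {
    dst[out - 1] &= (uint8_t)((1u << (len & 7)) - 1);
  }
}

/* Compare both implementations over every (off, len) pair that fits
 * in a 4-byte source. Returns 1 if all outputs match, 0 otherwise.
 */
int
check_slices(void)
{
  const uint8_t src[4] = { 0xa5, 0x3c, 0xf0, 0x81 };

  for ( size_t off = 0; off <= 32; off++ ) {
    for ( size_t len = 0; (off + len) <= 32; len++ ) {
      uint8_t ref[4]  = { 0 };
      uint8_t fast[4] = { 0 };

      slice_ref(ref, src, off, len);
      slice_fast(fast, src, sizeof(src), off, len);

      if ( 0 != memcmp(ref, fast, sizeof(ref)) ) {
        return 0;
      }
    }
  }

  return 1;
}
```

The reference is slow but simple enough to check by inspection, so any disagreement points at the fast version. A fuller test would also vary the source data and buffer sizes.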

I haven't had a chance to extensively profile these changes. Early results (per make bench) show improvements of over 100% for the jam jet (on small inputs) and minor improvements for the others. I expect significant improvements for large inputs wherever these are called.

This PR is the prerequisite to a better version of #5676.

Sep 21 '22 17:09 joemfb
