
Avoid unnecessary sign-extending instructions

Open sharpobject opened this issue 2 years ago • 2 comments

Use unsigned types to store the results of popcnt and movemask. Otherwise, when we subsequently use these values as array indices (or similar), the compiler emits a movsx to sign-extend them, which is unnecessary in almost all cases and incorrect if it ever does anything.
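As a rough illustration of the idea (not the actual patch), here is a minimal sketch assuming SSE2 and a GCC/Clang-style `__builtin_popcount`. The helper names `space_mask16` and `kept_bytes16` are hypothetical and do not come from despacer. Both `_mm_movemask_epi8` and `__builtin_popcount` return `int`, so the results are stored as `unsigned` before being widened to an index-sized type. On x86-64, writing a 32-bit register already clears the upper 32 bits, so the unsigned widening typically needs no extra instruction.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_cmpeq_epi8, _mm_movemask_epi8 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bit i is set when byte i of the 16-byte block is ' '. */
static inline unsigned space_mask16(const uint8_t *p) {
    __m128i block = _mm_loadu_si128((const __m128i *)p);
    __m128i is_space = _mm_cmpeq_epi8(block, _mm_set1_epi8(' '));
    /* _mm_movemask_epi8 returns int; keep the result in an unsigned type. */
    return (unsigned)_mm_movemask_epi8(is_space);
}

/* Hypothetical helper: number of non-space bytes, given the mask above. */
static inline size_t kept_bytes16(unsigned mask) {
    assert(mask <= 0xFFFFu); /* only the low 16 bits can be set */
    /* __builtin_popcount (GCC/Clang) also returns int. */
    unsigned spaces = (unsigned)__builtin_popcount(mask);
    /* unsigned -> size_t is a zero-extension, not a sign-extension. */
    return (size_t)(16u - spaces);
}
```

Whether a given compiler actually drops the movsx depends on the compiler and flags, so it is worth checking the generated assembly rather than assuming.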

sharpobject, Dec 04 '23 05:12

On my Haswell server, these seem to be the implementations actually affected by this patch (speed before -> after):

avx2_despace_branchless(buffer, N)                :  base frequency  3.91 GHz speed:  10.80 GB/s -> 11.08 GB/s
avx2_despace_branchless(buffer, N)                :  base frequency  3.91 GHz speed:  10.83 GB/s -> 11.07 GB/s
avx2_despace_branchless(buffer, N)                :  base frequency  3.91 GHz speed:  10.88 GB/s -> 11.03 GB/s
sse4_despace_branchless_u2(buffer, N)             :  base frequency  3.91 GHz speed:  8.71 GB/s -> 8.50 GB/s
sse4_despace_branchless_u2(buffer, N)             :  base frequency  3.91 GHz speed:  8.67 GB/s -> 8.48 GB/s
sse4_despace_branchless_u2(buffer, N)             :  base frequency  3.91 GHz speed:  8.67 GB/s -> 8.48 GB/s
sse4_despace_branchless_u4(buffer, N)             :  base frequency  3.91 GHz speed:  8.77 GB/s -> 8.47 GB/s
sse4_despace_branchless_u4(buffer, N)             :  base frequency  3.91 GHz speed:  8.77 GB/s -> 8.50 GB/s
sse4_despace_branchless_u4(buffer, N)             :  base frequency  3.91 GHz speed:  8.74 GB/s -> 8.36 GB/s
sse4_despace_skinny_u4(buffer, N)                 :  base frequency  3.91 GHz speed:  7.56 GB/s -> 7.72 GB/s
sse4_despace_skinny_u4(buffer, N)                 :  base frequency  3.91 GHz speed:  7.56 GB/s -> 7.80 GB/s
sse4_despace_skinny_u4(buffer, N)                 :  base frequency  3.91 GHz speed:  7.56 GB/s -> 7.69 GB/s
sse42_despace_branchless(buffer, N)               :  base frequency  3.91 GHz speed:  7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless(buffer, N)               :  base frequency  3.91 GHz speed:  7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless(buffer, N)               :  base frequency  3.91 GHz speed:  7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N)        :  base frequency  3.91 GHz speed:  7.09 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N)        :  base frequency  3.91 GHz speed:  7.09 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N)        :  base frequency  3.91 GHz speed:  7.09 GB/s -> 7.85 GB/s

Though it's disappointing that I've made a couple of them slower...

sharpobject, Dec 04 '23 15:12

Sorry, I think this needs more work to avoid doing any harm. I'll try to come back to this in a couple of days.

sharpobject, Dec 04 '23 15:12