despacer
Avoid unnecessary sign-extending instructions
Use unsigned types to store the results of popcnt and movemask. Otherwise the compiler emits a movsx to sign-extend these values when they are subsequently used as array indices or pointer offsets. That sign extension is unnecessary in almost all cases, and if it ever actually changed the value, the result would be incorrect.
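To illustrate the idea, here is a minimal sketch, not the code from this patch. The helper names `space_mask16`, `popcount16` and `despace16` are hypothetical, the popcount is a portable loop rather than the popcnt intrinsic, and the compaction is a plain scalar loop standing in for the shuffle-based versions. The point is only that the mask and the count live in `unsigned` variables, so using them as shift counts, indices or pointer offsets needs at most a zero-extension.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: bit i of the result is set when src[i] is a space. */
static unsigned space_mask16(const unsigned char *src) {
#if defined(__SSE2__)
    __m128i x = _mm_loadu_si128((const __m128i *)src);
    __m128i eq = _mm_cmpeq_epi8(x, _mm_set1_epi8(' '));
    /* _mm_movemask_epi8 returns int, but the value is always in 0..65535.
       Keeping it in an unsigned variable means a later use as an index
       or offset does not require sign extension. */
    return (unsigned)_mm_movemask_epi8(eq);
#else
    /* Scalar fallback so the sketch also builds without SSE2. */
    unsigned m = 0;
    for (unsigned i = 0; i < 16; i++)
        m |= (unsigned)(src[i] == ' ') << i;
    return m;
#endif
}

/* Hypothetical helper: portable bit count, standing in for popcnt. */
static unsigned popcount16(unsigned m) {
    unsigned c = 0;
    for (; m != 0; m &= m - 1)
        c++;
    return c;
}

/* Copies the non-space bytes of one 16-byte block to dst and returns
   how many bytes were kept. */
static size_t despace16(const unsigned char *src, unsigned char *dst) {
    unsigned mask = space_mask16(src);      /* unsigned, not int */
    unsigned kept = 16u - popcount16(mask); /* unsigned, not int */
    unsigned char *out = dst;
    for (unsigned i = 0; i < 16; i++) {
        if (((mask >> i) & 1u) == 0)
            *out++ = src[i];
    }
    (void)out;
    return kept; /* widening unsigned to size_t is a zero-extension */
}
```

With `int` in place of `unsigned` for `mask` and `kept`, the compiler may have to insert a sign extension before using the value in 64-bit address arithmetic, which is the instruction this patch tries to avoid.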
On my Haswell server, these seem to be the implementations actually affected by this patch (speed before -> after, three runs each):
avx2_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 10.80 GB/s -> 11.08 GB/s
avx2_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 10.83 GB/s -> 11.07 GB/s
avx2_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 10.88 GB/s -> 11.03 GB/s
sse4_despace_branchless_u2(buffer, N) : base frequency 3.91 GHz speed: 8.71 GB/s -> 8.50 GB/s
sse4_despace_branchless_u2(buffer, N) : base frequency 3.91 GHz speed: 8.67 GB/s -> 8.48 GB/s
sse4_despace_branchless_u2(buffer, N) : base frequency 3.91 GHz speed: 8.67 GB/s -> 8.48 GB/s
sse4_despace_branchless_u4(buffer, N) : base frequency 3.91 GHz speed: 8.77 GB/s -> 8.47 GB/s
sse4_despace_branchless_u4(buffer, N) : base frequency 3.91 GHz speed: 8.77 GB/s -> 8.50 GB/s
sse4_despace_branchless_u4(buffer, N) : base frequency 3.91 GHz speed: 8.74 GB/s -> 8.36 GB/s
sse4_despace_skinny_u4(buffer, N) : base frequency 3.91 GHz speed: 7.56 GB/s -> 7.72 GB/s
sse4_despace_skinny_u4(buffer, N) : base frequency 3.91 GHz speed: 7.56 GB/s -> 7.80 GB/s
sse4_despace_skinny_u4(buffer, N) : base frequency 3.91 GHz speed: 7.56 GB/s -> 7.69 GB/s
sse42_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N) : base frequency 3.91 GHz speed: 7.09 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N) : base frequency 3.91 GHz speed: 7.09 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N) : base frequency 3.91 GHz speed: 7.09 GB/s -> 7.85 GB/s
It's disappointing that this made a couple of them slower (sse4_despace_branchless_u2 and sse4_despace_branchless_u4).
Sorry, I think this needs more work to avoid doing any harm. I'll try to come back to this in a couple days.