sse2neon

Improve _mm_popcnt_*

Open jserv opened this issue 5 years ago • 1 comments

Quote from Jukka Liimatta

_mm_popcnt_* uses a store, when vget_lane_* would probably be a better fit. The compiler will more likely optimize the store into a lane extract, but as written it can go either way. The 32-bit load reads 64 bits from a 32-bit variable; that should be fixed. vcreate_u8 would be safer anyway. The vrev64q_u32 handled the lo/hi case. The load/store in _mm_popcnt might warrant a second look.
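A minimal sketch of what the quote seems to suggest, not the actual sse2neon code: build the vector with vcreate_u8 so no 64-bit load touches a 32-bit variable, and read the result with vget_lane_* rather than storing to memory. The names popcnt_u32_sketch, popcnt_u64_sketch and popcnt_sketch_selfcheck are hypothetical, and the non-NEON fallback is an assumption added only so the example builds on other targets.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: 64-bit popcount that stays in NEON registers. */
static inline int popcnt_u64_sketch(uint64_t a)
{
    uint8x8_t bytes = vcreate_u8(a);       /* scalar to vector, no memory load */
    uint8x8_t cnt8 = vcnt_u8(bytes);       /* bit count of each byte */
    uint16x4_t cnt16 = vpaddl_u8(cnt8);    /* pairwise widening adds */
    uint32x2_t cnt32 = vpaddl_u16(cnt16);
    uint64x1_t cnt64 = vpaddl_u32(cnt32);
    return (int) vget_lane_u64(cnt64, 0);  /* lane extract instead of a store */
}
#else
/* Assumed portable fallback: clear the lowest set bit until none remain. */
static inline int popcnt_u64_sketch(uint64_t a)
{
    int n = 0;
    while (a) {
        a &= a - 1;
        n++;
    }
    return n;
}
#endif

/* Hypothetical helper: zero-extend so only 32 bits of input are ever read. */
static inline int popcnt_u32_sketch(uint32_t a)
{
    return popcnt_u64_sketch((uint64_t) a);
}

/* Small sanity check of both helpers. */
void popcnt_sketch_selfcheck(void)
{
    assert(popcnt_u32_sketch(0u) == 0);
    assert(popcnt_u32_sketch(0xFFFFFFFFu) == 32);
    assert(popcnt_u64_sketch(UINT64_MAX) == 64);
}
```

Zero-extending the 32-bit input before vcreate_u8 leaves the upper four bytes at zero, so they add nothing to the count.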

Source: https://twitter.com/JukkaLiimatta/status/1276540448245415936

jserv avatar Jun 26 '20 16:06 jserv

sse-popcount provides several popcount implementations, including Arm NEON ones, along with comprehensive benchmarks.

jserv avatar Oct 09 '21 14:10 jserv

Per commit df9b58d283d1ad0fcfa6246225bda6ab5eae2ea6, we stick to the popcount implementation provided by the GNU toolchain.
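For readers unfamiliar with the toolchain route, here is a rough sketch assuming it means the GCC/Clang builtins __builtin_popcount and __builtin_popcountll; the commit itself is not reproduced here, and the wrapper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical wrappers; the compiler picks the instruction sequence
   (for example a NEON CNT-based one on AArch64) for the target. */
static inline int popcnt_u32_builtin(uint32_t a)
{
    return __builtin_popcount(a);
}

static inline int popcnt_u64_builtin(uint64_t a)
{
    return __builtin_popcountll(a);
}
```

Delegating to the builtin leaves the choice between a load/store and a lane extract to the compiler, which is the concern raised in the original quote.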

jserv avatar Dec 26 '22 05:12 jserv