sse2neon
Improve _mm_popcnt_*
Quote from Jukka Liimatta
_mm_popcnt_* uses store, when vget_lane_* would probably be a better fit.. the compiler will optimize the store into lane extract more likely but now it can go either way. The 32 bit load reads 64 bits from 32 bit variable.. that should be fixed. vcreate_u8 would be safer anyway. the vrev64q_u32 handled the lo/hi case. The load/store in _mm_popcnt might warrant a second look.
Source: https://twitter.com/JukkaLiimatta/status/1276540448245415936
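To make the suggestion concrete, here is a minimal sketch of a 32-bit popcount that builds the vector with vcreate_u8 and reads the result with vget_lane_u32, so there is no 64-bit load from a 32-bit variable and no store to memory. The function name popcnt_u32_lane_sketch is hypothetical and this is not the code in sse2neon. The non-NEON branch is only there so the sketch compiles on other targets.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper, not part of sse2neon. */
static inline int popcnt_u32_lane_sketch(uint32_t a)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Build the 64-bit vector from the value itself: the upper 32 bits
     * are zero, and nothing is read from memory. */
    uint8x8_t bytes = vcreate_u8((uint64_t) a);
    /* Per-byte population count. */
    uint8x8_t cnt8 = vcnt_u8(bytes);
    /* Pairwise widening adds: 8 x u8 -> 4 x u16 -> 2 x u32. */
    uint16x4_t cnt16 = vpaddl_u8(cnt8);
    uint32x2_t cnt32 = vpaddl_u16(cnt16);
    /* Lane 0 holds the sum of the four low bytes; extract it directly
     * instead of storing the vector. */
    return (int) vget_lane_u32(cnt32, 0);
#else
    /* Portable fallback (assumption: only used when NEON is absent). */
    int n = 0;
    while (a) {
        a &= a - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
#endif
}
```

The 64-bit variant would follow the same shape with one more widening step (vpaddl_u32) and vget_lane_u64.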
sse-popcount provides several popcount implementations along with comprehensive benchmarks, including Arm NEON versions.
Per commit df9b58d283d1ad0fcfa6246225bda6ab5eae2ea6, we stick to the popcount implementation provided by the GNU toolchain.
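For readers unfamiliar with the toolchain route, the following is a rough sketch of delegating to the GCC/Clang builtins __builtin_popcount and __builtin_popcountll. The wrapper names are hypothetical, and this is an assumption about the general approach, not a copy of what the commit does.

```c
#include <stdint.h>

/* Hypothetical wrappers; the names are illustrative only. */
static inline int popcnt_u32_builtin_sketch(uint32_t a)
{
    /* __builtin_popcount takes an unsigned int. */
    return __builtin_popcount((unsigned int) a);
}

static inline int64_t popcnt_u64_builtin_sketch(uint64_t a)
{
    /* __builtin_popcountll takes an unsigned long long. */
    return (int64_t) __builtin_popcountll((unsigned long long) a);
}
```

Leaving the choice to the compiler means it can pick whatever instruction sequence it considers best for the target, at the cost of less direct control over the generated code.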