
Proof of concept: auto AVX-512 vectorization

Open arturbac opened this issue 1 year ago • 3 comments

DON'T merge. This is just an example proof of concept for #1591. Below is the output of json_performance on a Ryzen 9 9950X with -march=znver5.

Normal code, with the AVX2 code block enabled, on znver5:

Glaze json write: 0.338783 s, 1886.05 MB/s
Glaze json write: 0.337576 s, 1912.27 MB/s
Glaze json write: 0.231021 s, 3690.51 MB/s

Auto-vectorized 512-bit code (except movemask_64):

Glaze json write: 0.318704 s, 2004.88 MB/s
Glaze json write: 0.343197 s, 1880.95 MB/s
Glaze json write: 0.219827 s, 3878.44 MB/s

arturbac, Feb 04 '25 21:02

The vector extensions only work for Clang and GCC, right? How would you recommend supporting MSVC?

stephenberry, Feb 05 '25 18:02

The vector extensions only work for Clang and GCC, right? How would you recommend supporting MSVC?

As I mentioned earlier: wrap and simulate.

https://devblogs.microsoft.com/cppblog/avx-512-auto-vectorization-in-msvc/

#include <cstdint>
#include <immintrin.h>

#if defined(__clang__)
// Clang: relies on implicit conversion between same-size integer vector types.
using uint64x8_t = uint64_t __attribute__((__vector_size__(64)));
inline auto as_uint64x8(__m512i v) -> uint64x8_t { return v; }
inline auto as_m512i(uint64x8_t v) -> __m512i { return v; }

#elif defined(__GNUC__)
// GCC: same vector extension, with explicit casts between the vector types.
using uint64x8_t = uint64_t __attribute__((__vector_size__(64)));
inline auto as_uint64x8(__m512i v) -> uint64x8_t { return (uint64x8_t)v; }
inline auto as_m512i(uint64x8_t v) -> __m512i { return (__m512i)v; }

#else
// Other compilers (MSVC): no vector extensions, so wrap __m512i
// and provide the operators by hand.
union uint64x8_t // or memcpy
{
    alignas(64) uint64_t value[8];
    __m512i v512;
};

[[msvc::forceinline]]
inline auto as_uint64x8(__m512i v) -> uint64x8_t
{
    uint64x8_t res;
    res.v512 = v;
    return res;
}

[[msvc::forceinline]]
inline auto as_m512i(uint64x8_t v) -> __m512i
{
    return v.v512;
}

// Marked inline so the definition can live in a header.
inline auto operator&(uint64x8_t a, uint64x8_t b) -> uint64x8_t
{
    uint64x8_t res;
    res.v512 = _mm512_and_epi64(a.v512, b.v512);
    return res;
}
#endif
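
For illustration only, here is a minimal sketch of the "simulate" half of that idea that needs no intrinsics at all: eight 64-bit lanes in a plain aligned array, with the operators written as fixed-length loops. The names `u64x8_fallback`, `broadcast`, and `any` are hypothetical and are not part of Glaze or this PR. Whether a given compiler turns these loops into AVX-512 instructions depends on its version and flags, and that is not verified here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fallback type (illustrative name, not from Glaze):
// eight 64-bit lanes stored in a 64-byte aligned array.
struct alignas(64) u64x8_fallback {
    std::uint64_t lane[8];
};

// Fill every lane with the same value.
constexpr u64x8_fallback broadcast(std::uint64_t x) {
    u64x8_fallback r{};
    for (std::size_t i = 0; i < 8; ++i) r.lane[i] = x;
    return r;
}

// Lane-wise AND as a fixed-trip-count loop, a shape an
// auto-vectorizer has a chance of recognizing.
constexpr u64x8_fallback operator&(const u64x8_fallback& a, const u64x8_fallback& b) {
    u64x8_fallback r{};
    for (std::size_t i = 0; i < 8; ++i) r.lane[i] = a.lane[i] & b.lane[i];
    return r;
}

// Lane-wise OR, same pattern.
constexpr u64x8_fallback operator|(const u64x8_fallback& a, const u64x8_fallback& b) {
    u64x8_fallback r{};
    for (std::size_t i = 0; i < 8; ++i) r.lane[i] = a.lane[i] | b.lane[i];
    return r;
}

// True if at least one lane is non-zero.
constexpr bool any(const u64x8_fallback& v) {
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < 8; ++i) acc |= v.lane[i];
    return acc != 0;
}
```

Code written against `operator&`, `operator|`, and `any` would then read the same whether the underlying type is a compiler vector extension or this array wrapper, which is the point of wrapping.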

arturbac, Feb 05 '25 19:02

I am not sure it is worth doing. When I checked the generic uint64 version on znver5, its performance was almost the same as the optimized versions: better in some cases, worse in others.

Glaze json write: 0.336473 s, 1899 MB/s
Glaze json write: 0.390721 s, 1650.43 MB/s
Glaze json write: 0.200108 s, 4260.62 MB/s
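
For readers unfamiliar with what a "generic uint64" path can look like, below is a minimal sketch of the general SWAR (SIMD within a register) technique: testing 8 bytes at once, in one 64-bit word, for characters a JSON string writer would have to escape. This is an assumption about the kind of code meant, not Glaze's actual implementation, and the names `repeat_byte`, `load8`, `has_zero_byte`, and `chunk_needs_escape` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Replicate one byte into all 8 byte positions of a 64-bit word.
constexpr std::uint64_t repeat_byte(std::uint8_t b) {
    return 0x0101010101010101ULL * b;
}

// Assemble 8 bytes into a word, byte 0 in the low bits.
// Written as shifts so the result does not depend on host endianness.
constexpr std::uint64_t load8(const char* p) {
    std::uint64_t w = 0;
    for (std::size_t i = 0; i < 8; ++i)
        w |= static_cast<std::uint64_t>(static_cast<unsigned char>(p[i])) << (8 * i);
    return w;
}

// Non-zero if and only if at least one byte of w is 0x00.
constexpr std::uint64_t has_zero_byte(std::uint64_t w) {
    return (w - repeat_byte(0x01)) & ~w & repeat_byte(0x80);
}

// True if any of the 8 bytes is '"', '\\', or a control character (< 0x20).
constexpr bool chunk_needs_escape(std::uint64_t w) {
    // XOR turns a matching byte into 0x00, which has_zero_byte then finds.
    const std::uint64_t quote = has_zero_byte(w ^ repeat_byte('"'));
    const std::uint64_t backslash = has_zero_byte(w ^ repeat_byte('\\'));
    // "Any byte less than 0x20" test; bytes >= 0x80 are masked out by ~w.
    const std::uint64_t control = (w - repeat_byte(0x20)) & ~w & repeat_byte(0x80);
    return (quote | backslash | control) != 0;
}
```

A writer built this way could copy 8 bytes at a time while `chunk_needs_escape` returns false and fall back to a byte-by-byte path otherwise. That uses only ordinary integer instructions, which may be part of why such a version stays close to the hand-vectorized ones in the numbers above.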

arturbac, Feb 08 '25 18:02