Aarch64 LSE 16-byte CAS operation

Open SchrodingerZhu opened this issue 4 months ago • 2 comments

https://github.com/microsoft/snmalloc/blob/3385660fd7c674c11c74b501fd57646dc50618e2/src/snmalloc/ds/aba.h#L24C13-L24C28

Since we are not using the LL/SC abstraction anyway, perhaps we can relax the first condition to defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16) || defined(PLATFORM_IS_X86).

This would allow the compiler to emit 16-byte CAS-based code on AArch64 platforms with LSE.

https://developer.arm.com/documentation/100069/0606/Data-Transfer-Instructions/CASPA--CASPAL--CASP--CASPL--CASPAL--CASP--CASPL?lang=en
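
To make the proposal concrete, here is a minimal sketch of the kind of selection I have in mind. It is not the actual code in aba.h: `Pair`, `Cell`, `update_ptr` and `u128` are hypothetical names for illustration, and the sketch only tests the GCC/Clang macro (the `PLATFORM_IS_X86` half of the condition is left out so it stays self-contained). When the macro is defined, the 16-byte CAS goes through the `__sync` builtins, which I would expect to lower to CASP-family instructions when targeting LSE (e.g. `-march=armv8.1-a`); otherwise it falls back to a simple spin lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical pointer + version pair; 16 bytes on 64-bit targets.
struct Pair
{
  std::uint64_t ptr;
  std::uint64_t aba;
};

#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
// Native path: the compiler advertises a 16-byte compare-and-swap.
__extension__ typedef unsigned __int128 u128;

class Cell
{
  alignas(16) u128 raw = 0;

  static u128 pack(Pair p)
  {
    return (static_cast<u128>(p.aba) << 64) | p.ptr;
  }

public:
  static constexpr bool native = true;

  bool cas(Pair expected, Pair desired)
  {
    return __sync_bool_compare_and_swap(&raw, pack(expected), pack(desired));
  }

  Pair read()
  {
    // A CAS of 0 -> 0 returns the current value without changing it.
    u128 v = __sync_val_compare_and_swap(
      &raw, static_cast<u128>(0), static_cast<u128>(0));
    return Pair{
      static_cast<std::uint64_t>(v), static_cast<std::uint64_t>(v >> 64)};
  }
};
#else
// Fallback path: protect the pair with a spin lock (assumed design).
class Cell
{
  std::atomic<bool> locked{false};
  Pair value{0, 0};

  void acquire()
  {
    while (locked.exchange(true, std::memory_order_acquire))
    {
    }
  }

  void release()
  {
    locked.store(false, std::memory_order_release);
  }

public:
  static constexpr bool native = false;

  bool cas(Pair expected, Pair desired)
  {
    acquire();
    bool ok = value.ptr == expected.ptr && value.aba == expected.aba;
    if (ok)
      value = desired;
    release();
    return ok;
  }

  Pair read()
  {
    acquire();
    Pair p = value;
    release();
    return p;
  }
};
#endif

// Retry loop in the usual ABA style: install new_ptr and bump the version.
// Returns the version that was installed.
inline std::uint64_t update_ptr(Cell& cell, std::uint64_t new_ptr)
{
  for (;;)
  {
    Pair cur = cell.read();
    Pair next{new_ptr, cur.aba + 1};
    assert(next.aba != cur.aba); // the version changes on every update
    if (cell.cas(cur, next))
      return next.aba;
  }
}
```

Both paths expose the same `cas`/`read` interface, so callers such as `update_ptr` do not need to know which one was selected; `Cell::native` reports the choice at compile time.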

Aug 10 '25 01:08 SchrodingerZhu

Seems reasonable to me. I do wonder what the cost of just using locks for that would be.

Aug 18 '25 12:08 mjp41

According to Nvidia's Grace handbook, LSE seems to be a lot faster on the latest ARM CPUs than it used to be, making it faster than LL/SC in many cases.

However, that was not true when LSE was first introduced, so the CAS operation is slower on older models.

That said, I don't know how it actually performs compared to the lock-based fallback approach here.

Aug 18 '25 12:08 SchrodingerZhu