AArch64 LSE 16-byte CAS operation
https://github.com/microsoft/snmalloc/blob/3385660fd7c674c11c74b501fd57646dc50618e2/src/snmalloc/ds/aba.h#L24C13-L24C28
Since we are not using the LL/SC abstraction anyway, perhaps we can relax the first condition to defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16) || defined(PLATFORM_IS_X86)?
This would allow the compiler to emit 16-byte CAS code on AArch64 platforms with LSE.
https://developer.arm.com/documentation/100069/0606/Data-Transfer-Instructions/CASPA--CASPAL--CASP--CASPL--CASPAL--CASP--CASPL?lang=en
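To make the idea concrete, here is a rough sketch of the kind of guard I have in mind. It is not the code in aba.h: Pair, cas16 and the spin-lock fallback are hypothetical names made up for illustration, and I am assuming a 64-bit GCC or Clang target where unsigned __int128 and the __sync builtins are available.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical pointer + counter pair (not snmalloc's actual type).
struct alignas(16) Pair
{
  void* ptr;
  std::uint64_t aba;
};

#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
// The compiler reports a native 16-byte CAS, so let it pick the
// instruction sequence for the target (for example CASP when LSE is enabled).
static_assert(sizeof(Pair) == 16, "sketch assumes 64-bit pointers");
__extension__ typedef unsigned __int128 u128;

inline bool cas16(Pair* target, Pair expected, Pair desired)
{
  u128 e, d;
  std::memcpy(&e, &expected, sizeof(e));
  std::memcpy(&d, &desired, sizeof(d));
  // Full-barrier CAS; returns true if *target held `expected` and was replaced.
  return __sync_bool_compare_and_swap(reinterpret_cast<u128*>(target), e, d);
}
#else
// Stand-in for a lock-based fallback: one global spin lock.
inline std::atomic_flag& cas16_lock()
{
  static std::atomic_flag flag = ATOMIC_FLAG_INIT;
  return flag;
}

inline bool cas16(Pair* target, Pair expected, Pair desired)
{
  while (cas16_lock().test_and_set(std::memory_order_acquire))
  {
    // spin until the lock is released
  }
  bool ok = target->ptr == expected.ptr && target->aba == expected.aba;
  if (ok)
    *target = desired;
  cas16_lock().clear(std::memory_order_release);
  return ok;
}
#endif
```

Both branches expose the same cas16 signature, so the caller does not need to know which one was selected. Which instructions the first branch ends up using depends on the compiler and flags (for example whether LSE is enabled via -march), so this only shows the shape of the guard.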
Seems reasonable to me. I do wonder what the cost of just using locks for that would be.
According to Nvidia's Grace handbook, LSE is a lot faster on the latest ARM CPUs than it used to be, making it faster than LL/SC in many cases.
However, that was not true when LSE was first introduced, so CAS is slower on older models.
That said, I don't know how it actually performs compared to the lock-based fallback approach here.