amx
Some tests failing on M4
I had a quick look at the M4 – the tests for EXTRX, EXTRY, VECINT and VECFP are failing.
EXTRX and EXTRY can be fixed with the following change to each:
if ((AMX_VER >= AMX_VER_M2) && (operand & (1ull << 31))) {
operand &=~ (0x1ffull << 32);
z_step = z_col & 32 ? 16 : 32;
}
+ if ((AMX_VER >= AMX_VER_M4) && (operand & (1ull << 31))) {
+ dst_offset &= ~0x3F;
+ }
store_enable &= parse_writemask(operand >> 32, xybytes, 9);
} else if (operand & EXTR_BETWEEN_XY) {
VECINT and VECFP seem to have similar changes – if I only test operands of the form rand_next() & ~(0x3F | (0x3F<<10)), the tests pass. I was able to fix a simple test case by zeroing those bits when bit 31 is set, but that broke indexed loads. Trying to fix indexed loads didn't go well, and other experiments suggest that wouldn't be the end of it either. I might be able to work through it, but I figured I'd leave this here in case it's helpful.
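For anyone who wants to reproduce the experiment, here is a minimal sketch of the operand masking described above. The function names are hypothetical and not part of the test suite; the only things taken from the thread are the mask `~(0x3F | (0x3F<<10))` and the bit 31 condition. Note that the conditional variant is the one that broke indexed loads, so treat it as a record of the experiment rather than a fix.

```c
#include <assert.h>
#include <stdint.h>

/* Clear bits 0-5 and bits 10-15 of the operand, i.e. the low six bits of
 * each of the two offset fields mentioned in the thread. */
uint64_t clear_low_offset_bits(uint64_t operand) {
    const uint64_t low6 = 0x3Full;
    return operand & ~(low6 | (low6 << 10));
}

/* Experimental variant: only clear those bits when bit 31 is set.
 * According to the thread this fixed a simple case but broke indexed loads. */
uint64_t clear_low_offset_bits_if_bit31(uint64_t operand) {
    if (operand & (1ull << 31)) {
        return clear_low_offset_bits(operand);
    }
    return operand;
}

/* Usage example with the failing VECINT/VECFP operand from the logs. */
void example_check(void) {
    assert(clear_low_offset_bits(0x1021b872b64c6724ull) == 0x1021b872b64c0300ull);
}
```

In a test loop, the first helper would be applied to each random operand before it is passed to both the emulated and the hardware instruction.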
Edit: Also, entirely unsurprisingly, Streaming-SVE mode (SME) and AMX are mutually exclusive – if either is enabled, trying to enable the other gives EXC_BAD_INSTRUCTION.
I'm mildly surprised that AMX instructions are still present at all, given the introduction of SME.
I don't have any M4 hardware to test against at the moment, though I might pick up an M4 MBP when they come out.
Yeah, I was surprised too – my initial theory was it was just for software compatibility (within Apple), but I think we're also seeing worse f16/bf16 throughput with SME too, because the spec'd SME operations map less directly to what AMX can do at that size. I might be misremembering the AMX behaviour, or misusing SME, but I've measured single-core SME f16 FLOPS ≈ single-core SME f32 FLOPS (as did someone else https://mastodon.social/@[email protected]/112528651326649755)
Testing on a Mac mini reveals more tests failing(?):
FMS16, FMS32 and FMS64 now seem to be failing too.
I will try to check your fixes for EXTRX, EXTRY, VECINT and VECFP.
./a.out
Testing AMX_LDX... OK
Testing AMX_LDY... OK
Testing AMX_LDZ... OK
Testing AMX_LDZI... OK
Testing AMX_STX... OK
Testing AMX_STY... OK
Testing AMX_STZ... OK
Testing AMX_STZI... OK
Testing AMX_EXTRX... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_EXTRY... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_MAC16... OK
Testing AMX_FMA16... OK
Testing AMX_FMA32... OK
Testing AMX_FMA64... OK
Testing AMX_FMS16... Failed on iteration 0.49 (operand 0xc5bea1698a3d037f)
Testing AMX_FMS32... Failed on iteration 0.59 (operand 0xf8826e6a0d8d1136)
Testing AMX_FMS64... Failed on iteration 0.404 (operand 0xba36882d4b8f059b)
Testing AMX_VECINT... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_VECFP... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_MATINT... OK
Testing AMX_MATFP... OK
Testing AMX_GENLUT... OK
Hmm, yeah, I can reproduce the FMS tests also failing on base model M4 Mac mini. I don't think I overlooked this, but I can't be certain, and I can't rule out changes in how it was being built (iirc I was using Xcode to build/run on iPad).
Assuming the behaviour is different, I wonder if there are silicon changes between M4 iPads and M4 Macs, or if it's a configuration change.
EXTRX/EXTRY fix above still works for those ops.
Unfortunately I had to update the iPad to latest (18.1.1) before I could re-test, but FMS is failing there too, so it seems likely to either be my mistake, or a configuration change.
I've just picked up an M4 Max MacBook Pro, on which I see the following:
Testing AMX_LDX... OK
Testing AMX_LDY... OK
Testing AMX_LDZ... OK
Testing AMX_LDZI... OK
Testing AMX_STX... OK
Testing AMX_STY... OK
Testing AMX_STZ... OK
Testing AMX_STZI... OK
Testing AMX_EXTRX... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_EXTRY... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_MAC16... OK
Testing AMX_FMA16... OK
Testing AMX_FMA32... OK
Testing AMX_FMA64... OK
Testing AMX_FMS16... OK
Testing AMX_FMS32... OK
Testing AMX_FMS64... OK
Testing AMX_VECINT... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_VECFP... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_MATINT... OK
Testing AMX_MATFP... OK
Testing AMX_GENLUT... OK
I will investigate in due course.
... and after rebuilding the test binary (rather than just running what was built on my previous MacBook), I also see the FMS failures:
Testing AMX_LDX... OK
Testing AMX_LDY... OK
Testing AMX_LDZ... OK
Testing AMX_LDZI... OK
Testing AMX_STX... OK
Testing AMX_STY... OK
Testing AMX_STZ... OK
Testing AMX_STZI... OK
Testing AMX_EXTRX... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_EXTRY... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_MAC16... OK
Testing AMX_FMA16... OK
Testing AMX_FMA32... OK
Testing AMX_FMA64... OK
Testing AMX_FMS16... Failed on iteration 0.49 (operand 0xc5bea1698a3d037f)
Testing AMX_FMS32... Failed on iteration 0.59 (operand 0xf8826e6a0d8d1136)
Testing AMX_FMS64... Failed on iteration 0.404 (operand 0xba36882d4b8f059b)
Testing AMX_VECINT... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_VECFP... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_MATINT... OK
Testing AMX_MATFP... OK
Testing AMX_GENLUT... OK
I think the FMS thing is down to a codegen change in newer versions of clang - see f3582ce.
I have pushed 483714bb05 with M4 fixes. The conclusion for indexed loads is that some number of bits are masked off the bottom of the X/Y offset fields, with that number being the minimal number such that the loaded index data doesn't straddle any 64-byte register boundaries.
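One way to read the indexed-load rule above, as a hedged sketch rather than the code in 483714bb05: if the index data is `n_bytes` long (at most 64), then clearing the low `k` bits of the offset guarantees no straddling exactly when `2^k >= n_bytes`, so the minimal `k` is the ceiling of log2 of the size. The function names and the `n_bytes` parameter are assumptions for illustration; the thread does not state the index data size for each mode.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest k such that 2^k >= n_bytes, for index data of 1 to 64 bytes.
 * With the low k bits cleared, the offset is a multiple of 2^k, so the
 * data ends at most at byte 63 of its 64-byte register. */
unsigned index_mask_bits(unsigned n_bytes) {
    assert(n_bytes >= 1 && n_bytes <= 64);
    unsigned k = 0;
    while ((1u << k) < n_bytes) {
        k++;
    }
    return k;
}

/* Apply the mask to a byte offset into the X or Y register file. */
uint32_t mask_index_offset(uint32_t offset, unsigned n_bytes) {
    return offset & ~((1u << index_mask_bits(n_bytes)) - 1u);
}

/* True if [offset, offset + n_bytes) crosses a 64-byte register boundary. */
int straddles(uint32_t offset, unsigned n_bytes) {
    return (offset / 64) != ((offset + n_bytes - 1) / 64);
}
```

For a full 64-byte load this gives six masked bits, which matches the `& ~0x3F` in the EXTRX/EXTRY change earlier in the thread.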
I was so pissed off ~~at Apple~~ because I bought a Mac for literally one reason: AMX, and more specifically GENLUT.
So I applied for a return, but then I thought, hmm, maybe I should try this?
Lucky I did, you saved me. I still have to run the perf tests all the way, but it's a great start. Thank you, thank you, thank you.
I love Apple now.
AMX_VER: 4
Testing AMX_LDX... OK
Testing AMX_LDY... OK
Testing AMX_LDZ... OK
Testing AMX_LDZI... OK
Testing AMX_STX... OK
Testing AMX_STY... OK
Testing AMX_STZ... OK
Testing AMX_STZI... OK
Testing AMX_EXTRX... OK
Testing AMX_EXTRY... OK
Testing AMX_MAC16... OK
Testing AMX_FMA16... OK
Testing AMX_FMA32... OK
Testing AMX_FMA64... OK
Testing AMX_FMS16... OK
Testing AMX_FMS32... OK
Testing AMX_FMS64... OK
Testing AMX_VECINT... OK
Testing AMX_VECFP... OK
Testing AMX_MATINT... OK
Testing AMX_MATFP... OK
Testing AMX_GENLUT... OK
Here it is just in case anyone else comes across this.
On a 10-core M4 iMac
perf_output.txt fma16_mat_f16f16_x*y+z (far z)
| Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
|---|---|---|---|---|---|---|
| 1 per thread | 2009.4 GFLOPS | 2454.8 GFLOPS | 3649.5 GFLOPS | 3653.2 GFLOPS | 3701.1 GFLOPS | 4162.0 GFLOPS |
| 2 per thread | 3986.7 GFLOPS | 4706.4 GFLOPS | 4527.6 GFLOPS | 4571.6 GFLOPS | 4606.0 GFLOPS | 4621.4 GFLOPS |
So glad Apple is still doing their own thing, I was getting tired of nibble splits and shuffle instructions from 2006. 😄