Some tests failing on M4

Open dougallj opened this issue 1 year ago • 2 comments

I had a quick look at the M4 – the tests for EXTRX, EXTRY, VECINT and VECFP are failing.

EXTRX and EXTRY can be fixed with the following change to each:

         if ((AMX_VER >= AMX_VER_M2) && (operand & (1ull << 31))) {
             operand &=~ (0x1ffull << 32);
             z_step = z_col & 32 ? 16 : 32;
         }
+        if ((AMX_VER >= AMX_VER_M4) && (operand & (1ull << 31))) {
+            dst_offset &= ~0x3F;
+        }
         store_enable &= parse_writemask(operand >> 32, xybytes, 9);
     } else if (operand & EXTR_BETWEEN_XY) {

VECINT and VECFP seem to have similar changes – if I only test operands of the form rand_next() & ~(0x3F | (0x3F<<10)) the tests pass. I was able to fix a simple test case by zeroing those bits if bit 31 was set, but that broke indexed-loads. Trying to fix indexed-loads didn't go well, and other experiments imply that that wouldn't be the end of it either. I might be able to work through it, but I figured I'd leave this here in case it's helpful.
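
To make the operand restriction concrete, here is a minimal C sketch of the mask in the expression above, which clears bits 0-5 and 10-15. The helper names `restrict_operand` and `zero_if_bit31` are made up for illustration and are not part of the test suite. The second one mirrors the simple fix described above, which handled a simple test case but broke indexed-loads, so it should not be read as the correct M4 behaviour.

```c
#include <stdint.h>

/* Bits 0-5 and 10-15: the same bits as 0x3F | (0x3F<<10) in the expression above. */
#define LOW6_MASK ((0x3Full) | (0x3Full << 10))

/* Hypothetical helper: limit a random operand to the form that passes on M4. */
static uint64_t restrict_operand(uint64_t operand) {
    return operand & ~LOW6_MASK;
}

/* Hypothetical helper: the "simple" fix, zeroing those bits only when bit 31 is set.
   Per the description above, this alone is NOT sufficient (it broke indexed-loads). */
static uint64_t zero_if_bit31(uint64_t operand) {
    if (operand & (1ull << 31)) {
        operand &= ~LOW6_MASK;
    }
    return operand;
}
```

Writing the constants with a `ull` suffix keeps the whole mask 64 bits wide, so the intent does not depend on integer promotion rules.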

Edit: Also, entirely unsurprisingly, Streaming-SVE mode (SME) and AMX are mutually exclusive – if either is enabled, trying to enable the other gives EXC_BAD_INSTRUCTION.

dougallj avatar Jun 01 '24 09:06 dougallj

I'm mildly surprised that AMX instructions are still present at all, given the introduction of SME.

I don't have any M4 hardware to test against at the moment, though I might pick up an M4 MBP when they come out.

corsix avatar Jun 01 '24 11:06 corsix

Yeah, I was surprised too – my initial theory was it was just for software compatibility (within Apple), but I think we're also seeing worse f16/bf16 throughput with SME too, because the spec'd SME operations map less directly to what AMX can do at that size. I might be misremembering the AMX behaviour, or misusing SME, but I've measured single-core SME f16 FLOPS ≈ single-core SME f32 FLOPS (as did someone else https://mastodon.social/@[email protected]/112528651326649755)

dougallj avatar Jun 02 '24 03:06 dougallj

Testing on a Mac mini reveals more tests failing(?):

FMS16, FMS32 and FMS64 now seem to be failing too.

I will try your fixes for EXTRX, EXTRY, VECINT and VECFP.

./a.out
Testing AMX_LDX... OK   
Testing AMX_LDY... OK   
Testing AMX_LDZ... OK   
Testing AMX_LDZI... OK   
Testing AMX_STX... OK   
Testing AMX_STY... OK   
Testing AMX_STZ... OK   
Testing AMX_STZI... OK   
Testing AMX_EXTRX... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_EXTRY... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_MAC16... OK   
Testing AMX_FMA16... OK   
Testing AMX_FMA32... OK   
Testing AMX_FMA64... OK   
Testing AMX_FMS16... Failed on iteration 0.49 (operand 0xc5bea1698a3d037f)
Testing AMX_FMS32... Failed on iteration 0.59 (operand 0xf8826e6a0d8d1136)
Testing AMX_FMS64... Failed on iteration 0.404 (operand 0xba36882d4b8f059b)
Testing AMX_VECINT... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_VECFP... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_MATINT... OK   
Testing AMX_MATFP... OK   
Testing AMX_GENLUT... OK   

oscarbg avatar Nov 25 '24 05:11 oscarbg

Hmm, yeah, I can reproduce the FMS tests also failing on base model M4 Mac mini. I don't think I overlooked this, but I can't be certain, and I can't rule out changes in how it was being built (iirc I was using Xcode to build/run on iPad).

Assuming the behaviour is different, I wonder if there are silicon changes between M4 iPads and M4 Macs, or if it's a configuration change.

EXTRX/EXTRY fix above still works for those ops.

dougallj avatar Nov 25 '24 07:11 dougallj

Unfortunately I had to update the iPad to latest (18.1.1) before I could re-test, but FMS is failing there too, so it seems likely to either be my mistake, or a configuration change.

dougallj avatar Nov 25 '24 09:11 dougallj

I've just picked up an M4 Max MacBook Pro, on which I see the following:

Testing AMX_LDX... OK   
Testing AMX_LDY... OK   
Testing AMX_LDZ... OK   
Testing AMX_LDZI... OK   
Testing AMX_STX... OK   
Testing AMX_STY... OK   
Testing AMX_STZ... OK   
Testing AMX_STZI... OK   
Testing AMX_EXTRX... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_EXTRY... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_MAC16... OK   
Testing AMX_FMA16... OK   
Testing AMX_FMA32... OK   
Testing AMX_FMA64... OK   
Testing AMX_FMS16... OK   
Testing AMX_FMS32... OK   
Testing AMX_FMS64... OK   
Testing AMX_VECINT... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_VECFP... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_MATINT... OK   
Testing AMX_MATFP... OK   
Testing AMX_GENLUT... OK

I will investigate in due course.

corsix avatar Dec 23 '24 21:12 corsix

... and after rebuilding the test binary (rather than just running what was built on my previous MacBook), I also see the FMS failures:

Testing AMX_LDX... OK   
Testing AMX_LDY... OK   
Testing AMX_LDZ... OK   
Testing AMX_LDZI... OK   
Testing AMX_STX... OK   
Testing AMX_STY... OK   
Testing AMX_STZ... OK   
Testing AMX_STZI... OK   
Testing AMX_EXTRX... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_EXTRY... Failed on iteration 0.1 (operand 0x2812095d95330561)
Testing AMX_MAC16... OK   
Testing AMX_FMA16... OK   
Testing AMX_FMA32... OK   
Testing AMX_FMA64... OK   
Testing AMX_FMS16... Failed on iteration 0.49 (operand 0xc5bea1698a3d037f)
Testing AMX_FMS32... Failed on iteration 0.59 (operand 0xf8826e6a0d8d1136)
Testing AMX_FMS64... Failed on iteration 0.404 (operand 0xba36882d4b8f059b)
Testing AMX_VECINT... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_VECFP... Failed on iteration 0.17 (operand 0x1021b872b64c6724)
Testing AMX_MATINT... OK   
Testing AMX_MATFP... OK   
Testing AMX_GENLUT... OK

corsix avatar Dec 23 '24 21:12 corsix

I think the FMS thing is down to a codegen change in newer versions of clang - see f3582ce.

corsix avatar Dec 24 '24 13:12 corsix

I have pushed 483714bb05 with M4 fixes. The conclusion for indexed loads is that some number of bits are masked off the bottom of the X/Y offset fields, with that number being the minimal number such that the loaded index data doesn't straddle any 64-byte register boundaries.
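
One way to read that rule, as a rough sketch rather than the code in the commit: given the size in bytes of the loaded index data, find the smallest number of low offset bits to clear so that the data cannot cross a 64-byte boundary wherever the masked offset lands. The names `min_masked_bits` and `mask_offset`, and the idea of passing the index data size as a plain byte count, are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: smallest k such that every offset that is a multiple of
   2^k keeps [offset, offset + index_bytes) inside a single 64-byte register.
   Assumes 1 <= index_bytes <= 64. */
static unsigned min_masked_bits(unsigned index_bytes) {
    assert(index_bytes >= 1 && index_bytes <= 64);
    for (unsigned k = 0; k < 6; ++k) {
        int fits = 1;
        /* Try every possible position within a register at this alignment. */
        for (unsigned off = 0; off < 64; off += 1u << k) {
            if (off + index_bytes > 64) {
                fits = 0;
                break;
            }
        }
        if (fits) {
            return k;
        }
    }
    return 6; /* only a fully register-aligned offset is safe */
}

/* Hypothetical helper: clear that many low bits of an X/Y offset. */
static uint64_t mask_offset(uint64_t offset, unsigned index_bytes) {
    unsigned k = min_masked_bits(index_bytes);
    return offset & ~((1ull << k) - 1);
}
```

Under these assumptions, 8 bytes of index data would clear 3 low bits, 32 bytes would clear 5, and 64 bytes would clear all 6.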

corsix avatar Dec 26 '24 15:12 corsix

I was so pissed off ~~at Apple~~ because I bought a Mac for literally one reason. AMX, more specifically Genlut.

So I applied for a return, but I thought hmm maybe I should try this?

Lucky I did, you saved me. Still gotta run perf all the way, but it's a great start. Thank you, thank you, thank you.

I love Apple now.

AMX_VER: 4
Testing AMX_LDX... OK   
Testing AMX_LDY... OK   
Testing AMX_LDZ... OK   
Testing AMX_LDZI... OK   
Testing AMX_STX... OK   
Testing AMX_STY... OK   
Testing AMX_STZ... OK   
Testing AMX_STZI... OK   
Testing AMX_EXTRX... OK   
Testing AMX_EXTRY... OK   
Testing AMX_MAC16... OK   
Testing AMX_FMA16... OK   
Testing AMX_FMA32... OK   
Testing AMX_FMA64... OK   
Testing AMX_FMS16... OK   
Testing AMX_FMS32... OK   
Testing AMX_FMS64... OK   
Testing AMX_VECINT... OK   
Testing AMX_VECFP... OK   
Testing AMX_MATINT... OK   
Testing AMX_MATFP... OK   
Testing AMX_GENLUT... OK 

DanielShuey avatar Aug 28 '25 03:08 DanielShuey

Here it is just in case anyone else comes across this.

On a 10-core M4 iMac

perf_output.txt fma16_mat_f16f16_x*y+z (far z)

| Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
|---|---|---|---|---|---|---|
| 1 per thread | 2009.4 GFLOPS | 2454.8 GFLOPS | 3649.5 GFLOPS | 3653.2 GFLOPS | 3701.1 GFLOPS | 4162.0 GFLOPS |
| 2 per thread | 3986.7 GFLOPS | 4706.4 GFLOPS | 4527.6 GFLOPS | 4571.6 GFLOPS | 4606.0 GFLOPS | 4621.4 GFLOPS |

So glad Apple is still doing their own thing, I was getting tired of nibble splits and shuffle instructions from 2006. 😄

DanielShuey avatar Sep 01 '25 08:09 DanielShuey