Core dump when using vector instructions inside a lima VM on a Mac M4
Description
Context
- Lima version: 1.0.6
- Host OS: Sequoia 15.4
- Host system: Apple M4 Pro, 24 GB
- Default settings
Description
Inside a Lima VM, I try to use the Python JAX library, whose JIT compiler tries to use the available vector instructions. On the host system, as well as inside an OrbStack VM, this works without issue, but inside the Lima VM I get a core dump.
Reproduction steps
limactl start
lima
# Setup
sudo apt update
sudo apt install pipx gdb
pipx install uv
pipx ensurepath
source ~/.bashrc
mkdir ~/test && cd ~/test
uv init
uv add jax[cpu]
# Causing the crash
ulimit -c unlimited
source .venv/bin/activate
python3
>>> import jax.numpy as jnp
>>> jnp.arange(2)
# Checking the core dump
gdb python3 core
(gdb) x/10i $pc
=> 0xf2606c4d7008: index z0.s, #0, #1
0xf2606c4d700c: ldr x8, [x0, #24]
0xf2606c4d7010: mov x0, xzr
0xf2606c4d7014: ldr x8, [x8]
0xf2606c4d7018: str d0, [x8]
0xf2606c4d701c: ldp x29, x30, [sp], #16
0xf2606c4d7020: ret
0xf2606c4d7024: udf #0
0xf2606c4d7028: udf #0
0xf2606c4d702c: udf #0
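The interactive steps above could also be wrapped in a small script, so that the crash can be retried after each configuration change without going through the REPL. This is only a sketch: `run_and_report` is a hypothetical helper name, and it reports whichever signal terminated the child process rather than assuming a particular one.

```python
import signal
import subprocess
import sys


def run_and_report(code: str) -> str:
    """Run `code` in a fresh interpreter and describe how it exited."""
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True)
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed
        # by the signal with that number.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with status {proc.returncode}"


if __name__ == "__main__":
    # Same two statements as in the interactive session above.
    print(run_and_report("import jax.numpy as jnp; jnp.arange(2)"))
```

Run it from the activated virtual environment so that the child interpreter can import jax.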
More information
I tried to do the same thing inside an orbstack VM, and I do not get a core dump.
I tried to do the same thing inside a lima VM on an M3 machine, and I do not get a core dump.
On the M4 system, when I compare lscpu between lima and orbstack, it seems like inside the lima VM the host system's vector instructions are exposed (but cause a core dump when used), while inside the orbstack VM, the vector instructions are not exposed.
Lima:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: Apple
Model name: -
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 4
Socket(s): -
Cluster(s): 1
Stepping: 0x0
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 flagm2 frint svei8mm svebf16 bf16 afp sme smei16i64 smef64f64 smei8i32 smef16f32 smeb16f32 smef32f32 sme2 smei16i32 smebi32i32
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Orbstack:
$ orbctl run lscpu
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: Apple
Model name: -
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 12
Socket(s): -
Cluster(s): 1
Stepping: 0x0
CPU(s) scaling MHz: 100%
CPU max MHz: 2000.0000
CPU min MHz: 2000.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit uscat ilrcpc flagm sb dcpodp flagm2 frint bf16 afp
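To make the difference between the two Flags lines easier to see, they can be compared with a few lines of Python. This is a minimal sketch; the flag strings are copied from the two lscpu outputs above, and `flag_diff` is a hypothetical helper name.

```python
# Flags lines copied from the lscpu outputs above.
LIMA_FLAGS = (
    "fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid "
    "asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit uscat "
    "ilrcpc flagm sb paca pacg dcpodp sve2 flagm2 frint svei8mm svebf16 bf16 "
    "afp sme smei16i64 smef64f64 smei8i32 smef16f32 smeb16f32 smef32f32 sme2 "
    "smei16i32 smebi32i32"
)
ORBSTACK_FLAGS = (
    "fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid "
    "asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit uscat "
    "ilrcpc flagm sb dcpodp flagm2 frint bf16 afp"
)


def flag_diff(a: str, b: str) -> list[str]:
    """Return the flags present in `a` but not in `b`, in the order of `a`."""
    seen = set(b.split())
    return [flag for flag in a.split() if flag not in seen]


only_in_lima = flag_diff(LIMA_FLAGS, ORBSTACK_FLAGS)
only_in_orbstack = flag_diff(ORBSTACK_FLAGS, LIMA_FLAGS)

print("only in Lima:", " ".join(only_in_lima))
print("only in OrbStack:", " ".join(only_in_orbstack) or "(none)")
```

With the outputs above, the flags that appear only in the Lima guest are paca, pacg, the SVE-related flags (sve2, svei8mm, svebf16) and the SME-related flags; no flag appears only in the OrbStack guest.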
With vz or QEMU?
Specifying the other one in limactl create --vm-type=... may work?
@AkihiroSuda I tried with both VM types:
~ limactl list
NAME STATUS SSH VMTYPE ARCH CPUS MEMORY DISK DIR
default Stopped 127.0.0.1:0 vz aarch64 4 4GiB 100GiB ~/.lima/default
qemu Running 127.0.0.1:49819 qemu aarch64 4 4GiB 100GiB ~/.lima/qemu
In both cases I get a core dump with the default settings. The lscpu output posted in the ticket was with vmType=vz; here is what I get with qemu:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Cortex-A76
Model: 1
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r4p1
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp dit sve2 svei8mm svebf16 smei16i64 smef64f64 smei8i32 smef16f32 smeb16f32 smef32f32 sme2 smei16i32 smebi32i32
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
I believe that if I can configure QEMU not to expose the SVE2 instructions, that could work around the issue. Is there documentation available on the schema of the lima configuration yaml?
Is there documentation available on the schema of the lima configuration yaml?
QEMU's -cpu string can be specified in .cpuType
https://github.com/lima-vm/lima/blob/0625d0b084450e874869dcbc9f63d4312797c3fe/templates/centos-stream-10.yaml#L31-L35
For Intel we get a SIGILL the first time that AVX-512 is used on Darwin, and then you have to enable it on the host CPU thread (outside of the VM)
Is something similar happening here with SVE2, or what is in the core dump? The solution could be the same: mask some instructions...
- https://github.com/lima-vm/lima/pull/3065
cpuType[arch] += ",-sve2"
Does cpuinfo report that the SVE instructions are available?
go run github.com/klauspost/cpuid/v2/cmd/cpuid@latest
Does SVE2 code work on Darwin, or is it not implemented?
https://developer.arm.com/documentation/102340/0100/Program-with-SVE2
Hi @afbjorklund, @AkihiroSuda, thanks for helping me look into this!
Does cpuinfo report that the SVE instructions are available?
According to its Wikipedia entry, the M4 does not implement SVE: https://en.wikipedia.org/wiki/Apple_M4
Running cpuid gives:
Name: Apple M4 Pro
Vendor String: Apple
Vendor ID: VendorUnknown
PhysicalCores: 12
Threads Per Core: 1
Logical Cores: 12
CPU Family 399882554 Model: 0 Stepping: 0
Features: AESARM,ASIMDDP,ASIMDRDM,ATOMICS,CRC32,DCPOP,FCMA,FHM,FP,FPHP,GPA,JSCVT,LRCPC,PMULL,TS,SHA1,SHA2,SHA3,SHA512
Microarchitecture level: 0
Cacheline bytes: 128
L1 Instruction Cache: 131072 bytes
L1 Data Cache: 65536 bytes
L2 Cache: 4194304 bytes
L3 Cache: -1 bytes
cpuType[arch] += ",-sve2"
I've tried the following configuration:
cpuType:
aarch64: "cortex-a76,-sve2"
but I think it expects cpu features to be supplied like key=value. Verified by trying to start qemu directly:
> qemu-system-aarch64 -m 4096 -cpu cortex-a76,-sve -machine virt,accel=hvf
qemu-system-aarch64: Expected key=value format, found -sve.
> qemu-system-aarch64 -m 4096 -cpu cortex-a76,sve=off -machine virt,accel=hvf
qemu-system-aarch64: can't apply global cortex-a76-arm-cpu.sve=off: Property 'cortex-a76-arm-cpu.sve' not found
So now it seems like the syntax is right, but the sve feature cannot be set. I tried to figure out what features are available for toggling on the different CPU choices, but it doesn't seem like the QEMU version shipped with lima includes query-cpu-model-expansion (see: https://qemu-project.gitlab.io/qemu/system/arm/cpu-features.html)
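The syntax change between the two qemu-system-aarch64 invocations above (from `-sve` to `sve=off`) can be sketched as a small string rewrite. This is only an illustration of the key=value format that the first error message asks for; `to_key_value` is a hypothetical helper, and it does not check whether the chosen CPU model actually has the property, which is what the second error is about.

```python
def to_key_value(cpu: str) -> str:
    """Rewrite '+feat' / '-feat' entries of a -cpu string as 'feat=on' / 'feat=off'."""
    # The first comma-separated field is the CPU model; the rest are features.
    model, *features = cpu.split(",")
    converted = []
    for feat in features:
        if feat.startswith("-"):
            converted.append(f"{feat[1:]}=off")
        elif feat.startswith("+"):
            converted.append(f"{feat[1:]}=on")
        else:
            # Already in key=value form; leave it untouched.
            converted.append(feat)
    return ",".join([model, *converted])


print(to_key_value("cortex-a76,-sve"))  # prints "cortex-a76,sve=off"
```

Only leading `+` or `-` characters on feature entries are rewritten, so the hyphen inside a model name such as cortex-a76 is left alone.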
If you have any further suggestions how to proceed, I can try a couple more things!
Hi @AkihiroSuda @afbjorklund I'd be happy to continue debugging this, but I'm not sure what other things to try.