Segfaults when running examples using GPU inside a VM
When running mlx-examples/mnist on a macOS 14.1.1 VM running via Parallels 19.1.1:
user@Users-Virtual-Machine mnist % python main.py --gpu
*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[AppleParavirtDevice newArgumentEncoderWithLayout:]: unrecognized selector sent to instance 0x149029e00'
*** First throw call stack:
(
0 CoreFoundation 0x000000018a682800 __exceptionPreprocess + 176
1 libobjc.A.dylib 0x000000018a179eb4 objc_exception_throw + 60
2 CoreFoundation 0x000000018a7343bc -[NSObject(NSObject) __retain_OA] + 0
3 AppleParavirtGPUMetalIOGPUFamily 0x0000000101fe1118 doUncompressedBlit + 11160
4 Metal 0x0000000194729184 -[_MTLDevice newArgumentEncoderWithArguments:structType:] + 136
5 libmlx.dylib 0x00000001113f660c _ZN3mlx4core6Gather8eval_gpuERKNSt3__16vectorINS0_5arrayENS2_9allocatorIS4_EEEERS4_ + 1380
6 libmlx.dylib 0x00000001113fc54c _ZNSt3__110__function6__funcIZN3mlx4core5metal9make_taskERNS3_5arrayENS_6vectorINS_13shared_futureIvEENS_9allocatorIS9_EEEENS_10shared_ptrINS_7promiseIvEEEEbE3$_2NSA_ISH_EEFvvEEclEv + 148
7 libmlx.dylib 0x0000000110d5ff14 _ZN3mlx4core9scheduler12StreamThread9thread_fnEv + 500
8 libmlx.dylib 0x0000000110d600d0 _ZNSt3__114__thread_proxyB7v160006INS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN3mlx4core9scheduler12StreamThreadEFvvEPSA_EEEEEPvSF_ + 72
9 libsystem_pthread.dylib 0x000000018a531034 _pthread_start + 136
10 libsystem_pthread.dylib 0x000000018a52be3c thread_start + 8
)
libc++abi: terminating due to uncaught exception of type NSException
zsh: abort python main.py --gpu
I've had similar problems trying to use MPS inside a VM. Are there any plans to support the use of Metal inside VMs?
I'll admit I don't have any experience using macOS VMs on Parallels - do you have access to Apple Silicon GPUs in your VM?
Running the command system_profiler SPDisplaysDataType might help us figure out if there is a GPU with Metal support.
If there isn't Metal support, then I'm afraid we won't be able to help you much further.
system_profiler SPDisplaysDataType doesn't report anything. Running llama.cpp with MPS support compiled in, it reports this:
ggml_metal_init: allocating
ggml_metal_init: found device: Apple Paravirtual device
ggml_metal_init: picking default device: Apple Paravirtual device
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/user/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple Paravirtual device
ggml_metal_init: GPU family: MTLGPUFamilyApple5 (1005)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 1024.00 MiB
ggml_metal_init: maxTransferRate = built-in GPU
Very basic usage of Metal, e.g. https://github.com/neurolabusc/Metal/blob/main/minimal, runs as expected, but more advanced examples, e.g. https://github.com/neurolabusc/Metal/tree/main/mmul, have errors (in that particular example, the MPSMultiplication results are all 0).
Is this a VM running on Intel or Apple Silicon? Your report says "hasUnifiedMemory = true", so I'm assuming it's Apple Silicon?
Yes, it's a VM running on an M2 Pro.
Oh, if llama.cpp works, then it must just be a missing ArgumentEncoder function for virtual devices at the Metal framework level, rather than something bigger like I was worried about. Let me see if I can look into it any further, but since this is something at the Metal framework level, I wouldn't expect a quick update.
That said, we only use Metal argument encoders for the Gather and Scatter primitives to do multi-dimensional indexing. They are needed because we have a container that holds multiple device buffers of indices that all need to be passed to the kernel. It should be possible for someone to write a few simple cases of the Gather and Scatter primitives without a Metal argument buffer by simply unrolling those containers into separate kernel arguments (see the sketch below). For anyone interested in looking into that, these two files would be a starting point: https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/indexing.cpp and https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/indexing.metal
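To make the "unrolling" idea concrete, here is a rough host-side sketch using metal-cpp. This is not the actual mlx code; the function and parameter names are made up, and the real change would also require a matching kernel signature with a fixed maximum number of index buffers:

```cpp
// Hypothetical sketch (metal-cpp): bind each index buffer at its own
// [[buffer(n)]] slot with setBuffer, instead of packing them into a Metal
// argument buffer via an argument encoder. The matching kernel would declare
// a fixed maximum number of index arguments, e.g.
//   device const int* idx0 [[buffer(2)]], device const int* idx1 [[buffer(3)]], ...
#include <Metal/Metal.hpp>

#include <vector>

// `first_slot` is the first free [[buffer(n)]] slot after the kernel's
// regular input/output buffers; both names are invented for this sketch.
void bind_index_buffers_unrolled(
    MTL::ComputeCommandEncoder* enc,
    const std::vector<MTL::Buffer*>& index_buffers,
    int first_slot) {
  for (size_t i = 0; i < index_buffers.size(); ++i) {
    enc->setBuffer(index_buffers[i], /*offset=*/0,
                   first_slot + static_cast<int>(i));
  }
}
```

The obvious trade-off is that the kernel can only accept up to a fixed number of index buffers, so the general case would still need the argument-buffer path.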
It's possible that Metal argument buffers aren't supported in a VM environment? A simple check could confirm that:
device->argumentBuffersSupport(): the device must be Tier 2 to support argument buffers.
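For reference, a standalone metal-cpp sketch of that check (assumed setup, not part of mlx) could look like this; Tier 1 corresponds to the enum value 0:

```cpp
// Standalone sketch (metal-cpp): query the argument-buffer tier of the
// default device. MTL::ArgumentBuffersTier1 == 0, MTL::ArgumentBuffersTier2 == 1;
// Tier 2 is what the argument-encoder path needs.
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Metal/Metal.hpp>

#include <cstdio>

int main() {
  MTL::Device* device = MTL::CreateSystemDefaultDevice();
  if (!device) {
    std::printf("No Metal device found\n");
    return 1;
  }
  std::printf("Device: %s\n", device->name()->utf8String());
  std::printf("argumentBuffersSupport: %lu\n",
              static_cast<unsigned long>(device->argumentBuffersSupport()));
  device->release();
  return 0;
}
```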
I've confirmed that argumentBuffersSupport is reporting 0 in my macOS VM.
if llama.cpp works
I wouldn't say that llama.cpp works - it detects the paravirtualized GPU, but it reports errors when attempting to offload any layers onto the GPU.
In that case, unfortunately, it might be a larger issue of Metal support on virtual devices - are you aware if a simple Metal program is able to run on the machine? Something like this, maybe: https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu?language=objc
If it's reporting 0 then it's Tier 1, and most probably the argument buffer APIs won't work. Well, they may work, but there are just a lot of limitations.
are you aware if a simple Metal program is able to run on the machine? Something like this, maybe: https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu?language=objc
Yes, that program runs successfully:
user@Users-Virtual-Machine MetalComputeBasic % ./MetalComputeBasic
Compute results as expected
2023-12-11 10:11:22.711 MetalComputeBasic[656:4830] Execution finished
However, increasing the memory usage makes this sample program fail in the VM (e.g. changing const unsigned long arrayLength = 1 << 24; to const unsigned long arrayLength = 1 << 27;):
Compute ERROR: index=0 result=0 vs 0.216261=a+b
Assertion failed: (result[index] == (a[index] + b[index])), function -[MetalAdder verifyResults], file MetalAdder.m, line 158.
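One possible factor (an assumption on my part, not something confirmed in this thread): the llama.cpp log above reports recommendedMaxWorkingSetSize = 1024.00 MiB for the paravirtual device, and at arrayLength = 1 << 27 the Apple sample's three float buffers (a, b, result) total 1.5 GiB, well past that limit. A quick hedged sketch of that comparison against the device:

```cpp
// Hedged sketch (metal-cpp): compare the sample's total buffer footprint
// against the device's recommendedMaxWorkingSetSize (1024 MiB in the VM above).
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Metal/Metal.hpp>

#include <cstdio>

int main() {
  MTL::Device* device = MTL::CreateSystemDefaultDevice();
  if (!device) return 1;

  const unsigned long arrayLength = 1UL << 27;       // value from the failing run
  const unsigned long bufferBytes = arrayLength * sizeof(float);
  const unsigned long totalBytes = 3 * bufferBytes;  // a, b, and result buffers

  std::printf("per-buffer: %lu MiB, total: %lu MiB, recommended max: %llu MiB\n",
              bufferBytes >> 20, totalBytes >> 20,
              static_cast<unsigned long long>(
                  device->recommendedMaxWorkingSetSize()) >> 20);
  device->release();
  return 0;
}
```

Whether the wrong results come from exceeding the working set or from the Tier 1 limitations mentioned earlier, I can't say for sure.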
We should be able to run on virtual devices once #683 lands!