Segfaults when running examples using GPU inside a VM
When running mlx-examples/mnist on a macOS 14.1.1 VM running via Parallels 19.1.1:
user@Users-Virtual-Machine mnist % python main.py --gpu
*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[AppleParavirtDevice newArgumentEncoderWithLayout:]: unrecognized selector sent to instance 0x149029e00'
*** First throw call stack:
(
0 CoreFoundation 0x000000018a682800 __exceptionPreprocess + 176
1 libobjc.A.dylib 0x000000018a179eb4 objc_exception_throw + 60
2 CoreFoundation 0x000000018a7343bc -[NSObject(NSObject) __retain_OA] + 0
3 AppleParavirtGPUMetalIOGPUFamily 0x0000000101fe1118 doUncompressedBlit + 11160
4 Metal 0x0000000194729184 -[_MTLDevice newArgumentEncoderWithArguments:structType:] + 136
5 libmlx.dylib 0x00000001113f660c _ZN3mlx4core6Gather8eval_gpuERKNSt3__16vectorINS0_5arrayENS2_9allocatorIS4_EEEERS4_ + 1380
6 libmlx.dylib 0x00000001113fc54c _ZNSt3__110__function6__funcIZN3mlx4core5metal9make_taskERNS3_5arrayENS_6vectorINS_13shared_futureIvEENS_9allocatorIS9_EEEENS_10shared_ptrINS_7promiseIvEEEEbE3$_2NSA_ISH_EEFvvEEclEv + 148
7 libmlx.dylib 0x0000000110d5ff14 _ZN3mlx4core9scheduler12StreamThread9thread_fnEv + 500
8 libmlx.dylib 0x0000000110d600d0 _ZNSt3__114__thread_proxyB7v160006INS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN3mlx4core9scheduler12StreamThreadEFvvEPSA_EEEEEPvSF_ + 72
9 libsystem_pthread.dylib 0x000000018a531034 _pthread_start + 136
10 libsystem_pthread.dylib 0x000000018a52be3c thread_start + 8
)
libc++abi: terminating due to uncaught exception of type NSException
zsh: abort python main.py --gpu
I've had similar problems trying to use MPS inside a VM. Are there any plans to support the use of Metal inside VMs?
I'll admit I don't have any experience using macOS VMs on Parallels - do you have access to Apple Silicon GPUs in your VM?
Running the command system_profiler SPDisplaysDataType might help us figure out if there is a GPU with Metal support.
If there isn't Metal support, then I'm afraid we won't be able to help you much further.
system_profiler SPDisplaysDataType doesn't report anything. Running llama.cpp with MPS support compiled in, it reports this:
ggml_metal_init: allocating
ggml_metal_init: found device: Apple Paravirtual device
ggml_metal_init: picking default device: Apple Paravirtual device
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/user/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple Paravirtual device
ggml_metal_init: GPU family: MTLGPUFamilyApple5 (1005)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 1024.00 MiB
ggml_metal_init: maxTransferRate = built-in GPU
Very basic usage of Metal, e.g. https://github.com/neurolabusc/Metal/blob/main/minimal, runs as expected, but more advanced examples, e.g. https://github.com/neurolabusc/Metal/tree/main/mmul, have errors (in that particular example, the MPSMultiplication results are all 0).
Is this a VM running on Intel or Apple Silicon? Your report says "hasUnifiedMemory = true", so I'm assuming it's Apple Silicon?
Yes, it's a VM running on an M2 Pro.
Oh, if llama.cpp works, then it must just be a missing ArgumentEncoder function for virtual devices at the Metal framework level, rather than something bigger like I was worried about. Let me see if I can look into it any further, but since this is something at the Metal framework level, I wouldn't expect a quick update.
That said, we only use Metal argument encoders for the Gather and Scatter primitives to do multi-dimensional indexing. They are needed because we have a container that holds multiple device buffers of indices that all need to be passed to the kernel. It should be possible for someone to write a few simple cases of the Gather and Scatter primitives without a Metal argument buffer by simply unrolling those containers into separate kernel arguments (see the sketch below). For anyone interested in looking into that, these two files would be a starting point: https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/indexing.cpp and https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/indexing.metal
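To make the "unrolling" idea concrete, here is a rough host-side sketch using metal-cpp. This is not the actual mlx code; the function and parameter names are made up, and the real change would also require a matching kernel signature with a fixed maximum number of index buffers:

```cpp
// Hypothetical sketch (metal-cpp): bind each index buffer at its own
// [[buffer(n)]] slot with setBuffer, instead of packing them into a Metal
// argument buffer via an argument encoder. The matching kernel would declare
// a fixed maximum number of index arguments, e.g.
//   device const int* idx0 [[buffer(2)]], device const int* idx1 [[buffer(3)]], ...
#include <Metal/Metal.hpp>

#include <vector>

// `first_slot` is the first free [[buffer(n)]] slot after the kernel's
// regular input/output buffers; both names are invented for this sketch.
void bind_index_buffers_unrolled(
    MTL::ComputeCommandEncoder* enc,
    const std::vector<MTL::Buffer*>& index_buffers,
    int first_slot) {
  for (size_t i = 0; i < index_buffers.size(); ++i) {
    enc->setBuffer(index_buffers[i], /*offset=*/0,
                   first_slot + static_cast<int>(i));
  }
}
```

The obvious trade-off is that the kernel can only accept up to a fixed number of index buffers, so the general case would still need the argument-buffer path.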
It's possible that Metal argument buffers aren't supported in a VM environment? A simple check could confirm that:
device->argumentBuffersSupport(): the device must be Tier 2 to support argument buffers.
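For reference, a standalone metal-cpp sketch of that check (assumed setup, not part of mlx) could look like this; Tier 1 corresponds to the enum value 0:

```cpp
// Standalone sketch (metal-cpp): query the argument-buffer tier of the
// default device. MTL::ArgumentBuffersTier1 == 0, MTL::ArgumentBuffersTier2 == 1;
// Tier 2 is what the argument-encoder path needs.
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Metal/Metal.hpp>

#include <cstdio>

int main() {
  MTL::Device* device = MTL::CreateSystemDefaultDevice();
  if (!device) {
    std::printf("No Metal device found\n");
    return 1;
  }
  std::printf("Device: %s\n", device->name()->utf8String());
  std::printf("argumentBuffersSupport: %lu\n",
              static_cast<unsigned long>(device->argumentBuffersSupport()));
  device->release();
  return 0;
}
```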
I've confirmed that argumentBuffersSupport is reporting 0 in my macOS VM.
if llama.cpp works
I wouldn't say that llama.cpp works - it detects the paravirtualized GPU, but it reports errors when attempting to offload any layers onto the GPU.
In that case, unfortunately, it might be a larger issue of Metal support on virtual devices - are you aware if a simple Metal program is able to run on the machine? Something like this, maybe: https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu?language=objc
If it's reporting 0 then it's Tier 1, and most probably the argument buffer APIs won't work. Well, they may work, but there are just a lot of limitations.
are you aware if a simple Metal program is able to run on the machine? Something like this, maybe: https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu?language=objc
Yes, that program runs successfully:
user@Users-Virtual-Machine MetalComputeBasic % ./MetalComputeBasic
Compute results as expected
2023-12-11 10:11:22.711 MetalComputeBasic[656:4830] Execution finished
However, increasing the memory usage makes this sample program fail in the VM (e.g. changing const unsigned long arrayLength = 1 << 24; to const unsigned long arrayLength = 1 << 27;):
Compute ERROR: index=0 result=0 vs 0.216261=a+b
Assertion failed: (result[index] == (a[index] + b[index])), function -[MetalAdder verifyResults], file MetalAdder.m, line 158.
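One possible factor (an assumption on my part, not something confirmed in this thread): the llama.cpp log above reports recommendedMaxWorkingSetSize = 1024.00 MiB for the paravirtual device, and at arrayLength = 1 << 27 the Apple sample's three float buffers (a, b, result) total 1.5 GiB, well past that limit. A quick hedged sketch of that comparison against the device:

```cpp
// Hedged sketch (metal-cpp): compare the sample's total buffer footprint
// against the device's recommendedMaxWorkingSetSize (1024 MiB in the VM above).
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Metal/Metal.hpp>

#include <cstdio>

int main() {
  MTL::Device* device = MTL::CreateSystemDefaultDevice();
  if (!device) return 1;

  const unsigned long arrayLength = 1UL << 27;       // value from the failing run
  const unsigned long bufferBytes = arrayLength * sizeof(float);
  const unsigned long totalBytes = 3 * bufferBytes;  // a, b, and result buffers

  std::printf("per-buffer: %lu MiB, total: %lu MiB, recommended max: %llu MiB\n",
              bufferBytes >> 20, totalBytes >> 20,
              static_cast<unsigned long long>(
                  device->recommendedMaxWorkingSetSize()) >> 20);
  device->release();
  return 0;
}
```

Whether the wrong results come from exceeding the working set or from the Tier 1 limitations mentioned earlier, I can't say for sure.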
We should be able to run on virtual devices once #683 lands!