
[BUG] Distributed inference OOMs on machines with different RAM size

Open silibattlebot opened this issue 10 months ago • 11 comments

Describe the bug

Running distributed inference of DeepSeek-R1-3bit on three M2 Ultra machines fails.

Desktop:

  • OS Version: macOS Sequoia 15.2
  • mpirun (Open MPI) 4.1.1 installed via conda
  • Python 3.12.7 | packaged by Anaconda, Inc
  • MLX version: 0.22.0, built from source at e6a7ab967530866eb89c013f833f7c525bec10ca
  • machine1: 192GB M2 Ultra
  • machine2: 128GB M2 Ultra
  • machine3: 64GB M2 Ultra
  • Connectivity: 1Gb Ethernet switch connected to Ethernet 1 of all devices.

To Reproduce

  1. Follow the instructions in this gist, including setting the GPU limit of each machine to ~80% of its capacity (the rough per-machine numbers are sketched after the hosts file below).
  2. Launch inference:
mpirun -np 3 --hostfile hosts.txt /path/to/anaconda/python3 /path/to/pipeline_generate.py --prompt "Hello world"

hosts.txt:

machine1 slots=1 
machine2 slots=1
machine3 slots=1
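
For reference, the ~80% figure from step 1 works out to roughly the following per-machine values. This is illustrative arithmetic only; the gist describes how to actually apply the limit.

# Illustrative only: ~80% of each machine's RAM in MB, the ballpark
# per-machine GPU memory limit referred to in step 1.
ram_gb = {"machine1": 192, "machine2": 128, "machine3": 64}
for host, gb in ram_gb.items():
    limit_mb = int(gb * 1024 * 0.8)
    print(f"{host}: GPU limit ~= {limit_mb} MB")
# machine1: ~157286 MB, machine2: ~104857 MB, machine3: ~52428 MB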

Actual behavior

  1. Machines 1 and 2 (192GB and 128GB) each load about 105GB of weights without using any swap, leaving roughly 90GB and 25GB free respectively.
  2. Machine 3 (64GB) hits 58.5GB of RAM utilization and does not stop there.
  3. Machine 1 shows an error:
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
  4. prterun then exits with an error on Machine 1.
  5. Machines 2 and 3 hold ~100GB in RAM and/or swap for another 30 seconds before exiting.

Expected behavior

Each machine loads up about 80-90% of its memory with weights and does not OOM. Inference eventually runs and produces tokens.

Additional context

MPI log:

(base) user@machine1 ~ % mpirun -np 3 --hostfile hosts.txt /opt/homebrew/anaconda3/bin/python3 /Users/user/deepseek/pipeline_generate.py --prompt "Hello world"
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 27562.67it/s]
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 41342.20it/s]
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 34166.70it/s]
[WARNING] Generating with a model that requires 100534 MB which is close to the maximum recommended size of 48000 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-examples/tree/main/llms#large-models
[WARNING] Generating with a model that requires 116168 MB which is close to the maximum recommended size of 110000 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-examples/tree/main/llms#large-models
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
[machine1:42301] *** Process received signal ***
[machine1:42301] Signal: Abort trap: 6 (6)
[machine1:42301] Signal code:  (0)
[machine1:42301] [ 0] 0   libsystem_platform.dylib            0x0000000198072e04 _sigtramp + 56
[machine1:42301] [ 1] 0   libsystem_pthread.dylib             0x000000019803bf70 pthread_kill + 288
[machine1:42301] [ 2] 0   libsystem_c.dylib                   0x0000000197f48908 abort + 128
[machine1:42301] [ 3] 0   libc++abi.dylib                     0x0000000197ff244c _ZN10__cxxabiv130__aligned_malloc_with_fallbackEm + 0
[machine1:42301] [ 4] 0   libc++abi.dylib                     0x0000000197fe0a24 _ZL28demangling_terminate_handlerv + 320
[machine1:42301] [ 5] 0   libobjc.A.dylib                     0x0000000197c893f4 _ZL15_objc_terminatev + 172
[machine1:42301] [ 6] 0   libc++abi.dylib                     0x0000000197ff1710 _ZSt11__terminatePFvvE + 16
[machine1:42301] [ 7] 0   libc++abi.dylib                     0x0000000197ff16b4 _ZSt9terminatev + 108
[machine1:42301] [ 8] 0   libdispatch.dylib                   0x0000000197e89688 _dispatch_client_callout4 + 40
[machine1:42301] [ 9] 0   libdispatch.dylib                   0x0000000197ea5c88 _dispatch_mach_msg_invoke + 464
[machine1:42301] [10] 0   libdispatch.dylib                   0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [11] 0   libdispatch.dylib                   0x0000000197ea69dc _dispatch_mach_invoke + 456
[machine1:42301] [12] 0   libdispatch.dylib                   0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [13] 0   libdispatch.dylib                   0x0000000197e91764 _dispatch_lane_invoke + 432
[machine1:42301] [14] 0   libdispatch.dylib                   0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [15] 0   libdispatch.dylib                   0x0000000197e91730 _dispatch_lane_invoke + 380
[machine1:42301] [16] 0   libdispatch.dylib                   0x0000000197e9c9a0 _dispatch_root_queue_drain_deferred_wlh + 288
[machine1:42301] [17] 0   libdispatch.dylib                   0x0000000197e9c1ec _dispatch_workloop_worker_thread + 540
[machine1:42301] [18] 0   libsystem_pthread.dylib             0x00000001980383d8 _pthread_wqthread + 288
[machine1:42301] [19] 0   libsystem_pthread.dylib             0x00000001980370f0 start_wqthread + 8
[machine1:42301] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/opt/homebrew/anaconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node machine1 exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

silibattlebot avatar Jan 28 '25 02:01 silibattlebot

So the problem here is that the pipeline parallelism is pretty dumb and assumes each machine has an equal amount of RAM. It divides the model evenly into three sections, and the third section is way too big for your 64GB M2 Ultra.

We could do something a bit more dynamic based on machine size to support heterogeneous machines.
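
For illustration, a memory-proportional split might look roughly like the sketch below. This is hypothetical: the partition_layers helper, the 61-layer count, and the RAM figures are placeholders, not how pipeline_generate.py currently splits the model.

# Hypothetical sketch: assign pipeline layers in proportion to each rank's RAM
# instead of splitting them evenly across ranks.
def partition_layers(num_layers, ram_gb):
    total = sum(ram_gb)
    counts = [round(num_layers * gb / total) for gb in ram_gb]
    counts[-1] = num_layers - sum(counts[:-1])  # absorb rounding drift
    starts = [sum(counts[:i]) for i in range(len(counts))]
    return [(s, s + c) for s, c in zip(starts, counts)]

# Three M2 Ultras with 192, 128 and 64 GB and a 61-layer model:
print(partition_layers(61, [192, 128, 64]))
# -> [(0, 30), (30, 50), (50, 61)] rather than three equal ~20-layer chunks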

awni avatar Jan 28 '25 14:01 awni

Is MLX doing its own sharding? I thought you needed Exo for that.

ProjectAtlantis-dev avatar Jan 28 '25 18:01 ProjectAtlantis-dev

Yes, MLX can do distributed inference directly using mx.distributed. Right now it's a lower-level API than what you can do with Exo, so it depends on what you want to do.

awni avatar Jan 28 '25 18:01 awni

"mpirun -np 3 --hostfile hosts.txt /opt/homebrew/anaconda3/bin/python3 /Users/user/deepseek/pipeline_generate.py --prompt "Hello world"" I have modified the corresponding path, but there is no response when running the above command. Is there a problem with my related dependency download?

seconduncleniu avatar Mar 04 '25 10:03 seconduncleniu

Check out the getting started guide for mx.distributed and make sure you can run that simple example. If it doesn't work, there are some tips there on setting up MPI that can help. If that works, then the above should also work; if it doesn't, let us know.
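
As a reference point, the minimal check from the getting-started guide is roughly the following (paraphrased; the test_distributed.py filename is just an example):

# Minimal mx.distributed sanity check, roughly the docs' hello-world example.
# Run with: mpirun -np 2 python3 test_distributed.py   (filename is arbitrary)
import mlx.core as mx

group = mx.distributed.init()            # picks up the MPI launch environment
x = mx.distributed.all_sum(mx.ones(10))  # every rank contributes a vector of ones
print(group.rank(), x)                   # each rank should print a vector of 2s for -np 2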

awni avatar Mar 04 '25 14:03 awni

I have worked through the Getting Started guide for mx.distributed. When running mpirun -np 2 --host host1,host2 python3 pipeline_generate.py, there is an issue with the group = mx.distributed.init(backend="mpi") line in pipeline_generate.py. Has this script been updated? Can you guide me on the next steps? Thank you.

seconduncleniu avatar Mar 05 '25 03:03 seconduncleniu

Could you share more details about the issue you are seeing?

awni avatar Mar 05 '25 03:03 awni

mpirun -np 2 python3 pipeline_generate.py \
  --prompt "What number is larger 6.9 or 6.11?" \
  --max-tokens 128 \
  --model mlx-community/DeepSeek-R1-Distill-Llama-70B-3bit

Traceback (most recent call last):
  File "/Users/zhangchi/pipeline_generate.py", line 100, in <module>
    group = mx.distributed.init(backend="mpi")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: init(): incompatible function arguments. The following argument types are supported:
    1. init(strict: bool = False) -> Group
Invoked with types: kwargs = { backend: str }

(The same traceback is printed by both ranks.)

The above is the command I ran and the error it produced. The pipeline_generate.py file is the one from https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py

seconduncleniu avatar Mar 05 '25 05:03 seconduncleniu

Maybe try updating MLX? That line should work in a recent version.
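
If updating isn't immediately possible, a version-tolerant call like the sketch below can confirm whether the installed MLX predates the backend keyword (a diagnostic workaround, not part of the script):

# Workaround sketch: fall back to the older init() signature if this MLX
# build does not yet accept a `backend` keyword argument.
import mlx.core as mx

print("mlx version:", mx.__version__)
try:
    group = mx.distributed.init(backend="mpi")
except TypeError:
    # Older MLX: init() only takes `strict`; MPI was the only backend then.
    group = mx.distributed.init()
print("rank", group.rank(), "of", group.size())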

awni avatar Mar 05 '25 06:03 awni

I am using mlx 0.23.1. May I ask what the reason is? Running pipeline_generate.py always gives an error. Is this a problem with the file?

seconduncleniu avatar Mar 05 '25 09:03 seconduncleniu

Why does the following command consistently produce no response, neither reporting an error nor continuing to run?

mpirun --oversubscribe -np 2 --hostfile hosts.txt python3 pipeline_generate.py --prompt "What number is larger 6.9 or 6.11?" --model mlx-community/DeepSeek-R1-3bit
Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 52560.20it/s]
Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 56833.39it/s]

Best wishes

fengyy0111 avatar Mar 12 '25 07:03 fengyy0111