
macOS cluster only loads memory on the first machine

hotwa opened this issue 10 months ago • 22 comments

When running exo on eight Mac minis (IP addresses 10.25.0.1–10.25.0.8), only the first machine (10.25.0.1) loaded the weights into memory; the other machines showed no change. During setup, the environment was configured with the install.sh script, and a symbolic link to the shared weights directory was created with:

ln -s /Volumes/long990max/exo_data ~/.cache/exo

The path /Volumes/long990max/exo_data is shared across the whole Mac mini cluster over the Thunderbolt network bridge via the Samba protocol.
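For reference, a minimal check (not part of exo) that can be run on each node to confirm the symlink actually resolves to the SMB mount and that the mount is readable; the paths are the ones from this issue:

```python
# Verify that ~/.cache/exo resolves to the shared SMB volume and that
# the volume is actually readable from this node.
import os

cache = os.path.expanduser("~/.cache/exo")
target = "/Volumes/long990max/exo_data"

print("resolves to:", os.path.realpath(cache))
assert os.path.realpath(cache) == os.path.realpath(target), "symlink points elsewhere"

# A listing that fails or hangs here points at the SMB mount, not exo.
print("sample entries:", os.listdir(target)[:5])
```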

hotwa avatar Feb 25 '25 06:02 hotwa

Hi, please correct me if I've got this wrong: have you already tested the existing SMB connection with your Mac cluster? Or could you pick one small model and test a dual-Mac cluster rather than all 8 Macs at once?

xuanzhec avatar Feb 25 '25 07:02 xuanzhec

I'm using 8 Macs.

hotwa avatar Feb 25 '25 07:02 hotwa

It shows like this:

[screenshot]

hotwa avatar Feb 25 '25 07:02 hotwa

The Mac minis (10.25.0.2–10.25.0.7) cannot access the internet, so they show:

"~/project
/exo/.venv/lib/python3.12/
site-packages/aiohttp/conn
ector.py", line 1341, in 
_create_direct_connection
    raise 
ClientConnectorDNSError(re
q.connection_key, exc) 
from exc
aiohttp.client_exceptions.
ClientConnectorDNSError: 
Cannot connect to host 
huggingface.co:443 
ssl:default [nodename nor 
servname provided, or not 
known]
Download error on attempt 
12/30 for 
repo_id='mlx-community/Dee
pSeek-R1-4bit' 
revision='main' 
path='model.safetensors.in
dex.json' 
target_dir=PosixPath('/var
/folders/r_/tyjhn3z554dbdz
sllqj69kyh0000gn/T/exo/mlx
-community--DeepSeek-R1-4b
it')
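The ClientConnectorDNSError above is a plain name-resolution failure, not an exo bug; a minimal, exo-independent check that could be run on one of the offline minis:

```python
# Check whether this node can resolve huggingface.co at all; the
# download in the traceback above fails before any HTTP request is made.
import socket

try:
    infos = socket.getaddrinfo("huggingface.co", 443)
    print("resolved:", sorted({info[4][0] for info in infos}))
except socket.gaierror as exc:
    # "[nodename nor servname provided, or not known]" surfaces here
    print("DNS failure:", exc)
```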

hotwa avatar Feb 25 '25 07:02 hotwa

I do believe your Mac cluster has enough capability to run the q4 MLX version (by the way, I just saw a real instance where a single Mac with an M4 Max and 128 GB RAM ran DeepSeek R1 Dynamic 1.58-bit with llama.cpp). It's just a connection issue between your first Mac and the other 7, and I'm not sure your Thunderbolt 5 configuration using the Samba protocol is set up correctly. Maybe test just two of your 8 Macs over a Thunderbolt 5 bridge running 4-bit DeepSeek 32B, or test them directly over the wireless network (which means installing exo and the model separately on each Mac) to verify the connection.

xuanzhec avatar Feb 25 '25 07:02 xuanzhec

libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
zsh: abort exo --node-id=$NODE_ID --node-host=$CURRENT_HOST --discovery-module=udp

After enabling networking (Wi-Fi) on the remaining Mac minis and using the shared storage, I found that some machines successfully loaded the weights into memory. Unfortunately, nodes 4 and 6 hit the error above.

hotwa avatar Feb 25 '25 08:02 hotwa

> I do believe your Mac cluster has enough capability to run the q4 MLX version … Maybe test just two of your 8 Macs over a Thunderbolt 5 bridge running 4-bit DeepSeek 32B, or test them directly over the wireless network to verify the connection.

I have successfully run a distributed inference model on exo over the Thunderbolt 5 bridge, although it wasn't the 4-bit DeepSeek 32B model; I can't quite remember which model it was. However, I do have some doubts about my Thunderbolt cables: the ones I bought are from three different brands, and I'm not sure whether that could cause issues, even though they are all Thunderbolt 5 cables.

hotwa avatar Feb 25 '25 08:02 hotwa

Sounds good! What speed are you getting now? Running the 4-bit MLX version of R1 on 8 machines feels a bit wasteful; I think you could run the native 671B.

xuanzhec avatar Feb 25 '25 08:02 xuanzhec

> Sounds good! What speed are you getting now? Running the 4-bit MLX version of R1 on 8 machines feels a bit wasteful; I think you could run the native 671B.

MLX doesn't support fp8, so there's no speedup.

hotwa avatar Feb 25 '25 08:02 hotwa

There isn't enough memory for the native model, once you account for the KV cache during conversations.

hotwa avatar Feb 25 '25 08:02 hotwa

libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout) zsh: abort exo --node-id=$NODE_ID --node-host=$CURRENT_HOST --discovery-module=udp —————— I don't know why it exits with this error.

hotwa avatar Feb 25 '25 08:02 hotwa

> libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error … I don't know why it exits with this error.

Maybe the mlx library versions are inconsistent across machines?
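For reference, a minimal way to compare mlx versions across nodes, assuming mlx is importable in the venv exo uses on each machine:

```python
# Print the mlx version on this node; run it on every machine and compare.
import mlx.core as mx

print(mx.__version__)
```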

xuanzhec avatar Feb 25 '25 08:02 xuanzhec

All the machines are on the latest code, so the mlx versions should be identical. In any case, two machines hit this error and exit on every run. It also seems exo does a hash check at runtime: the weights have to be downloaded from Hugging Face; downloads from other mirrors won't run, failing with an MD5 checksum mismatch.

hotwa avatar Feb 25 '25 08:02 hotwa

Launching with mlx.launch doesn't produce this error, but there are other errors; mpirun also works normally. Everything loads into memory fine and the shared storage has no problems at all. It's only inference that throws other errors.

hotwa avatar Feb 25 '25 08:02 hotwa

The mlx framework has a lot of bugs.

hotwa avatar Feb 25 '25 08:02 hotwa

mlx.launch \
  --hostfile /Volumes/long990max/hosts.json \
  --backend mpi \
  --mpi-arg "--mca btl tcp,self --mca btl_tcp_if_include 10.25.0.0/24 --mca oob_tcp_if_include 10.25.0.0/24 --mca oob_tcp_disable_family ipv6 --mca btl_tcp_links 2 --mca plm_base_verbose 100 --mca btl_base_verbose 100" \
  /Volumes/long990max/pipeline_generate.py \
  --prompt "What number is larger 6.9 or 6.11?" \
  --max-tokens 64 \
  --model /Volumes/long990max/exo_data/downloads/mlx-community--DeepSeek-R1-3bit \
  --verbose

—————— Launching this way works without problems, but it's extremely slow. This follows the MLX framework's distributed-run approach; mpirun has a lot of strange parsing issues.
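The hosts.json referenced here isn't shown in the thread; one plausible shape, following the hostfile format described in the MLX distributed docs (hostnames and IPs below are placeholders, and the exact schema should be checked against the docs for your mlx version):

```json
[
    {"ssh": "mac-mini-1", "ips": ["10.25.0.1"]},
    {"ssh": "mac-mini-2", "ips": ["10.25.0.2"]}
]
```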

hotwa avatar Feb 25 '25 08:02 hotwa

> All the machines are on the latest code, so the mlx versions should be identical. In any case, two machines hit this error and exit on every run. It also seems exo does a hash check at runtime: the weights have to be downloaded from Hugging Face; downloads from other mirrors won't run, failing with an MD5 checksum mismatch.

By the way, you downloaded the model from Hugging Face and then put it under exo's directory, right?

xuanzhec avatar Feb 25 '25 08:02 xuanzhec

It's best to let the script download it itself; don't use a mirror source.

hotwa avatar Feb 25 '25 08:02 hotwa

> mlx.launch --hostfile /Volumes/long990max/hosts.json --backend mpi … —————— Launching this way works without problems, but it's extremely slow.

This way the weights load into memory smoothly.

hotwa avatar Feb 25 '25 08:02 hotwa

Note: disable IPv6 and use only the Thunderbolt bridge.

hotwa avatar Feb 25 '25 08:02 hotwa

The thing is, exo seems to occasionally add cache files to the model download directory, so the model size never matches the mirror; once the sizes differ, it starts intermittently erroring and removing files.
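For reference, a minimal way (not exo functionality) to list every file and its size under a download directory, so any stray cache files can be diffed against the mirror copy; the path is the one from this thread:

```python
# List every file and its size under the model download directory,
# to compare against the mirror and spot files exo has added.
from pathlib import Path

root = Path("/Volumes/long990max/exo_data/downloads/mlx-community--DeepSeek-R1-3bit")
for p in sorted(root.rglob("*")):
    if p.is_file():
        print(f"{p.stat().st_size:>14}  {p.relative_to(root)}")
```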

xuanzhec avatar Feb 25 '25 08:02 xuanzhec

> The thing is, exo seems to occasionally add cache files to the model download directory, so the model size never matches the mirror; once the sizes differ, it starts intermittently erroring and removing files.

exo downloads the model in shards to each machine.

hotwa avatar Feb 25 '25 11:02 hotwa