macOS cluster: only the first machine loads the model into memory
When running exo on eight Mac minis (IP addresses 10.25.0.1–10.25.0.8), only the first machine (10.25.0.1) loaded the model into memory, while the other machines showed no change. During configuration, the environment was set up with the install.sh script, and a symbolic link was created from the weights download path to shared storage:

ln -s /Volumes/long990max/exo_data ~/.cache/exo

The path /Volumes/long990max/exo_data is shared across the entire Mac mini cluster over the Thunderbolt network bridge using the Samba protocol.
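As a first sanity check, it may help to confirm on every node that the SMB share is actually mounted and that the symlink resolves. A minimal sketch (the ssh loop is an assumption; the paths are the ones above):

```bash
# Assumes each mini is reachable over ssh at its bridge address
for host in 10.25.0.{1..8}; do
  echo "== $host =="
  ssh "$host" 'mount | grep smbfs'                       # SMB share actually mounted?
  ssh "$host" 'readlink ~/.cache/exo'                    # symlink points at the share?
  ssh "$host" 'ls /Volumes/long990max/exo_data | head'   # weights visible over the share?
done
```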
Hi, please correct me if I got it wrong. Have you already tested the existing SMB connection with your Mac cluster? Or could you pick one small model and test a dual-Mac cluster, rather than all 8 Macs as you mentioned?
I use 8 Macs.
It shows this:
The Mac minis (10.25.0.2–10.25.0.7) cannot access the internet, so they show:
"~/project
/exo/.venv/lib/python3.12/
site-packages/aiohttp/conn
ector.py", line 1341, in
_create_direct_connection
raise
ClientConnectorDNSError(re
q.connection_key, exc)
from exc
aiohttp.client_exceptions.
ClientConnectorDNSError:
Cannot connect to host
huggingface.co:443
ssl:default [nodename nor
servname provided, or not
known]
Download error on attempt
12/30 for
repo_id='mlx-community/Dee
pSeek-R1-4bit'
revision='main'
path='model.safetensors.in
dex.json'
target_dir=PosixPath('/var
/folders/r_/tyjhn3z554dbdz
sllqj69kyh0000gn/T/exo/mlx
-community--DeepSeek-R1-4b
it')
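A quick way to confirm this is a plain DNS failure is to test name resolution and HTTPS reachability directly from one of the offline minis:

```bash
# Run on one of the 10.25.0.2-7 nodes; both commands should fail if the node has no internet route
nslookup huggingface.co
curl -sI --max-time 10 https://huggingface.co | head -n 1
```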
Indeed, I believe your Mac cluster has enough capability to run the q4 MLX version (by the way, I just watched a real instance where one Mac with an M4 Max (128 GB RAM) ran DeepSeek R1 Dynamic 1.58-bit with llama.cpp). It's just a connection issue between your first Mac and the other 7, and I am not sure whether your Thunderbolt 5 configuration using the Samba protocol is built correctly. Maybe just test two of your 8 Macs with a Thunderbolt 5 bridge running 4-bit DeepSeek 32B, or test them directly over the wireless network (in which case you have to install exo and DeepSeek separately on each Mac) to verify the connection.
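For the two-node test, something like this could work; the flags are the same ones used elsewhere in this thread, and the node IDs are illustrative:

```bash
# On the first mini (Thunderbolt bridge IP assumed to be 10.25.0.1)
exo --node-id=mini1 --node-host=10.25.0.1 --discovery-module=udp

# On the second mini
exo --node-id=mini2 --node-host=10.25.0.2 --discovery-module=udp
```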
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
zsh: abort exo --node-id=$NODE_ID --node-host=$CURRENT_HOST --discovery-module=udp
After enabling networking (Wi-Fi) on the remaining Mac minis and using shared storage, I found that some machines successfully loaded the weights into memory. Unfortunately, nodes 4 and 6 hit the above error.
I have successfully run a distributed inference model on exo over the Thunderbolt 5 bridge, although it wasn't the 4-bit DeepSeek 32B model; I can't quite remember which model it was. However, I do have some doubts about my Thunderbolt cables: I bought them from three different brands, and I'm not sure whether that could cause issues, even though they are all Thunderbolt 5 cables.
Sounds good! Bro, what speed are you getting now? Isn't running the 4-bit MLX version of R1 on 8 machines a bit of a waste? I feel like you could run the native 671B.
FP8 isn't supported; MLX has no acceleration for it.
There isn't enough memory for the native model, once you account for the KV cache during conversation.
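Rough arithmetic backs this up (the 64 GB per mini is an assumption; the thread never states the RAM size): 671B parameters at 8 bits is about 671 GB of weights alone, already above 8 × 64 GB = 512 GB, before any KV cache; at 4 bits the weights drop to roughly 335 GB.

```bash
# Back-of-the-envelope memory check; all figures are assumptions, not measurements
PARAMS_B=671                 # DeepSeek R1 parameter count, in billions
CLUSTER_GB=$((8 * 64))       # assumed: 8 Mac minis x 64 GB unified memory = 512 GB
echo "fp8  weights: ~$((PARAMS_B)) GB vs $CLUSTER_GB GB total"      # does not fit, before KV cache
echo "4bit weights: ~$((PARAMS_B / 2)) GB vs $CLUSTER_GB GB total"  # ~335 GB, leaves room for KV cache
```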
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
zsh: abort exo --node-id=$NODE_ID --node-host=$CURRENT_HOST --discovery-module=udp
——————
I do not know why it exits with this error.
Could the MLX library versions be inconsistent?
It's all the latest code, so the MLX versions should be identical. In any case, on every run two machines hit this error and then exit. There seems to be a hash check at runtime: the weights must be downloaded from Hugging Face; downloads from other mirrors won't run and report an MD5 checksum mismatch.
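If the suspicion is a checksum mismatch, one way to narrow it down is to hash the same file from every node and compare; since the storage is shared over SMB, differing digests would point at the SMB layer rather than the download itself. A sketch, reusing the 3-bit model path from later in this thread (adjust the directory to the model in use):

```bash
# Every node should print the same digest for the same file
for host in 10.25.0.{1..8}; do
  ssh "$host" 'shasum -a 256 /Volumes/long990max/exo_data/downloads/mlx-community--DeepSeek-R1-3bit/model.safetensors.index.json'
done
```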
Launching with mlx.launch doesn't trigger this error, but there are other errors. mpirun also works normally. Everything loads into memory fine, and the shared storage has no problems at all; it's just that inference throws other errors.
The MLX framework has a lot of bugs.
mlx.launch \
  --hostfile /Volumes/long990max/hosts.json \
  --backend mpi \
  --mpi-arg "--mca btl tcp,self --mca btl_tcp_if_include 10.25.0.0/24 --mca oob_tcp_if_include 10.25.0.0/24 --mca oob_tcp_disable_family ipv6 --mca btl_tcp_links 2 --mca plm_base_verbose 100 --mca btl_base_verbose 100" \
  /Volumes/long990max/pipeline_generate.py \
  --prompt "What number is larger 6.9 or 6.11?" \
  --max-tokens 64 \
  --model /Volumes/long990max/exo_data/downloads/mlx-community--DeepSeek-R1-3bit \
  --verbose
——————
Launching this way works without problems, but it's extremely slow. This follows the MLX framework's reference setup; mpirun has a lot of strange parsing issues.
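For reference, I believe the hostfile that mlx.launch consumes is a JSON list of hosts, roughly like the following; the exact schema here is an assumption based on the MLX distributed docs, so verify it against your MLX version:

```bash
# Hypothetical hosts.json for mlx.launch (schema assumed; check the MLX distributed docs)
cat > /Volumes/long990max/hosts.json <<'EOF'
[
  {"ssh": "10.25.0.1", "ips": ["10.25.0.1"]},
  {"ssh": "10.25.0.2", "ips": ["10.25.0.2"]}
]
EOF
```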
By the way, you downloaded the model from Hugging Face and then put it under exo's directory, right?
It's best to let the script download it itself; don't use a mirror source.
With the mlx.launch command above, the weights load into memory without issue.
Note: turn off IPv6 and use only the Thunderbolt bridge.
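On macOS this can be done per network service with networksetup; "Thunderbolt Bridge" is the default service name, but it's worth verifying with the list command first:

```bash
networksetup -listallnetworkservices             # confirm the exact service name
sudo networksetup -setv6off "Thunderbolt Bridge" # disable IPv6 on the bridge
```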
The thing is, exo seems to add cache files to the model download directory from time to time, so the model size never matches the mirror's; once the sizes differ, intermittent file-removal errors start.
exo downloads the model in shards to each machine.