Pod Init RDMA failed!Invalid RDMA endpoint: Fall back to TCP
🚀 Feature Description and Motivation
I ran the kvcache test demo and RDMA initialization failed.
Env: A100
Demo: https://github.com/vllm-project/aibrix/tree/main/samples/kvcache
Image: aibrix-vineyardd:20241120
Use Case
Logs and container command:
I20250324 02:24:00.960508 7 etcd_launcher.cc:166] The etcd endpoint http://deepseek-coder-7b-kvcache-etcd-service:2379 is connected
I20250324 02:24:00.981451 7 malloc.cc:88] Use memfd_create(2) for shared memory
I20250324 02:24:00.981554 7 usage.h:663] No spill path set, spill has been disabled ...
I20250324 02:24:00.995638 7 etcd_meta_service.cc:360] start background etcd watch, since 1
I20250324 02:24:01.000780 7 meta_service.cc:577] Decide to set rank as 0
I20250324 02:24:01.025059 7 meta_service.cc:1195] Instance join: 0
I20250324 02:24:01.030133 7 ipc_server.cc:137] Vineyard will listen on "/var/run/vineyard.sock" for IPC
I20250324 02:24:01.030174 7 rpc_server.cc:105] Vineyard will listen on 0.0.0.0:9600 for RPC
I20250324 02:24:01.030190 7 rpc_server.cc:111] Init RDMA failed!Invalid RDMA endpoint: Fall back to TCP.
- /bin/bash
- -c
- /usr/local/bin/vineyardd --sync_crds true --socket /var/run/vineyard.sock --size
--stream_threshold 80 --etcd_cmd etcd --etcd_prefix /vineyard --etcd_endpoint
http://deepseek-coder-7b-kvcache-etcd-service:2379
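Since the fallback is logged at info level (the line starts with `I`), it is easy to miss when scanning pod logs for warnings or errors. A minimal sketch for spotting it, assuming the glog-style prefix seen in the output above (severity letter, date, time, thread id, `file:line]`). The names `parse_line` and `find_rdma_fallback` are made up for this example and are not part of vineyard or aibrix:

```python
import re
from typing import NamedTuple, Optional

# Prefix as it appears in the logs above, e.g.
# "I20250324 02:24:01.030190 7 rpc_server.cc:111] <message>"
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>\d+) (?P<source>[^\]]+)\] (?P<message>.*)$"
)


class LogRecord(NamedTuple):
    severity: str  # I, W, E or F
    source: str    # e.g. "rpc_server.cc:111"
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one log line; return None for lines without the glog prefix."""
    match = GLOG_LINE.match(line.strip())
    if match is None:
        return None
    return LogRecord(match["severity"], match["source"], match["message"])


def find_rdma_fallback(log_text: str) -> Optional[LogRecord]:
    """Return the first record reporting the RDMA failure, if any."""
    for line in log_text.splitlines():
        record = parse_line(line)
        if record is not None and "Init RDMA failed" in record.message:
            return record
    return None
```

Feeding it the output of `kubectl logs <pod>` returns the `rpc_server.cc:111` record when the fallback happened and `None` otherwise.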
Proposed Solution
No response
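One thing that may be worth ruling out first is whether the container can see any RDMA device at all. On Linux, RDMA devices show up as entries under `/sys/class/infiniband`. The sketch below only lists that directory; it does not explain why vineyardd rejected the endpoint, and the helper name `list_rdma_devices` is hypothetical:

```python
from pathlib import Path
from typing import List


def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> List[str]:
    """Return the RDMA device names visible under sysfs, or [] if none."""
    root = Path(sysfs_root)
    if not root.is_dir():
        # Directory is absent when no RDMA driver is loaded or the
        # devices are not exposed to this container.
        return []
    return sorted(entry.name for entry in root.iterdir())
```

Running `print(list_rdma_devices())` inside the vineyardd container and getting an empty list would suggest the devices are not exposed to the pod, in which case falling back to TCP is expected.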
- Memory should be at least 32 GB.
- After the server starts, wait a few minutes before customizing the model.
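To check the memory requirement from inside the Linux environment (for example the WSL2 distro or the container), one option is to read `MemTotal` from `/proc/meminfo`. This is a rough sketch; the function names are made up for the example, and the 32 threshold is taken from the advice above and treated as GiB here, which is an assumption:

```python
import re

MIN_MEMORY_GIB = 32  # threshold from the advice above, assumed to mean GiB


def mem_total_gib(meminfo_text: str) -> float:
    """Extract MemTotal (reported in kB) from /proc/meminfo text, in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found")
    return int(match.group(1)) / (1024 * 1024)


def has_enough_memory(meminfo_text: str, minimum_gib: float = MIN_MEMORY_GIB) -> bool:
    """True if the reported total memory meets the minimum."""
    return mem_total_gib(meminfo_text) >= minimum_gib
```

Usage: `has_enough_memory(open("/proc/meminfo").read())`. Note that `MemTotal` is normally a little lower than the installed RAM, so a machine with exactly 32 GB installed can fail a strict comparison; under WSL2 or Docker Desktop the value reflects the VM's memory limit, not the host's.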
Take a look at this for reference: https://github.com/GuijiAI/HeyGem.ai/issues/66#issuecomment-2729250177
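I ran into this too. In my case the cause was that the container's storage volume had not been mounted successfully. Here are the file systems as seen from WSL2: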
Filesystem Size Used Avail Use% Mounted on
none 7.9G 0 7.9G 0% /usr/lib/modules/5.15.167.4-microsoft-standard-WSL2
none 7.9G 4.0K 7.9G 1% /mnt/wsl
drivers 200G 154G 47G 77% /usr/lib/wsl/drivers
/dev/sdc 251G 1.7G 237G 1% /
none 7.9G 36K 7.9G 1% /mnt/wslg
none 7.9G 0 7.9G 0% /usr/lib/wsl/lib
rootfs 7.9G 2.4M 7.8G 1% /init
none 7.9G 580K 7.9G 1% /run
none 7.9G 0 7.9G 0% /run/lock
none 7.9G 0 7.9G 0% /run/shm
tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
none 7.9G 64K 7.9G 1% /mnt/wslg/versions.txt
none 7.9G 64K 7.9G 1% /mnt/wslg/doc
C:\ 200G 154G 47G 77% /mnt/c
tmpfs 1.6G 16K 1.6G 1% /run/user/1000
tmpfs 1.6G 16K 1.6G 1% /run/user/0
none 7.9G 580K 7.9G 1% /mnt/wsl/docker-desktop/shared-sockets/host-services
/dev/sdd 1007G 57M 956G 1% /mnt/wsl/docker-desktop/docker-desktop-user-distro
/dev/loop0 482M 482M 0 100% /mnt/wsl/docker-desktop/cli-tools
C:\Program Files\Docker\Docker\resources 200G 154G 47G 77% /Docker/host
d: 200G 145M 200G 1% /mnt/d
The Windows D: drive is mounted at /mnt/d.
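If you want to check this from a script instead of reading the `df` listing by eye, here is a small sketch that parses `df -h` output and looks up a mount point. It splits each row from the right because filesystem names can contain spaces (as in the `C:\Program Files\...` row); it assumes mount points themselves contain no spaces. The names `Mount` and `parse_df` are hypothetical:

```python
from typing import Dict, NamedTuple


class Mount(NamedTuple):
    filesystem: str
    size: str
    used: str
    avail: str
    use_percent: str


def parse_df(df_text: str) -> Dict[str, Mount]:
    """Map mount point -> Mount for each data row of `df -h` output."""
    mounts: Dict[str, Mount] = {}
    # Skip the header row ("Filesystem Size Used Avail Use% Mounted on").
    for line in df_text.strip().splitlines()[1:]:
        # Split from the right so spaces in the filesystem name survive.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue  # not a regular data row
        filesystem, size, used, avail, use_percent, mount_point = parts
        mounts[mount_point] = Mount(filesystem, size, used, avail, use_percent)
    return mounts
```

With the listing above, `"/mnt/d" in parse_df(output)` is true and `parse_df(output)["/mnt/d"].filesystem` is `d:`.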
Change the volumes field in docker-compose.yml:
services:
  heygem-tts:
    ...
    volumes:
      - /mnt/d/heygem_data/voice/data:/code/data
    ...
  heygem-f2f:
    ...
    volumes:
      - /mnt/d/heygem_data/face2face:/code/data
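Before starting the stack it can help to confirm that the host side of each bind mount actually exists. A minimal sketch, assuming short-syntax entries of the form `HOST:CONTAINER[:MODE]` with absolute POSIX host paths as in the snippet above (named volumes and Windows-style paths are not handled). The helper `missing_host_paths` is made up for this example:

```python
import os
from typing import List


def missing_host_paths(volumes: List[str]) -> List[str]:
    """Return the host paths from the volume specs that are not directories."""
    missing = []
    for spec in volumes:
        # Short syntax is HOST:CONTAINER[:MODE]; the host path comes first.
        host_path = spec.split(":", 1)[0]
        if not os.path.isdir(host_path):
            missing.append(host_path)
    return missing
```

For the compose file above you would call it with `["/mnt/d/heygem_data/voice/data:/code/data", "/mnt/d/heygem_data/face2face:/code/data"]`; an empty result means both directories are present on the WSL2 side.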
In Windows, try granting the "Everyone" group access to the heygem_data directory on the D: drive.
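Here is one approach: in Docker Desktop, go to Settings > Resources > Network and check the box next to Enable host networking. Otherwise, services inside the VM cannot be reached from outside. Apply the change, and preferably restart afterwards. Hope this solves it; that is what fixed it for me.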
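To confirm whether a service inside Docker is reachable from the physical machine after changing this setting, a plain TCP connect is usually enough. This is a sketch only; the thread does not say which ports the HeyGem services listen on, so the port in the usage note is a placeholder, and `port_is_reachable` is a hypothetical helper:

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

Usage: `port_is_reachable("127.0.0.1", 8080)`, replacing 8080 with the port published in your docker-compose.yml. A `False` result before enabling host networking and `True` afterwards would match what is described above.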
That really does solve the problem. Thank you very much!
I tried everything you all suggested, and it still reports an error.
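Did you follow the steps below? You need to turn on host networking. If it still does not work, you can try the all-in-one package by 刘悦 (Liu Yue). Even with that package you need this setting, otherwise the ports of the Docker applications cannot be reached from the physical machine. HeyGem digital human one-click startup image bundle: https://pan.quark.cn/s/b5e86c41935f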
naijin
(Sent by email in reply to youzipp2025's comment of Thursday, March 20, 2025, 2:41 PM on GuijiAI/HeyGem.ai Issue #236, "Has anyone solved this? Error: Error invoking remote method 'model/addModel': TypeError: SQLite3 can only bind numbers, strings, bigints, buffers, and null".)