Pod Init RDMA failed!Invalid RDMA endpoint: Fall back to TCP

Open ying2025 opened this issue 9 months ago • 8 comments

🚀 Feature Description and Motivation

I ran the kvcache test demo, and RDMA initialization failed.
Env: A100
Demo: https://github.com/vllm-project/aibrix/tree/main/samples/kvcache
Image: aibrix-vineyardd:20241120

Use Case

Log output, followed by the container command:

I20250324 02:24:00.960508     7 etcd_launcher.cc:166] The etcd endpoint http://deepseek-coder-7b-kvcache-etcd-service:2379 is connected
I20250324 02:24:00.981451     7 malloc.cc:88] Use memfd_create(2) for shared memory
I20250324 02:24:00.981554     7 usage.h:663] No spill path set, spill has been disabled ...
I20250324 02:24:00.995638     7 etcd_meta_service.cc:360] start background etcd watch, since 1
I20250324 02:24:01.000780     7 meta_service.cc:577] Decide to set rank as 0
I20250324 02:24:01.025059     7 meta_service.cc:1195] Instance join: 0
I20250324 02:24:01.030133     7 ipc_server.cc:137] Vineyard will listen on "/var/run/vineyard.sock" for IPC
I20250324 02:24:01.030174     7 rpc_server.cc:105] Vineyard will listen on 0.0.0.0:9600 for RPC
I20250324 02:24:01.030190     7 rpc_server.cc:111] Init RDMA failed!Invalid RDMA endpoint:  Fall back to TCP.
    - /bin/bash
    - -c
    - /usr/local/bin/vineyardd --sync_crds true --socket /var/run/vineyard.sock --size
      --stream_threshold 80 --etcd_cmd etcd --etcd_prefix /vineyard --etcd_endpoint
      http://deepseek-coder-7b-kvcache-etcd-service:2379
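One way to narrow this down is to check whether the pod can see any RDMA device at all. The sketch below is not part of aibrix or vineyard. It only assumes the usual Linux convention that RDMA devices are listed under /sys/class/infiniband, and the helper name list_rdma_devices is made up for illustration.

```python
from pathlib import Path


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return the names of RDMA devices listed under sysfs_root.

    An empty list means no device is listed there, or the
    directory does not exist in this container.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


if __name__ == "__main__":
    devices = list_rdma_devices()
    if devices:
        print("RDMA devices visible:", ", ".join(devices))
    else:
        print("No RDMA devices visible in this container.")
```

Running it inside the vineyardd container and on the node itself shows whether the device is missing only in the pod. It does not say anything about the empty endpoint string in the log, which would need to be checked separately.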

Proposed Solution

No response

ying2025 avatar Mar 24 '25 03:03 ying2025

  1. Make sure at least 32 GB of memory is available.
  2. After the server starts, wait a few minutes before customizing the model.

whl88 avatar Mar 17 '25 09:03 whl88

Take a look at this for reference: https://github.com/GuijiAI/HeyGem.ai/issues/66#issuecomment-2729250177

doowu avatar Mar 17 '25 12:03 doowu

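I ran into this too. In my case the cause was that the container's storage volume had not mounted successfully. Checking the file systems in WSL2: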

Filesystem                                Size  Used Avail Use% Mounted on
none                                      7.9G     0  7.9G   0% /usr/lib/modules/5.15.167.4-microsoft-standard-WSL2
none                                      7.9G  4.0K  7.9G   1% /mnt/wsl
drivers                                   200G  154G   47G  77% /usr/lib/wsl/drivers
/dev/sdc                                  251G  1.7G  237G   1% /
none                                      7.9G   36K  7.9G   1% /mnt/wslg
none                                      7.9G     0  7.9G   0% /usr/lib/wsl/lib
rootfs                                    7.9G  2.4M  7.8G   1% /init
none                                      7.9G  580K  7.9G   1% /run
none                                      7.9G     0  7.9G   0% /run/lock
none                                      7.9G     0  7.9G   0% /run/shm
tmpfs                                     4.0M     0  4.0M   0% /sys/fs/cgroup
none                                      7.9G   64K  7.9G   1% /mnt/wslg/versions.txt
none                                      7.9G   64K  7.9G   1% /mnt/wslg/doc
C:\                                       200G  154G   47G  77% /mnt/c
tmpfs                                     1.6G   16K  1.6G   1% /run/user/1000
tmpfs                                     1.6G   16K  1.6G   1% /run/user/0
none                                      7.9G  580K  7.9G   1% /mnt/wsl/docker-desktop/shared-sockets/host-services
/dev/sdd                                 1007G   57M  956G   1% /mnt/wsl/docker-desktop/docker-desktop-user-distro
/dev/loop0                                482M  482M     0 100% /mnt/wsl/docker-desktop/cli-tools
C:\Program Files\Docker\Docker\resources  200G  154G   47G  77% /Docker/host
d:                                        200G  145M  200G   1% /mnt/d

The Windows D: drive is mounted at /mnt/d.

Change the volumes field in docker-compose.yml:

services:
  heygem-tts:
    ...
    volumes:
      - /mnt/d/heygem_data/voice/data:/code/data
    ...
  heygem-f2f:
    ...
    volumes:
      - /mnt/d/heygem_data/face2face:/code/data
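As a quick sanity check, something like the following can confirm from inside WSL2 that the host-side paths are really there. This is only a sketch: the two paths are the ones from the volumes entries in this comment, and the helper name missing_dirs is made up for illustration.

```python
import os

# Host-side paths taken from the volumes entries in this comment.
HOST_DIRS = [
    "/mnt/d/heygem_data/voice/data",
    "/mnt/d/heygem_data/face2face",
]


def missing_dirs(paths):
    """Return the paths that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]


if __name__ == "__main__":
    missing = missing_dirs(HOST_DIRS)
    for path in missing:
        print("missing:", path)
    if not missing:
        print("all host directories exist")
```

If a path is reported as missing, check that the D: drive actually shows up under /mnt/d in the file system listing before restarting the containers.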

SingleLC avatar Mar 18 '25 02:03 SingleLC

Try granting the Everyone group permissions on the heygem_data directory on the D: drive in Windows.

Image

jason-ji-227 avatar Mar 18 '25 04:03 jason-ji-227

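Here is a fix: in Docker Desktop, go to Settings → Resources → Network and check the box next to "Enable host networking". Otherwise, services running inside the virtual machine cannot be reached from outside. Apply the change, and preferably restart afterwards. Hope this solves it; that is exactly what fixed it for me.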
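To see whether the change took effect, you can test from the host whether the service port accepts TCP connections. A minimal sketch, assuming the service listens on localhost; the port 8383 is only a placeholder (this thread does not state the port), and the helper name port_is_reachable is made up for illustration.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 8383 is a placeholder; substitute the port your service publishes.
    print(port_is_reachable("127.0.0.1", 8383))
```

A False result before the setting change and True after it would suggest host networking was the problem.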

gaonaijin avatar Mar 18 '25 09:03 gaonaijin

This does indeed solve the problem. Thanks a lot!

joytsay avatar Mar 19 '25 06:03 joytsay

I tried everything you all suggested, and it still throws an error.

Image

youzipp2025 avatar Mar 20 '25 06:03 youzipp2025

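Did you follow the steps? You need to turn on host networking. If it really does not work, you can try Liu Yue's all-in-one package, but the same setting is needed there too; otherwise the application ports inside Docker cannot be reached from the physical host machine. HeyGem digital human one-click launch image bundle: https://pan.quark.cn/s/b5e86c41935f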

------------------ Original email ------------------ From: "GuijiAI/HeyGem.ai"; Sent: Thursday, March 20, 2025, 2:41 PM; Subject: Re: [GuijiAI/HeyGem.ai] Has anyone solved this? Error: Error invoking remote method 'model/addModel': TypeError: SQLite3 can only bind numbers, strings, bigints, buffers, and null (Issue #236)

gaonaijin avatar Mar 20 '25 06:03 gaonaijin