kvcache cluster cannot start
🐛 Describe the bug
It still hasn't started successfully after 10 minutes. The log is as follows.
[2025-05-28 09:41:06.824] [infini] [info] ServerConfig: service_port=12345, manage_port=8088, log_level='info', dev_name='mlx5_1', ib_port=1, link_type='Ethernet', prealloc_size=45, minimal_allocate_size=64, auto_increase=False, evict_min_threshold=0.6, evict_max_threshold=0.8, evict_interval=5, hint_gid_index=7
[2025-05-28 09:41:06.824] [infini] [info] open rdma device mlx5_1, link_type Ethernet, hint_gid_index 7,
[2025-05-28 09:41:06.825] [infini] [info] Interrupt signal (11) received.
[2025-05-28 09:41:06.825] [infini] [error] [utils.cpp:99] Stacktrace:
0# signal_handler(int) in /usr/local/lib/python3.10/dist-packages/infinistore/_infinistore.cpython-310-x86_64-linux-gnu.so
1# 0x00007F3480D41520 in /lib/x86_64-linux-gnu/libc.so.6
2# 0x00007F3480EB10BC in /lib/x86_64-linux-gnu/libc.so.6
3# decltype ({parm#1}(0)) fmt::v11::basic_format_arg<fmt::v11::context>::visit<fmt::v11::detail::default_arg_formatter
Steps to Reproduce
Deploy the distributed KV cache cluster with the following YAML configuration:
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: KVCache
metadata:
  name: kvcache-cluster
  namespace: default
  annotations:
    kvcache.orchestration.aibrix.ai/backend: infinistore
    infinistore.kvcache.orchestration.aibrix.ai/link-type: "Ethernet"
    infinistore.kvcache.orchestration.aibrix.ai/hint-gid-index: "7"
spec:
  metadata:
    redis:
      runtime:
        image: aibrix-cn-beijing.cr.volces.com/aibrix/redis:7.4.2
        replicas: 1
        resources:
          requests:
            cpu: 1000m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 1Gi
  service:
    type: ClusterIP
    ports:
      - name: service
        port: 12345
        targetPort: 12345
        protocol: TCP
      - name: admin
        port: 8088
        targetPort: 8088
        protocol: TCP
  watcher:
    image: aibrix-cn-beijing.cr.volces.com/aibrix/kvcache-watcher:v0.3.0
    imagePullPolicy: Always
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"
  cache:
    replicas: 1
    image: aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:v0.2.42-20250506
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: "10000m"
        memory: "120Gi"
        vke.volcengine.com/rdma: "1"
      limits:
        cpu: "10000m"
        memory: "120Gi"
        vke.volcengine.com/rdma: "1"
Expected behavior
The KV cache cluster starts successfully.
Environment
AIBrix version 0.3.0
@libin817927 What's your environment? Are you running on Volcano Engine? If so, please share the node image details. If not, please let me know how you allocate RDMA resources in your cluster.
We have deployed our own Kubernetes cluster, which includes one node with 8 L4 GPUs (the GPUs communicate via PCIe) and five CPU nodes; the GPU network topology is attached. Please let us know the hardware requirements for deploying a KV cache cluster, particularly regarding the network interface cards (NICs).
@libin817927 NIC allocation differs between environments. Could you share how you allocate the mlx5 NIC to a pod in your environment? Then I can help write a sample for your case; a rough sketch is below for reference.
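For reference, here is a minimal sketch of how an mlx5 NIC is often handed to a pod when the cluster runs an RDMA device plugin (for example the Mellanox k8s-rdma-shared-dev-plugin). This is an assumption about a typical setup, not your cluster's actual configuration: the resource name rdma/hca_shared_devices_a is a placeholder, and the image is simply reused from the KVCache spec above.

# Hypothetical example only: assumes an RDMA device plugin is installed and
# exposes the NIC as the extended resource "rdma/hca_shared_devices_a".
# Replace the resource name with whatever your device plugin actually advertises.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-alloc-test
spec:
  containers:
    - name: infinistore
      image: aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:v0.2.42-20250506
      command: ["sleep", "infinity"]
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]   # usually required so libibverbs can pin/register memory
      resources:
        requests:
          rdma/hca_shared_devices_a: 1
        limits:
          rdma/hca_shared_devices_a: 1

If ibv_devinfo inside such a pod lists the device you intend to use, the same resource request could replace vke.volcengine.com/rdma in the cache.resources section of the KVCache manifest above.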
Could you please run ibv_devinfo -e and ibv_devinfo -d mlx5_bond_0 to show more details? BTW, we don't use RDMA bonding in our environment, so we haven't tested whether InfiniStore supports bonding yet.
hca_id: mlx5_bond_0
        transport:              InfiniBand (0)
        fw_ver:                 26.40.1000
        node_guid:              b83f:d203:00db:a8dc
        sys_image_guid:         b83f:d203:00db:a8dc
        vendor_id:              0x02c9
        vendor_part_id:         4127
        hw_ver:                 0x0
        board_id:               MT_0000000531
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     1024 (3)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet
hint_gid_index is environment-specific. Could you run ibv_devinfo -v so we can see which GID should be used in your environment?
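Once the correct index is identified from the ibv_devinfo -v output (typically the RoCE v2 entry bound to the interface's IP address), it can be set through the same annotation already used in the manifest from the reproduction steps. This is only a sketch; the value "3" is a placeholder, not a recommendation.

# Fragment of the KVCache manifest above.
# Replace "3" with the GID index your ibv_devinfo -v output actually reports.
metadata:
  annotations:
    kvcache.orchestration.aibrix.ai/backend: infinistore
    infinistore.kvcache.orchestration.aibrix.ai/link-type: "Ethernet"
    infinistore.kvcache.orchestration.aibrix.ai/hint-gid-index: "3"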
@thesues Have you tested the mlx5_bond_0 setup?