
kvcache cluster cannot start

Open · libin817927 opened this issue · 9 comments

🐛 Describe the bug

Image

The KV cache cluster still hasn't started successfully after 10 minutes. The log is as follows:

[2025-05-28 09:41:06.824] [infini] [info] ServerConfig: service_port=12345, manage_port=8088, log_level='info', dev_name='mlx5_1', ib_port=1, link_type='Ethernet', prealloc_size=45, minimal_allocate_size=64, auto_increase=False, evict_min_threshold=0.6, evict_max_threshold=0.8, evict_interval=5, hint_gid_index=7
[2025-05-28 09:41:06.824] [infini] [info] open rdma device mlx5_1, link_type Ethernet, hint_gid_index 7,
[2025-05-28 09:41:06.825] [infini] [info] Interrupt signal (11) received.
[2025-05-28 09:41:06.825] [infini] [error] [utils.cpp:99] Stacktrace:
 0# signal_handler(int) in /usr/local/lib/python3.10/dist-packages/infinistore/_infinistore.cpython-310-x86_64-linux-gnu.so
 1# 0x00007F3480D41520 in /lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007F3480EB10BC in /lib/x86_64-linux-gnu/libc.so.6
 3# decltype ({parm#1}(0)) fmt::v11::basic_format_arg<fmt::v11::context>::visit<fmt::v11::detail::default_arg_formatter >(fmt::v11::detail::default_arg_formatter&&) const in /usr/local/lib/python3.10/dist-packages/infinistore/_infinistore.cpython-310-x86_64-linux-gnu.so
 4# char const* fmt::v11::detail::parse_replacement_field<char, fmt::v11::detail::format_handler&>(char const*, char const*, fmt::v11::detail::format_handler&) in /usr/local/lib/python3.10/dist-packages/infinistore/_infinistore.cpython-310-x86_64-linux-gnu.so
 5# fmt::v11::detail::vformat_to(fmt::v11::detail::buffer&, fmt::v11::basic_string_view, fmt::v11::basic_format_args<fmt::v11::context>, fmt::v11::detail::locale_ref) in /usr/local/lib/python3.10/dist-packages/infinistore/_infinistore.cpython-310-x86_64-linux-gnu.so
 6# open_rdma_device(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, rdma_device*) in /usr/local/lib/python3.10/dist-packages/infinistore/_infinistore.cpython-310-x86_64-linux-gnu.so
 7# register_server(unsigned long, ServerConfig) in /usr/local/lib/python3.10/dist-packages/infinistore/_infinistore.cpython-310-x86_64-linux-gnu.so
 8# 0x00007F34808647C0 in /usr/local/lib/python3.10/dist-packages/infinistore/_infinistore.cpython-310-x86_64-linux-gnu.so
 9# 0x00007F3480858584 in /usr/local/lib/python3.10/dist-packages/infinistore/_infinistore.cpython-310-x86_64-linux-gnu.so
10# 0x0000556F14DADE12 in /usr/bin/python3
11# _PyObject_MakeTpCall in /usr/bin/python3

Steps to Reproduce

Deploy the distributed KV cache cluster with the following YAML configuration:

apiVersion: orchestration.aibrix.ai/v1alpha1
kind: KVCache
metadata:
  name: kvcache-cluster
  namespace: default
  annotations:
    kvcache.orchestration.aibrix.ai/backend: infinistore
    infinistore.kvcache.orchestration.aibrix.ai/link-type: "Ethernet"
    infinistore.kvcache.orchestration.aibrix.ai/hint-gid-index: "7"
spec:
  metadata:
    redis:
      runtime:
        image: aibrix-cn-beijing.cr.volces.com/aibrix/redis:7.4.2
        replicas: 1
        resources:
          requests:
            cpu: 1000m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 1Gi
  service:
    type: ClusterIP
    ports:
      - name: service
        port: 12345
        targetPort: 12345
        protocol: TCP
      - name: admin
        port: 8088
        targetPort: 8088
        protocol: TCP
  watcher:
    image: aibrix-cn-beijing.cr.volces.com/aibrix/kvcache-watcher:v0.3.0
    imagePullPolicy: Always
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"
  cache:
    replicas: 1
    image: aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:v0.2.42-20250506
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: "10000m"
        memory: "120Gi"
        vke.volcengine.com/rdma: "1"
      limits:
        cpu: "10000m"
        memory: "120Gi"
        vke.volcengine.com/rdma: "1"
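
To apply the manifest and collect the log shown above, something like the following should work (a minimal sketch; the manifest filename, pod name, and container name are assumptions, not taken from this report):

# Apply the KVCache manifest above (filename is an assumption)
kubectl apply -f kvcache-cluster.yaml

# Watch the pods created for the KVCache object come up
kubectl get pods -n default -w

# Once the cache pod exists, pull the InfiniStore log
# (pod and container names are assumptions; list them with `kubectl get pods` / `kubectl describe pod`)
kubectl logs -n default <kvcache-cluster-pod> -c <infinistore-container>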

Expected behavior

The KV cache cluster starts successfully.

Environment

AIBrix version 0.3.0

libin817927 · May 28, 2025

@libin817927 what's your environment? Are you running on Volcano Engine? If so, please share the node image details. If not, please let me know how RDMA resources are allocated in your cluster.

Jeffwan · May 28, 2025


We have deployed our own Kubernetes cluster, which includes one node with 8 L4 GPUs (the GPUs communicate via PCIe) and five CPU nodes. The GPU network topology is shown below. Please let us know the hardware requirements for deploying a KV cache cluster, particularly regarding network interface cards (NICs).

Image

libin817927 · May 29, 2025

@libin817927 NIC allocation differs between environments. Could you share how an mlx5 NIC is allocated to a pod in your environment? Then I can help write a sample for your case.

Jeffwan · May 29, 2025
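
For reference, this is roughly the kind of snippet being asked for: a minimal sketch of exposing an mlx5 NIC to a pod through an RDMA device plugin. The k8s-rdma-shared-dev-plugin and the resource name rdma/rdma_shared_device_a are assumptions; the actual resource name depends on the plugin's configuration (on Volcengine VKE it is vke.volcengine.com/rdma, as in the manifest above).

apiVersion: v1
kind: Pod
metadata:
  name: rdma-test
spec:
  containers:
    - name: test
      image: ubuntu:22.04                 # placeholder image; any image with rdma-core tools works
      command: ["sleep", "infinity"]
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]               # allows pinning memory for RDMA registration
      resources:
        limits:
          rdma/rdma_shared_device_a: 1    # hypothetical device-plugin resource name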


Could you please use ibv_devinfo -e and ibv_devinfo -d mlx5_bond_0 to show more details? BTW, we don't use RDMA bonding in our environment, so we haven't tested whether InfiniStore supports bonding yet.

DwyaneShi · May 29, 2025
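
For anyone following along, these commands are one way to inspect a bonded RDMA setup (a sketch, assuming Mellanox OFED tools are installed and that bond0 is the netdev backing mlx5_bond_0):

ibv_devinfo -d mlx5_bond_0 -v    # full port and GID details for the bonded device
ibdev2netdev                     # maps RDMA devices to their network interfaces
cat /proc/net/bonding/bond0      # bond mode and slave interfaces (netdev name is an assumption)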


hca_id: mlx5_bond_0
        transport:                      InfiniBand (0)
        fw_ver:                         26.40.1000
        node_guid:                      b83f:d203:00db:a8dc
        sys_image_guid:                 b83f:d203:00db:a8dc
        vendor_id:                      0x02c9
        vendor_part_id:                 4127
        hw_ver:                         0x0
        board_id:                       MT_0000000531
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

libin817927 · May 29, 2025

hint_gid_index is environment-specific. Could you run ibv_devinfo -v so we can see which GID should be used in your environment?

Jeffwan · May 30, 2025
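
As a rough guide to answering this, and assuming the device from the earlier output (mlx5_bond_0, port 1), the GID table can be inspected like this; hint_gid_index typically needs to point at a RoCE v2 entry backed by the NIC's IPv4 address:

# show_gids ships with Mellanox OFED and prints index, GID, RoCE version and netdev
show_gids mlx5_bond_0

# Without show_gids, the same data is in sysfs (port 1 assumed)
grep . /sys/class/infiniband/mlx5_bond_0/ports/1/gid_attrs/types/* 2>/dev/null
grep . /sys/class/infiniband/mlx5_bond_0/ports/1/gids/* 2>/dev/null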

@thesues have you tested the mlx5_bond_0 setup?

Jeffwan · May 30, 2025


No, we haven't; we thought a single RDMA device would be good enough.

thesues · May 31, 2025