[Bug]: RDMA Device Misidentification in Container Environment

Open uniqueni opened this issue 1 month ago • 3 comments

Bug Report

Environment:

  • Hardware: H20 machine with 4 physical RDMA NICs
  • Container setup: 2 GPUs requesting 2 virtual RDMA devices

Issue: RDMA device discovery incorrectly identifies devices in the container environment.

Reproduction:

  1. Deploy container with 2 GPUs on H20 machine (4 physical RDMA NICs available)
  2. Request 2 virtual RDMA devices for the container
  3. Observe incorrect device identification and GID index lookup

Expected: Virtual RDMA devices should be correctly mapped and identified.

Actual: Device discovery fails to properly recognize the virtual RDMA devices.
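
For anyone trying to confirm what the container actually exposes, here is a minimal diagnostic sketch. It is not Mooncake's discovery code; it only assumes the standard Linux sysfs layout (`/sys/class/infiniband/<device>/ports/<port>/gids/<index>`) and lists each visible RDMA device with its non-zero GID entries. The function name `list_rdma_gids` and the `sysfs_root` parameter are made up for this example.

```python
from pathlib import Path

# An all-zero GID marks an unused slot in the GID table.
ZERO_GID = "0000:0000:0000:0000:0000:0000:0000:0000"


def list_rdma_gids(sysfs_root="/sys/class/infiniband"):
    """Return {device: {port: [(gid_index, gid), ...]}} for non-zero GIDs.

    Hypothetical helper: it only reads sysfs and does not call into
    Mooncake or libibverbs.
    """
    root = Path(sysfs_root)
    result = {}
    if not root.is_dir():
        return result  # no RDMA devices are visible at this path
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        ports = {}
        for port in sorted(ports_dir.iterdir()):
            gids = []
            gid_dir = port / "gids"
            if gid_dir.is_dir():
                entries = [p for p in gid_dir.iterdir() if p.name.isdigit()]
                for entry in sorted(entries, key=lambda p: int(p.name)):
                    try:
                        gid = entry.read_text().strip()
                    except OSError:
                        continue  # some unpopulated entries fail to read
                    if gid and gid != ZERO_GID:
                        gids.append((int(entry.name), gid))
            ports[port.name] = gids
        result[dev.name] = ports
    return result


if __name__ == "__main__":
    for dev, ports in list_rdma_gids().items():
        for port, gids in ports.items():
            for idx, gid in gids:
                print(f"{dev} port {port} gid[{idx}] = {gid}")
```

Running this inside the container and on the host, then comparing the two listings, should show whether the 2 requested virtual devices are visible and which GID indexes are populated for each.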

Before submitting...

  • [ ] Ensure you searched for relevant issues and read the [documentation]

uniqueni avatar Nov 12 '25 10:11 uniqueni

@stmatengss I will fix this bug because I have the environment

uniqueni avatar Nov 12 '25 10:11 uniqueni

> @stmatengss I will fix this bug because I have the environment

Thx!

stmatengss avatar Nov 13 '25 06:11 stmatengss

Fixed in https://github.com/kvcache-ai/Mooncake/pull/1077

uniqueni avatar Nov 24 '25 02:11 uniqueni