Infinistore failed to work in scale testing
🐛 Describe the bug
Steps to Reproduce
8 vLLM instances, a single Infinistore instance with 960 GiB of memory
Expected behavior
It should work as expected.
Environment
- AIBrix version: v0.3.0-rc.1
apiVersion: v1
kind: Pod
metadata:
  name: infinistore
  namespace: default
  labels:
    app: infinistore
spec:
  hostNetwork: true
  containers:
    - name: infinistore
      image: aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:20250506
      command: ["infinistore"]
      args:
        - "--manage-port=8088"
        - "--dev-name=mlx5_0"  # if mlx5_0 is not available, it falls back to other devices
        - "--service-port=12345"
        - "--link-type=Ethernet"
        # - "--hint-gid-index=7"
        - "--log-level=debug"
        - "--prealloc-size=950"
      ports:
        - containerPort: 8088
        - containerPort: 12345
      securityContext:
        privileged: true
      resources:
        requests:
          cpu: "10"
          memory: "960Gi"
        limits:
          cpu: "10"
          memory: "960Gi"
  restartPolicy: Never
KV cache offloading environment variables on the vLLM engine side:

env:
  - name: VLLM_RPC_TIMEOUT
    value: "1000000"
  - name: VLLM_USE_V1
    value: "0"
  - name: AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED
    value: "0"
  - name: AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND
    value: "infinistore"
  - name: AIBRIX_KV_CACHE_OL_INFINISTORE_HOST_ADDR
    value: "192.168.3.35"
  - name: AIBRIX_KV_CACHE_OL_INFINISTORE_SERVICE_PORT
    value: "12345"
  - name: AIBRIX_KV_CACHE_OL_INFINISTORE_CONNECTION_TYPE
    value: "RDMA"
  - name: AIBRIX_KV_CACHE_OL_INFINISTORE_IB_PORT
    value: "1"
  - name: AIBRIX_KV_CACHE_OL_INFINISTORE_LINK_TYPE
    value: "Ethernet"
  - name: AIBRIX_KV_CACHE_OL_INFINISTORE_VISIBLE_DEV_LIST
    value: "mlx5_1:7,mlx5_2:7,mlx5_3:7,mlx5_4:7"
I can confirm the Infinistore server was up before benchmarking, and I also sent a warm-up request before the batch requests.
The client did not see any results returned, but the cache server still appeared to be working at that moment.
Update: this was because I used debug-level logs and they were still flushing to my console.
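As a side note, a plain TCP reachability check against the service and manage ports is a cheap way to confirm the server is accepting connections before benchmarking (it does not exercise the RDMA path). A minimal sketch, using the host and ports from the configuration above:

```python
import socket

# Host and ports taken from the configuration above (client env vars / pod args).
INFINISTORE_HOST = "192.168.3.35"
PORTS = {"service": 12345, "manage": 8088}

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in PORTS.items():
    state = "reachable" if port_open(INFINISTORE_HOST, port) else "NOT reachable"
    print(f"{name} port {port}: {state}")
```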
Client logs:
Server-side logs:
Name: infinistore
Namespace: default
Priority: 0
Service Account: default
Node: 192.168.3.41/192.168.3.41
Start Time: Mon, 19 May 2025 22:21:59 -0700
Labels: app=infinistore
Annotations: <none>
Status: Failed
IP: 192.168.3.41
IPs:
IP: 192.168.3.41
Containers:
infinistore:
Container ID: containerd://2225e4b917b726fcf794ce13f7324b1a48fb633d1a8e42e114a161f98bb1f10b
Image: aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:20250506
Image ID: aibrix-cn-beijing.cr.volces.com/aibrix/infinistore@sha256:9834d23afd302b0ca18f62befcecac7ebb4b19052251f075b714da3320c74135
Ports: 8088/TCP, 12345/TCP
Host Ports: 8088/TCP, 12345/TCP
Command:
/bin/bash
-lc
--
Args:
ulimit -n 65536; infinistore --manage-port=8088 --dev-name=mlx5_0 --service-port=12345 --link-type=Ethernet --prealloc-size=900;
State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 19 May 2025 22:22:00 -0700
Finished: Mon, 19 May 2025 22:32:05 -0700
Ready: False
Restart Count: 0
Limits:
cpu: 10
memory: 960Gi
Requests:
cpu: 10
memory: 960Gi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dzg7g (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-dzg7g:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned default/infinistore to 192.168.3.41
Normal Pulling 10m kubelet Pulling image "aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:20250506"
Normal Pulled 10m kubelet Successfully pulled image "aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:20250506" in 195ms (195ms including waiting). Image size: 6347134232 bytes.
Normal Created 10m kubelet Created container infinistore
Normal Started 10m kubelet Started container infinistore
Normal Killing 58s kubelet Stopping container infinistore
I changed the allocation to 600 GiB.
Memory usage gradually climbed to about 650 GiB and then stopped growing.
root@iv-ydw7g96j9cbw80bvmnmx:/# cat /proc/1/smaps_rollup
56529a3d5000-7ffc40a47000 ---p 00000000 00:00 0 [rollup]
Rss: 684302012 kB
Pss: 684300934 kB
Pss_Anon: 684277276 kB
Pss_File: 23658 kB
Pss_Shmem: 0 kB
Shared_Clean: 1736 kB
Shared_Dirty: 0 kB
Private_Clean: 23000 kB
Private_Dirty: 684277276 kB
Referenced: 684288104 kB
Anonymous: 684277276 kB
LazyFree: 0 kB
AnonHugePages: 516194304 kB
ShmemPmdMapped: 0 kB
FilePmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
root@iv-ydw7g96j9cbw80bvmnmx:/# pmap -x 1 | sort -k3 -n | tail -n 20
00007f2f1786e000 1092 712 0 r-x-- libstdc++.so.6.0.30
000056529a6f7000 2308 932 0 r---- python3.10
00007f2f17ee3000 1024 1008 1008 rw--- [ anon ]
00007f2f12231000 1028 1012 1012 rw--- [ anon ]
00007f2f14b00000 1024 1024 1024 rw--- [ anon ]
00007f2f176d4000 1024 1024 1024 rw--- [ anon ]
00007f2f1234e000 1364 1364 0 r-x-- loop.cpython-310-x86_64-linux-gnu.so
00007f2f182f2000 1620 1400 0 r-x-- libc.so.6
00007f2f1729e000 2428 1596 0 r-x-- libcrypto.so.3
00007f2f16fec000 2048 2048 2048 rw--- [ anon ]
00007f2f18008000 2440 2048 2048 rw--- [ anon ]
00007f2f11de8000 3108 2272 0 r-x-- _pydantic_core.cpython-310-x86_64-linux-gnu.so
00007f2f16678000 7788 2400 0 r-x-- _multiarray_umath.cpython-310-x86_64-linux-gnu.so
00007e98061ff000 2744 2744 2744 rw--- [ anon ]
000056529a442000 2772 2772 0 r-x-- python3.10
00007f2f124f5000 3116 3116 3116 rw--- [ anon ]
00007f2f11361000 10508 10344 10344 rw--- [ anon ]
000056529ab32000 52111508 52111168 52111168 rw--- [ anon ] -> 49.7GiB
00007e9807eaa000 633491692 633478372 633478372 rw--- [ anon ] -> 603.9GiB
total kB 685820584 685640760 685616024
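The GiB annotations on the two largest anonymous mappings are simply the RSS column converted from kB. A quick sketch of that conversion (values copied from the dump above; the region labels are mine):

```python
# RSS values in kB, copied from the pmap output above; the region names are my labels.
rss_kb = {
    "large anon region at 0x56529ab32000": 52_111_168,
    "main data pool at 0x7e9807eaa000": 633_478_372,
    "total (pmap 'total kB' row, RSS column)": 685_640_760,
}

for name, kb in rss_kb.items():
    gib = kb / (1024 * 1024)  # 1 GiB = 1024 * 1024 kB
    print(f"{name}: {gib:.1f} GiB")
# Prints roughly 49.7, 604.1 and 653.9 GiB respectively.
```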
Although we configured `--prealloc-size=600` (GiB), the actual resident memory (RSS) usage reached ~653 GiB in top and /proc/1/smaps_rollup.
This does not indicate a memory leak; it is expected behavior for the following reasons:
- Main data pool (~603 GiB): slightly exceeds 600 GiB due to page faults, Transparent Huge Pages (THP), and memory alignment during physical page commitment.
- Additional anonymous regions (~50 GiB): heap metadata, index structures, internal control buffers, and thread stacks.
- Python runtime + system overhead (~1–2 GiB).
OS-level memory management (e.g., THP, page alignment, allocator fragmentation) causes actual memory usage to exceed the logical allocation; a rough sizing sketch based on these numbers follows below.
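To put rough numbers on this, below is a minimal sizing sketch based solely on the measurements from this run. The ~0.7% pool overshoot and ~50 GiB fixed overhead are estimates read off the dumps above, not documented Infinistore constants, so treat the output as a sanity check rather than a guarantee.

```python
# Rough sizing model fitted to this single run:
#   --prealloc-size=600 GiB -> ~604 GiB data pool + ~50 GiB other anon regions ≈ ~654 GiB RSS.
# Both constants are read off the dumps above, not documented Infinistore behavior.
POOL_OVERSHOOT = 604.0 / 600.0   # THP / page-alignment overshoot of the data pool (~0.7%)
FIXED_OVERHEAD_GIB = 50.0        # heap metadata, index structures, internal buffers, stacks

def expected_rss_gib(prealloc_gib: float) -> float:
    """Estimate peak RSS for a given --prealloc-size (GiB) under this model."""
    return prealloc_gib * POOL_OVERSHOOT + FIXED_OVERHEAD_GIB

def max_prealloc_gib(limit_gib: float, safety_margin_gib: float = 20.0) -> float:
    """Largest --prealloc-size expected to stay safely under the container memory limit."""
    return (limit_gib - FIXED_OVERHEAD_GIB - safety_margin_gib) / POOL_OVERSHOOT

LIMIT_GIB = 960
for prealloc in (950, 900, 600):  # the three runs described in this issue
    print(f"prealloc {prealloc} GiB -> expected RSS ~{expected_rss_gib(prealloc):.0f} GiB "
          f"(container limit {LIMIT_GIB} GiB)")
print(f"suggested max prealloc for a {LIMIT_GIB} GiB limit: ~{max_prealloc_gib(LIMIT_GIB):.0f} GiB")
```

By this estimate, a 950 GiB pool could never fit in a 960 GiB limit, and a 900 GiB pool lands within a few GiB of it, which is consistent with the OOM kills in the first two runs; the third run's 600 GiB leaves ample headroom.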
Summary
- ulimit settings inside the container are not a problem: the limits are already high, and `lsof -p 1 | wc -l` stays below 1000, which is within a reasonable range.
- In the 1st run, the container was configured with a memory limit of 960Gi and the allocation logic tried to use 950Gi. This resulted in a "No buffer space available" error. The root cause appears to be related to uvicorn signal handling: kubelet detected memory usage approaching the container limit and sent a SIGTERM, but the application failed to handle SIGTERM properly, which led to the signal-related error.
- In the 2nd run, the container memory limit remained at 960Gi, but the preallocation was reduced to 900Gi. This time the process was OOMKilled. The likely explanation is that although container memory usage had not yet reached the cgroup limit, the kernel detected insufficient memory when attempting to allocate certain resources and sent SIGKILL directly, bypassing kubelet (a way to check this from inside the container is sketched after this list).
- In the 3rd run, the container limit stayed at 960Gi, but the preallocation was further reduced to 600Gi. The service ran successfully. Monitoring showed peak memory usage of around 650Gi. Besides the main data buffer (600Gi), pmap revealed ~50 GiB likely used by heap metadata, index structures, internal buffers, and thread stacks.
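For future runs, one way to tell a kubelet-initiated stop apart from a kernel OOM kill is to watch the container's own cgroup accounting while the benchmark runs. A minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup inside the container (paths differ under cgroup v1):

```python
from pathlib import Path

# Assumes cgroup v2 mounted at /sys/fs/cgroup inside the container; paths differ under cgroup v1.
CGROUP = Path("/sys/fs/cgroup")
GIB = 1024 ** 3

def read(name: str) -> str:
    return (CGROUP / name).read_text().strip()

current = int(read("memory.current"))
limit_raw = read("memory.max")  # the literal string "max" means unlimited
print(f"memory.current: {current / GIB:.1f} GiB")
if limit_raw != "max":
    limit = int(limit_raw)
    print(f"memory.max:     {limit / GIB:.1f} GiB ({100 * current / limit:.1f}% of limit)")

# memory.events includes 'oom' and 'oom_kill' counters; a non-zero oom_kill indicates
# the kernel OOM killer fired inside this cgroup, as opposed to a kubelet-initiated stop.
for line in read("memory.events").splitlines():
    print("memory.events:", line)
```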