
Infinistore failed to work in scale testing

Open · Jeffwan opened this issue 7 months ago • 10 comments

🐛 Describe the bug

Image

Image

Steps to Reproduce

8 vLLM instances and a single Infinistore instance with 960 GiB of memory

Expected behavior

It should work as expected.

Environment

  • AIBrix version: v0.3.0-rc.1

Infinistore pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: infinistore
  namespace: default
  labels:
    app: infinistore
spec:
  hostNetwork: true
  containers:
  - name: infinistore
    image: aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:20250506
    command: ["infinistore"]
    args:
      - "--manage-port=8088"
      - "--dev-name=mlx5_0" # if mlx5_0 is not available, it will fallbacks to other devs
      - "--service-port=12345"
      - "--link-type=Ethernet"
#      - "--hint-gid-index=7"
      - "--log-level=debug"
      - "--prealloc-size=950"
    ports:
      - containerPort: 8088
      - containerPort: 12345
    securityContext:
      privileged: true
    resources:
      requests:
        cpu: "10"
        memory: "960Gi"
      limits:
        cpu: "10"
        memory: "960Gi"
  restartPolicy: Never

Environment variables on the vLLM engine containers (KV cache offloading to Infinistore):

          env:
            - name: VLLM_RPC_TIMEOUT
              value: "1000000"
            - name: VLLM_USE_V1
              value: "0"
            - name: AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED
              value: "0"
            - name: AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND
              value: "infinistore"
            - name: AIBRIX_KV_CACHE_OL_INFINISTORE_HOST_ADDR
              value: "192.168.3.35"
            - name: AIBRIX_KV_CACHE_OL_INFINISTORE_SERVICE_PORT
              value: "12345"
            - name: AIBRIX_KV_CACHE_OL_INFINISTORE_CONNECTION_TYPE
              value: "RDMA"
            - name: AIBRIX_KV_CACHE_OL_INFINISTORE_IB_PORT
              value: "1"
            - name: AIBRIX_KV_CACHE_OL_INFINISTORE_LINK_TYPE
              value: "Ethernet"
            - name: AIBRIX_KV_CACHE_OL_INFINISTORE_VISIBLE_DEV_LIST
              value: "mlx5_1:7,mlx5_2:7,mlx5_3:7,mlx5_4:7"

Jeffwan · May 19 '25 17:05

Image

I can confirm that the Infinistore server was up before benchmarking; I also sent a warm-up request before the batch requests.

Jeffwan · May 19 '25 20:05

Image

The client didn't see any result returned, but the cache server still seems to be working at this moment.

Update: it's because I used debug-level logging and the server was still flushing logs to my console.

Jeffwan · May 19 '25 20:05

Client logs:

Image

Jeffwan · May 19 '25 20:05

Server-side logs:

Image

Jeffwan · May 19 '25 21:05

Image

Jeffwan · May 20 '25 05:05

Image

Jeffwan · May 20 '25 05:05

Name:             infinistore
Namespace:        default
Priority:         0
Service Account:  default
Node:             192.168.3.41/192.168.3.41
Start Time:       Mon, 19 May 2025 22:21:59 -0700
Labels:           app=infinistore
Annotations:      <none>
Status:           Failed
IP:               192.168.3.41
IPs:
  IP:  192.168.3.41
Containers:
  infinistore:
    Container ID:  containerd://2225e4b917b726fcf794ce13f7324b1a48fb633d1a8e42e114a161f98bb1f10b
    Image:         aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:20250506
    Image ID:      aibrix-cn-beijing.cr.volces.com/aibrix/infinistore@sha256:9834d23afd302b0ca18f62befcecac7ebb4b19052251f075b714da3320c74135
    Ports:         8088/TCP, 12345/TCP
    Host Ports:    8088/TCP, 12345/TCP
    Command:
      /bin/bash
      -lc
      --
    Args:
      ulimit -n 65536; infinistore --manage-port=8088 --dev-name=mlx5_0 --service-port=12345 --link-type=Ethernet --prealloc-size=900;
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 19 May 2025 22:22:00 -0700
      Finished:     Mon, 19 May 2025 22:32:05 -0700
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     10
      memory:  960Gi
    Requests:
      cpu:        10
      memory:     960Gi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dzg7g (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-dzg7g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10m   default-scheduler  Successfully assigned default/infinistore to 192.168.3.41
  Normal  Pulling    10m   kubelet            Pulling image "aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:20250506"
  Normal  Pulled     10m   kubelet            Successfully pulled image "aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:20250506" in 195ms (195ms including waiting). Image size: 6347134232 bytes.
  Normal  Created    10m   kubelet            Created container infinistore
  Normal  Started    10m   kubelet            Started container infinistore
  Normal  Killing    58s   kubelet            Stopping container infinistore
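
Since the container was OOMKilled, one way to confirm what the kernel actually killed (a hedged sketch, not from the original report; run on the node 192.168.3.41 with root privileges):

# kernel log entries from the OOM killer
dmesg -T | grep -iE 'out of memory|oom-kill|killed process'

# or, on systemd-based nodes
journalctl -k | grep -i oom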

Jeffwan · May 20 '25 05:05

I changed the preallocation to 600 GiB (--prealloc-size=600).

Image

Memory usage gradually climbed to ~650 GiB and then stopped growing.

Image

root@iv-ydw7g96j9cbw80bvmnmx:/# cat /proc/1/smaps_rollup
56529a3d5000-7ffc40a47000 ---p 00000000 00:00 0                          [rollup]
Rss:            684302012 kB
Pss:            684300934 kB
Pss_Anon:       684277276 kB
Pss_File:          23658 kB
Pss_Shmem:             0 kB
Shared_Clean:       1736 kB
Shared_Dirty:          0 kB
Private_Clean:     23000 kB
Private_Dirty:  684277276 kB
Referenced:     684288104 kB
Anonymous:      684277276 kB
LazyFree:              0 kB
AnonHugePages:  516194304 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
root@iv-ydw7g96j9cbw80bvmnmx:/# pmap -x 1 | sort -k3 -n | tail -n 20
00007f2f1786e000    1092     712       0 r-x-- libstdc++.so.6.0.30
000056529a6f7000    2308     932       0 r---- python3.10
00007f2f17ee3000    1024    1008    1008 rw---   [ anon ]
00007f2f12231000    1028    1012    1012 rw---   [ anon ]
00007f2f14b00000    1024    1024    1024 rw---   [ anon ]
00007f2f176d4000    1024    1024    1024 rw---   [ anon ]
00007f2f1234e000    1364    1364       0 r-x-- loop.cpython-310-x86_64-linux-gnu.so
00007f2f182f2000    1620    1400       0 r-x-- libc.so.6
00007f2f1729e000    2428    1596       0 r-x-- libcrypto.so.3
00007f2f16fec000    2048    2048    2048 rw---   [ anon ]
00007f2f18008000    2440    2048    2048 rw---   [ anon ]
00007f2f11de8000    3108    2272       0 r-x-- _pydantic_core.cpython-310-x86_64-linux-gnu.so
00007f2f16678000    7788    2400       0 r-x-- _multiarray_umath.cpython-310-x86_64-linux-gnu.so
00007e98061ff000    2744    2744    2744 rw---   [ anon ]
000056529a442000    2772    2772       0 r-x-- python3.10
00007f2f124f5000    3116    3116    3116 rw---   [ anon ]
00007f2f11361000   10508   10344   10344 rw---   [ anon ]
000056529ab32000 52111508 52111168 52111168 rw---   [ anon ]                    -> 49.7GiB
00007e9807eaa000 633491692 633478372 633478372 rw---   [ anon ]         -> 603.9GiB
total kB         685820584 685640760 685616024

Jeffwan · May 20 '25 05:05

Although we configured --prealloc-size=600 (GiB), actual resident memory (RES) usage reached ~653 GiB in both top and /proc/1/smaps_rollup.

This does not indicate a memory leak; it is expected behavior for the following reasons:

  • Main data pool (~603 GiB): slightly exceeds 600 GiB due to page faults, Transparent Huge Pages (THP), and memory alignment during physical page commitment.
  • Additional anonymous regions (~50 GiB): heap metadata, index structures, internal control buffers, and thread stacks.
  • Python runtime + system overhead: ~1–2 GiB.

OS-level memory management (e.g., THP, page alignment, allocator fragmentation) causes actual memory usage to exceed the logical allocation.
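
As a rough check on the arithmetic (a small sketch; the kB values are copied from the pmap/smaps output above, and 1 GiB = 1048576 kB):

awk 'BEGIN {
  printf "total RSS:       %.1f GiB\n", 684302012 / 1048576   # Rss from /proc/1/smaps_rollup
  printf "main data pool:  %.1f GiB\n", 633478372 / 1048576   # largest anonymous mapping in pmap
  printf "other anonymous: %.1f GiB\n", 52111168 / 1048576    # second-largest anonymous mapping
}'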

Jeffwan · May 20 '25 05:05

Summary

  • ulimit settings inside the container are not a problem: the limits are already high, and lsof -p 1 | wc -l stays below 1000, which is well within a reasonable range.

  • In the 1st run, the container had a memory limit of 960Gi and the allocation logic tried to use 950Gi, which resulted in a "No buffer space available" error. The root cause appears to be related to uvicorn signal handling: kubelet detected memory usage approaching the container limit and sent a SIGTERM, but the application failed to handle SIGTERM properly, which led to the signal-related error.

  • In the 2nd run, the container memory limit remained at 960Gi but the preallocation was reduced to 900Gi. This time the process was OOMKilled. The likely explanation is that, although container memory usage had not yet reached the cgroup limit, the kernel found insufficient memory when attempting certain allocations and sent a SIGKILL directly, bypassing kubelet.

  • In the 3rd run, the container limit stayed at 960Gi and the preallocation was further reduced to 600Gi. The service ran successfully, with peak memory usage around 650Gi. Besides the main data buffer (600Gi), pmap revealed ~50GiB likely used by heap metadata, index structures, internal buffers, and thread stacks; a rough sizing check follows this list.
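
Based on the three runs, a reasonable rule of thumb seems to be: expected peak ≈ --prealloc-size plus roughly 50 GiB of process overhead, kept comfortably below the container memory limit. A minimal back-of-the-envelope sketch of that check (the variable names and the 50 GiB overhead figure come from this thread, not from Infinistore documentation):

# hypothetical sizing check: expected peak vs. container limit
PREALLOC_GIB=600   # --prealloc-size
OVERHEAD_GIB=50    # extra anonymous memory observed in pmap above
LIMIT_GIB=960      # container memory limit
echo "expected peak ~$((PREALLOC_GIB + OVERHEAD_GIB)) GiB vs container limit ${LIMIT_GIB} GiB"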

Jeffwan · May 20 '25 17:05