No such file or directory: '/dev/shm/VLLM'

Open cui36 opened this issue 2 months ago • 11 comments

@jiarong0907 While running the experiment, the engine crashes because the shared memory file /dev/shm/VLLM is missing. Have you seen this issue before? Thanks!

File "/workspace/kvcached/engine_integration/vllm-v0.9.2/vllm/v1/core/single_type_kv_cache_manager.py", line 127, in allocate_new_blocks
    new_blocks = self.block_pool.get_new_blocks(num_new_blocks)
File "/workspace/kvcached/engine_integration/vllm-v0.9.2/vllm/v1/core/block_pool.py", line 417, in get_new_blocks
    block_ids = self.kv_cache_manager.alloc(num_blocks)
File "/workspace/kvcached/kvcached/kv_cache_manager.py", line 147, in alloc
    return self._alloc(need_size)
File "/workspace/kvcached/kvcached/kv_cache_manager.py", line 158, in _alloc
    new_mem_size = self.page_allocator.mem_info_tracker.check_and_get_resize_target(
File "/workspace/kvcached/kvcached/mem_info_tracker.py", line 41, in check_and_get_resize_target
    with RwLockedShm(self.ipc_name, MemInfoStruct.SHM_SIZE,
File "/workspace/kvcached/kvcached/cli/utils.py", line 75, in __enter__
    self.file = open(self.file_path, self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/dev/shm/VLLM'

cui36 avatar Sep 28 '25 05:09 cui36

@cui36 I believe this file is created automatically when kvcached is called. https://github.com/ovg-project/kvcached/blob/main/kvcached/mem_info_tracker.py#L29

Are you using vLLM with kvcached?

jiarong0907 avatar Sep 28 '25 18:09 jiarong0907

Yes. When running the experiment, everything was fine at the beginning, but this bug appeared after a while. Trying to reproduce the issue.

cui36 avatar Sep 28 '25 22:09 cui36

Yes, it can be reproduced. I need to take a closer look.

cui36 avatar Sep 29 '25 06:09 cui36

I also encountered this issue, @jiarong0907.

alecngo avatar Oct 25 '25 08:10 alecngo

Hi @cui36 @alecngo, could you explain a bit more about the conditions under which you hit this problem? Usually, this IPC file is not needed IIRC.

jiarong0907 avatar Oct 25 '25 14:10 jiarong0907

Thanks @alecngo for using our system! I encountered this bug before but didn’t look into it further at the time. I’ll retest and try to reproduce the error.

cui36 avatar Oct 25 '25 14:10 cui36

Hi all, in my case I have three vLLM engines running on 1 GPU, and it is a totally fresh environment since the logic is inside a Docker image without any mounts. I set three vars as follows:

ENABLE_KVCACHED=true KVCACHED_AUTOPATCH=1 KVCACHED_IPC_NAME=VLLM

Let's say I have to send 5 batches into the engine. The first 4 batches seem to go through fine and return good output. But at the very last batch, as the engine is about to be killed, this error happens and kills my process.

alecngo avatar Oct 25 '25 16:10 alecngo

Hi all, so when I tried to comment out KVCACHED_IPC_NAME=VLLM, I got another issue:

(EngineCore_DP0 pid=9010) File "/opt/venv/lib/python3.12/site-packages/kvcached/kv_cache_manager.py", line 158, in _alloc
(EngineCore_DP0 pid=9010)     new_mem_size = self.page_allocator.mem_info_tracker.check_and_get_resize_target(
(EngineCore_DP0 pid=9010)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9010) File "/opt/venv/lib/python3.12/site-packages/kvcached/mem_info_tracker.py", line 41, in check_and_get_resize_target
(EngineCore_DP0 pid=9010)     with RwLockedShm(self.ipc_name, MemInfoStruct.SHM_SIZE,
(EngineCore_DP0 pid=9010) File "/opt/venv/lib/python3.12/site-packages/kvcached/cli/utils.py", line 75, in __enter__
(EngineCore_DP0 pid=9010)     self.file = open(self.file_path, self.mode)
(EngineCore_DP0 pid=9010)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9010) FileNotFoundError: [Errno 2] No such file or directory: '/dev/shm/kvcached_mem_info'

alecngo avatar Oct 25 '25 17:10 alecngo

Thanks @alecngo! I would suggest using a different IPC_NAME for each engine. kvcached_mem_info is the default IPC name, so engines that leave it unset can conflict with each other.

This could be because kvcached assumes co-running engines must have different IPC_NAMEs, and they will try to release the IPC shm during engine exit.

Thanks for raising this. We could improve the code robustness here.

ivanium avatar Oct 25 '25 17:10 ivanium

Alright, this is the fix! Thanks so much! I think it would be helpful to append a random UUID to the IPC name, which would avoid this. I added an index after the default name and it works just right.
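
For anyone applying the same workaround, here is a minimal sketch of giving each engine its own IPC name. It only builds the environment for each engine; `make_engine_env` is a hypothetical helper, not part of kvcached, and the only names taken from this thread are the `ENABLE_KVCACHED` and `KVCACHED_IPC_NAME` variables and the default name `kvcached_mem_info`.

```python
import os
import uuid


def make_engine_env(base_name: str = "kvcached_mem_info") -> dict:
    """Hypothetical helper: copy the current environment and give it a unique IPC name."""
    env = os.environ.copy()
    env["ENABLE_KVCACHED"] = "true"
    # A short random suffix keeps co-running engines from sharing one /dev/shm file.
    env["KVCACHED_IPC_NAME"] = f"{base_name}_{uuid.uuid4().hex[:8]}"
    return env


# One environment per engine; each gets its own IPC name.
envs = [make_engine_env() for _ in range(3)]
names = [e["KVCACHED_IPC_NAME"] for e in envs]
print(names)
```

Each `env` can then be passed to whatever launches the engine (for example `subprocess.Popen(cmd, env=env)`); the launch command itself depends on your setup and is not shown.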

alecngo avatar Oct 25 '25 17:10 alecngo

@cui36 @alecngo The problem should be fixed by PR https://github.com/ovg-project/kvcached/pull/192.

The problem is that when multiple engines are started without explicitly setting the IPC name, they share the same IPC file. If one engine finishes and exits, it deletes that IPC file, which leaves the unfinished engines unable to find the IPC file for deletion.
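
The failure mode can be illustrated with a small standalone simulation. This is only a sketch under stated assumptions: `FakeEngine` and `survives_peer_exit` are hypothetical stand-ins, a temporary directory plays the role of /dev/shm, and none of this is kvcached's actual code.

```python
import os
import tempfile


class FakeEngine:
    """Hypothetical stand-in for an engine that uses a named IPC file."""

    def __init__(self, shm_dir: str, ipc_name: str):
        self.path = os.path.join(shm_dir, ipc_name)
        # Create the file if it does not exist yet (append mode never truncates).
        with open(self.path, "ab"):
            pass

    def touch_ipc(self) -> None:
        # "rb+" requires the file to exist, so this raises FileNotFoundError if it was removed.
        with open(self.path, "rb+"):
            pass

    def shutdown(self) -> None:
        # On exit the engine releases "its" IPC file, even if a peer still uses it.
        if os.path.exists(self.path):
            os.remove(self.path)


def survives_peer_exit(name_a: str, name_b: str) -> bool:
    """Return True if engine B can still open its IPC file after engine A exits."""
    with tempfile.TemporaryDirectory() as shm_dir:
        a = FakeEngine(shm_dir, name_a)
        b = FakeEngine(shm_dir, name_b)
        a.shutdown()
        try:
            b.touch_ipc()
            return True
        except FileNotFoundError:
            return False


shared = survives_peer_exit("kvcached_mem_info", "kvcached_mem_info")
distinct = survives_peer_exit("kvcached_mem_info_0", "kvcached_mem_info_1")
print(shared, distinct)
```

With a shared name, the second engine loses its file as soon as the first one exits; with distinct names, it is unaffected.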

jiarong0907 avatar Oct 25 '25 20:10 jiarong0907