No such file or directory: '/dev/shm/VLLM'
@jiarong0907 When running the experiment, the engine crashes because the shared memory file /dev/shm/VLLM is missing. Have you run into this issue before? Thanks!
File "/workspace/kvcached/engine_integration/vllm-v0.9.2/vllm/v1/core/single_type_kv_cache_manager.py", line 127, in allocate_new_blocks
new_blocks = self.block_pool.get_new_blocks(num_new_blocks)
File "/workspace/kvcached/engine_integration/vllm-v0.9.2/vllm/v1/core/block_pool.py", line 417, in get_new_blocks
block_ids = self.kv_cache_manager.alloc(num_blocks)
File "/workspace/kvcached/kvcached/kv_cache_manager.py", line 147, in alloc
return self._alloc(need_size)
File "/workspace/kvcached/kvcached/kv_cache_manager.py", line 158, in _alloc
new_mem_size = self.page_allocator.mem_info_tracker.check_and_get_resize_target(
File "/workspace/kvcached/kvcached/mem_info_tracker.py", line 41, in check_and_get_resize_target
with RwLockedShm(self.ipc_name, MemInfoStruct.SHM_SIZE,
File "/workspace/kvcached/kvcached/cli/utils.py", line 75, in __enter__
self.file = open(self.file_path, self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/dev/shm/VLLM'
@cui36 I believe this file is created automatically when kvcached is called: https://github.com/ovg-project/kvcached/blob/main/kvcached/mem_info_tracker.py#L29
Are you using vLLM with kvcached?
Yes. When running the experiment, everything was fine at the beginning, but this bug appeared after a while. Trying to reproduce the issue.
Yes, it can be reproduced. Need to have a closer check.
I'm also encountering this issue, @jiarong0907.
Hi @cui36 @alecngo, could you explain a bit more under what cases you got this problem? Usually, this IPC file is not needed IIRC.
Thanks @alecngo for using our system! I encountered this bug before but didn’t look into it further at the time. I’ll retest and try to reproduce the error.
Hi all, in my case I have three vLLM engines running on one GPU, in a completely fresh environment: everything runs inside a Docker image without any mounts. I set three variables as follows:
ENABLE_KVCACHED=true KVCACHED_AUTOPATCH=1 KVCACHED_IPC_NAME=VLLM
Say I have to send 5 batches to the engine. The first 4 batches go through fine and return good output, but on the very last batch, as the engine is about to be killed, this error occurs and kills my process.
Hi all, when I tried commenting out KVCACHED_IPC_NAME=VLLM, I got another issue:
(EngineCore_DP0 pid=9010) File "/opt/venv/lib/python3.12/site-packages/kvcached/kv_cache_manager.py", line 158, in _alloc
(EngineCore_DP0 pid=9010) new_mem_size = self.page_allocator.mem_info_tracker.check_and_get_resize_target(
(EngineCore_DP0 pid=9010) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9010) File "/opt/venv/lib/python3.12/site-packages/kvcached/mem_info_tracker.py", line 41, in check_and_get_resize_target
(EngineCore_DP0 pid=9010) with RwLockedShm(self.ipc_name, MemInfoStruct.SHM_SIZE,
(EngineCore_DP0 pid=9010) File "/opt/venv/lib/python3.12/site-packages/kvcached/cli/utils.py", line 75, in __enter__
(EngineCore_DP0 pid=9010) self.file = open(self.file_path, self.mode)
(EngineCore_DP0 pid=9010) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=9010) FileNotFoundError: [Errno 2] No such file or directory: '/dev/shm/kvcached_mem_info'
Thanks @alecngo! I would suggest using a different IPC_NAME for each engine. kvcached_mem_info is the default IPC name, so engines that fall back to it can conflict with each other.
This is likely because kvcached assumes that co-running engines have different IPC_NAMEs, and each engine tries to release its IPC shm when it exits.
Thanks for raising this. We could improve the code robustness here.
Alright, this is the fix! Thanks so much! I think it would help to append a random UUID to the IPC name, which would avoid this. I added an index after the default name and it works fine.
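For anyone hitting the same thing, here is a rough sketch of that workaround on the launcher side. It only builds per-engine environments. The helper names `make_ipc_name` and `engine_env` are made up for illustration and are not part of kvcached; the three environment variables are the ones from my setup above.

```python
import os
import uuid


def make_ipc_name(base: str = "kvcached_mem_info") -> str:
    """Hypothetical helper: return the base name plus a short random suffix."""
    return f"{base}_{uuid.uuid4().hex[:8]}"


def engine_env(base: str = "kvcached_mem_info") -> dict:
    """Copy the current environment and give one engine its own IPC name."""
    env = dict(os.environ)
    env["ENABLE_KVCACHED"] = "true"
    env["KVCACHED_AUTOPATCH"] = "1"
    env["KVCACHED_IPC_NAME"] = make_ipc_name(base)
    return env


# One environment per engine; each would be passed to that engine's process,
# for example via subprocess.Popen(..., env=env).
envs = [engine_env() for _ in range(3)]
```

A fixed index per engine (what I actually did) works just as well; the random suffix only saves you from coordinating the indices by hand.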
@cui36 @alecngo The problem should be fixed by PR https://github.com/ovg-project/kvcached/pull/192.
The problem is that when multiple engines are started without explicitly setting the IPC name, they all share the same IPC file. If one engine finishes and exits, it deletes that IPC file, which leaves the engines that are still running unable to find it.
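The failure mode can be illustrated with a toy model that uses a temporary directory in place of /dev/shm. This is only a sketch of the behavior described in this thread, not kvcached code: `FakeEngine` and `survives_peer_exit` are hypothetical names, and the real cleanup logic may differ.

```python
import os
import tempfile


class FakeEngine:
    """Toy stand-in for an engine that keeps its state in one IPC file."""

    def __init__(self, shm_dir: str, ipc_name: str):
        self.path = os.path.join(shm_dir, ipc_name)
        # Create the IPC file if it does not exist yet.
        with open(self.path, "ab"):
            pass

    def alloc(self) -> None:
        # Opening with "rb+" requires the file to exist, like the failing open().
        with open(self.path, "rb+"):
            pass

    def shutdown(self) -> None:
        # On exit the engine removes the IPC file it was using.
        if os.path.exists(self.path):
            os.remove(self.path)


def survives_peer_exit(name_a: str, name_b: str) -> bool:
    """Return True if engine B can still allocate after engine A exits."""
    with tempfile.TemporaryDirectory() as shm_dir:
        a = FakeEngine(shm_dir, name_a)
        b = FakeEngine(shm_dir, name_b)
        a.shutdown()
        try:
            b.alloc()
            return True
        except FileNotFoundError:
            return False


shared = survives_peer_exit("kvcached_mem_info", "kvcached_mem_info")
distinct = survives_peer_exit("engine_0", "engine_1")
```

With a shared name, the second engine hits the same FileNotFoundError as in the traceback; with distinct names, it keeps working after the first engine exits.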