Alec Ngo
I tried reducing max_num_batched_tokens and it worked. The default value was 8192 and I reduced it to 1024, equal to max_model_len. There is a reduction in speed, though.
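For reference, a minimal sketch of the change, assuming the offline LLM API (the same knob is --max-num-batched-tokens when serving); the model name here is a placeholder:

```python
from vllm import LLM

# Default max_num_batched_tokens was 8192; capping it at max_model_len
# shrinks the scheduler's per-step token budget, trading speed for memory.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",  # placeholder for the actual model
    max_model_len=1024,
    max_num_batched_tokens=1024,
)
```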
The dataset length is still 200k. It is under NDA, so I will just quickly summarize the prompt: around 300 English words asking Qwen to say yes or no if a...
I also encountered this issue, @jiarong0907.
Hi all, in my case I have three vLLM engines running on 1 GPU, and it is a totally fresh environment since the logic is inside a Docker image without any...
Hi all, so when I tried to comment out KVCACHED_IPC_NAME=VLLM, I got another issue:

(EngineCore_DP0 pid=9010)   File "/opt/venv/lib/python3.12/site-packages/kvcached/kv_cache_manager.py", line 158, in _alloc
(EngineCore_DP0 pid=9010)     new_mem_size = self.page_allocator.mem_info_tracker.check_and_get_resize_target(
(EngineCore_DP0 pid=9010)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^...
Alright, this is the fix! Thanks so much! I think it would be helpful to have a random UUID appended to the IPC name, which could avoid this. I added...
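A minimal sketch of the idea, assuming KVCACHED_IPC_NAME is read from the environment before engine startup (as the variable in the traceback above suggests); the suffix scheme is just illustrative:

```python
import os
import uuid

# Append a random suffix so co-located engines don't collide on the
# same kvcached IPC / shared-memory name.
os.environ["KVCACHED_IPC_NAME"] = f"VLLM-{uuid.uuid4().hex[:8]}"
```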
Hi @ivanium, I was using 3 instances of Qwen2 7B on an A100 80GB. I did not set gpu_memory_utilization, so by default it should be 90%. I thought kvcached...
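For context, a rough back-of-envelope, assuming fp16 weights and ignoring activations and CUDA context overhead; how kvcached divides this budget across engines is exactly the open question here:

```python
# 3x Qwen2-7B on an 80 GB A100, fp16 weights assumed.
weights_per_engine_gb = 7e9 * 2 / 1e9         # ~14 GB of parameters each
total_weights_gb = 3 * weights_per_engine_gb  # ~42 GB before any KV cache
# In stock vLLM, gpu_memory_utilization is a per-engine fraction of the
# whole GPU, so three engines at the 0.9 default would each budget ~72 GB.
print(total_weights_gb)
```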
I am correcting myself: looking at the engine setup, I had set gpu_memory_utilization to 0.5 and it failed. Increasing it to 0.8 helped. I think we can call it closed unless...
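For completeness, roughly how I pass it, as a sketch (the model name is again a placeholder; 0.8 is the value that worked):

```python
from vllm import LLM

# One of the three engines sharing the GPU; 0.8 worked where 0.5 failed.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",  # placeholder
    gpu_memory_utilization=0.8,
)
```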
Seems like we need to revisit this at some point. I did not set gpu_memory_utilization, but it is still hit-or-miss. There are other variables that can factor...