Sungjae Lee
## 🐛 Bug Report Thanks to the great [help](https://github.com/grpc-ecosystem/grpc-gateway/issues/837#issuecomment-1080699455) and [guide](https://grpc-ecosystem.github.io/grpc-gateway/docs/mapping/customizing_openapi_output/#merging-output), I was able to merge the swagger outputs of different services. However, the problem is that the merged output only...
## 🐛 Bug Report When I split a single monolithic service into multiple services and generate a single swagger file from them, it seems that the numbering logic of...
I found that the unfused attention kernels (softmax, transpose, etc.) can support a sequence length of 32k and are largely resilient to overflow issues. However, the `addRelativeAttentionBiasUnaligned` kernel employs an integer data type...
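The shapes and numbers below are hypothetical and only illustrate the failure mode in question, not the kernel's actual code: with 32-bit signed index arithmetic, the flattened offsets of a 32k-length relative-bias tensor exceed `INT32_MAX` and wrap.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max                 # 2_147_483_647

# Hypothetical shapes, for illustration only.
num_heads, seq_len = 32, 32 * 1024                 # 32k sequence length

# Linear offset of the last element of a [num_heads, seq_len, seq_len] bias tensor.
last_offset = num_heads * seq_len * seq_len - 1    # 34_359_738_367

print(last_offset > INT32_MAX)                     # True: a 32-bit index cannot reach it

# Two's-complement wrap that a signed 32-bit index would produce for that offset.
wrapped = ((last_offset + 2**31) % 2**32) - 2**31
print(wrapped)                                     # -1, i.e. a negative / out-of-bounds index
```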
## Issues https://github.com/ray-project/llmperf/issues/43 https://github.com/ray-project/llmperf/issues/56 ## Summary - Subsequent requests cannot be sent until all in-flight requests have finished, even in non-blocking mode. - Fixing the request launcher was challenging due...
Hello, I've encountered an issue where the request launcher does not allow the next request to be sent until all of the requests specified by `num_concurrent_requests` have finished. This behavior seems counterintuitive...
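A minimal sketch of the behavior I would expect instead, assuming an asyncio-based launcher and a hypothetical `send_request` coroutine (this is not llmperf's actual code): a semaphore keeps exactly `num_concurrent_requests` requests in flight and frees a slot as soon as any single request completes, rather than waiting for the whole batch.

```python
import asyncio

async def send_request(i: int) -> None:
    # Hypothetical request coroutine; stands in for one LLM API call.
    await asyncio.sleep(0.1 + (i % 5) * 0.05)

async def launch(total_requests: int, num_concurrent_requests: int) -> None:
    # A slot frees up as soon as *one* request finishes, so the next request
    # starts immediately instead of waiting for the whole batch to complete.
    sem = asyncio.Semaphore(num_concurrent_requests)

    async def run_one(i: int) -> None:
        async with sem:
            await send_request(i)

    await asyncio.gather(*(run_one(i) for i in range(total_requests)))

if __name__ == "__main__":
    asyncio.run(launch(total_requests=20, num_concurrent_requests=4))
```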
Drafts with RFC: https://github.com/vllm-project/vllm/issues/8333
### Motivation - When using automatic prefix caching, which manages blocks in an LRU (Least Recently Used) manner, it would be useful to add a pinned caching feature, where blocks...
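A minimal sketch of what pinning would mean, using a toy LRU block cache (names such as `BlockCache`, `pin`, and `unpin` are illustrative, not vLLM's API): pinned blocks are skipped by eviction regardless of how recently they were used.

```python
from collections import OrderedDict

class BlockCache:
    """Toy LRU cache of KV-cache blocks with a pin/unpin escape hatch."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks: "OrderedDict[int, bytes]" = OrderedDict()  # block_id -> data
        self.pinned: set[int] = set()

    def touch(self, block_id: int, data: bytes) -> None:
        # Insert or refresh a block; evict the LRU *unpinned* block if over capacity.
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        while len(self.blocks) > self.capacity:
            victim = next((b for b in self.blocks if b not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; nothing can be evicted
            del self.blocks[victim]

    def pin(self, block_id: int) -> None:
        self.pinned.add(block_id)      # never evicted until unpinned

    def unpin(self, block_id: int) -> None:
        self.pinned.discard(block_id)  # becomes a normal LRU candidate again
```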
[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching
## Summary Unlike v1, Block Manager v2 did not account for the LoRA and prompt adapter in the block hash in prefix caching mode. I added logic to inject the LoRA ID...
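A minimal sketch of the idea, not vLLM's actual implementation: the content-based block hash is extended so that identical token blocks served under different LoRA adapters (or prompt adapters) no longer collide in the prefix cache. Names such as `compute_block_hash`, `lora_id`, and `prompt_adapter_id` are illustrative.

```python
from typing import Optional, Sequence

def compute_block_hash(
    prev_block_hash: Optional[int],
    token_ids: Sequence[int],
    lora_id: Optional[int] = None,
    prompt_adapter_id: Optional[int] = None,
) -> int:
    # Chain the previous block's hash with this block's tokens, plus the
    # adapter identifiers, so that the same token content under a different
    # LoRA / prompt adapter maps to a different cached block.
    return hash((prev_block_hash, tuple(token_ids), lora_id, prompt_adapter_id))

# Same tokens, different LoRA: must not reuse each other's cached KV block.
h_base = compute_block_hash(None, [1, 2, 3])
h_lora = compute_block_hash(None, [1, 2, 3], lora_id=7)
assert h_base != h_lora
```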