Sungjae Lee

8 issues by Sungjae Lee

## 🐛 Bug Report

Thanks to the great [help](https://github.com/grpc-ecosystem/grpc-gateway/issues/837#issuecomment-1080699455) and [guide](https://grpc-ecosystem.github.io/grpc-gateway/docs/mapping/customizing_openapi_output/#merging-output), I was able to merge the swagger outputs of different services. However, the problem is that the merged output only...

Labels: bug, help wanted, openapi, good first issue

## 🐛 Bug Report

When I split a monolithic service into multiple services and use them to generate a single swagger file, it seems that the numbering logic of...
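To see where such merge problems originate, here is a naive Python stand-in for the merge step (file names hypothetical; grpc-gateway's own merger is more involved). Duplicate path or definition keys across services are exactly the point where renaming or renumbering logic has to kick in:

```python
import json

def merge_swagger(inputs, output):
    """Naively merge several *.swagger.json files into one document."""
    merged = {"swagger": "2.0", "paths": {}, "definitions": {}}
    for name in inputs:
        with open(name) as f:
            doc = json.load(f)
        for section in ("paths", "definitions"):
            for key, value in doc.get(section, {}).items():
                if key in merged[section]:
                    # A real merger must rename/renumber here instead.
                    raise ValueError(f"collision on {section}/{key} in {name}")
                merged[section][key] = value
    with open(output, "w") as f:
        json.dump(merged, f, indent=2)

# Hypothetical per-service outputs from protoc-gen-openapiv2:
merge_swagger(["a.swagger.json", "b.swagger.json"], "merged.swagger.json")
```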

I found that unfused attention kernels (softmax, transpose, ...) can support a sequence length of 32k and are largely resilient to overflow issues. However, the `addRelativeAttentionBiasUnaligned` kernel employs an integer data type...
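To make the overflow concrete, a short sketch with hypothetical shapes (the kernel's actual launch parameters may differ): a flat index into a `[batch, heads, seq_len, seq_len]` bias tensor exceeds the 32-bit signed range at 32k sequence lengths once batch and head offsets are included.

```python
# Hypothetical shapes; the kernel's actual launch parameters may differ.
batch, heads, seq_len = 8, 32, 32768

# Elements a flat index must address in a [batch, heads, seq_len, seq_len]
# relative-attention-bias tensor.
flat_elems = batch * heads * seq_len * seq_len
int32_max = 2**31 - 1
print(f"{flat_elems:,} elements vs int32 max {int32_max:,}")

# Two's-complement wraparound a 32-bit signed index would produce for the
# last element:
last = flat_elems - 1
wrapped = ((last + 2**31) % 2**32) - 2**31
print(f"index of last element as a 32-bit int: {wrapped}")  # -1
```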

## Issues

- https://github.com/ray-project/llmperf/issues/43
- https://github.com/ray-project/llmperf/issues/56

## Summary

- Subsequent requests cannot be sent until all in-flight requests have finished, even in non-blocking mode.
- Fixing the request launcher was challenging due...

Hello, I've encountered an issue where the request launcher does not allow new requests to be sent until all of the requests specified by `num_concurrent_requests` have finished. This behavior seems counterintuitive...
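What the issue asks for is a sliding window rather than batch-and-wait. A minimal sketch of that refill pattern, with a placeholder `send_request` and made-up counts rather than llmperf's actual launcher API:

```python
import concurrent.futures as cf
import time

def send_request(i):
    """Placeholder for one benchmark request; sleeps instead of calling an LLM."""
    time.sleep(0.1 + 0.05 * (i % 5))  # uneven latencies
    return i

NUM_CONCURRENT = 4   # stand-in for num_concurrent_requests
TOTAL = 20

# Keep NUM_CONCURRENT requests in flight and refill as each one completes,
# instead of waiting for the whole batch to drain.
with cf.ThreadPoolExecutor(max_workers=NUM_CONCURRENT) as pool:
    pending = {pool.submit(send_request, i) for i in range(NUM_CONCURRENT)}
    next_i = NUM_CONCURRENT
    while pending:
        done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
        for fut in done:
            print("finished request", fut.result())
            if next_i < TOTAL:
                pending.add(pool.submit(send_request, next_i))
                next_i += 1
```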

Draft PR for RFC: https://github.com/vllm-project/vllm/issues/8333

### Motivation

- When using automatic prefix caching, which manages blocks in an LRU (Least Recently Used) manner, it would be useful to add a pinned-caching feature, where blocks...

Labels: RFC
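A toy sketch of the pinned-caching idea (names and structure are illustrative, not vLLM's block manager): pinned blocks are simply exempt from LRU eviction until unpinned.

```python
from collections import OrderedDict

class LRUBlockPool:
    """Toy LRU block pool with pinning; illustrative only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> payload, oldest first
        self.pinned = set()

    def touch(self, block_id, payload=None):
        """Access (or insert) a block, marking it most recently used."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
        else:
            self._evict_if_full()
            self.blocks[block_id] = payload

    def pin(self, block_id):
        self.pinned.add(block_id)      # exempt from eviction

    def unpin(self, block_id):
        self.pinned.discard(block_id)  # eligible for eviction again

    def _evict_if_full(self):
        if len(self.blocks) < self.capacity:
            return
        for block_id in self.blocks:   # LRU order: oldest first
            if block_id not in self.pinned:
                del self.blocks[block_id]
                return
        raise RuntimeError("all blocks pinned; nothing to evict")
```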

## Summary

Block Manager v2, unlike v1, did not support LoRA and prompt adapters in the block hash in prefix-caching mode. I added logic to inject the LoRA ID...
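A minimal sketch of that idea, assuming a simplified block-hash function (vLLM's real hashing differs in detail): folding the LoRA ID into the key keeps blocks cached under different adapters from colliding.

```python
def block_hash(prev_hash, token_ids, lora_id=None):
    """Simplified prefix-cache key; illustrative, not vLLM's actual hash."""
    return hash((prev_hash, tuple(token_ids), lora_id))

# The same tokens under different adapters now map to distinct cache entries:
h_base = block_hash(None, [101, 7592, 2088])             # no adapter
h_lora = block_hash(None, [101, 7592, 2088], lora_id=3)  # hypothetical LoRA ID
assert h_base != h_lora
```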