Nano vLLM
This patch is the first in a series showing how nano-vllm performs on AMD platforms.
Hi @GeeeekExplorer, thanks for your great work on nano-vllm. I just tried it on both an AMD CDNA datacenter GPU and an AMD RDNA3/4 desktop GPU, and it can work on both of...
Fix: Correct off-by-one error in KV-Cache block allocation
This pull request addresses a critical off-by-one error in the BlockManager's logic for allocating new KV-Cache blocks during the decoding phase. The...
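The full diff is not shown here, but a minimal, hypothetical illustration of this class of bug (the function names and block size below are assumptions, not the repo's actual code) is the difference between floor and ceiling division when counting how many KV-cache blocks a sequence needs:

```python
import math

BLOCK_SIZE = 16  # illustrative block size, not necessarily nano-vllm's default

def blocks_needed_buggy(num_tokens: int) -> int:
    # Floor division drops the partially filled final block: 17 tokens
    # report 1 block even though the 17th token spills into a second one.
    return num_tokens // BLOCK_SIZE

def blocks_needed_fixed(num_tokens: int) -> int:
    # Ceiling division also counts a partially filled final block.
    return math.ceil(num_tokens / BLOCK_SIZE)
```

During decoding this matters because the sequence grows one token at a time, so the off-by-one surfaces exactly at block boundaries.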
PR Description
What does this PR do?
This PR introduces full support for the Qwen2 large language model (LLM) in the project.
In #71 #66 #65 #30 , there were questions about the timing of applying `can_append` and `may_append` for requesting new blocks. This PR will separate the logic for appending new...
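As a hedged sketch of the separation this PR describes (all names and semantics below are assumptions for illustration, not the repo's exact code), the idea is to keep `can_append` as a pure, read-only admission check and let `may_append` do the mutating block allocation, assuming the sequence length is inspected before the sampled token is appended:

```python
class BlockManager:
    """Illustrative block manager with a free-list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_block_ids = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of allocated block ids

    def can_append(self, seq_len: int) -> bool:
        # Pure check, no state change: a free block is required only
        # when the currently allocated blocks are exactly full.
        if seq_len % self.block_size == 0:
            return len(self.free_block_ids) > 0
        return True

    def may_append(self, seq_id: int, seq_len: int) -> None:
        # Mutating step: actually take a block when a new one is needed.
        if seq_len % self.block_size == 0:
            block_id = self.free_block_ids.pop()
            self.block_tables.setdefault(seq_id, []).append(block_id)
```

Keeping the check side-effect free lets a scheduler call `can_append` on many sequences before committing to any allocation.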
Hi nano-vllm team, this is Yue, nice to meet you! I'm just a fan of this repo and learning from it. As I read the code of block_manager, if I understand correctly (IIUC) the current...
The `can_append` function in the `BlockManager` returns a boolean that indicates whether we can store a sampled token for the given sequence. Currently, the code snippet `len(seq) % self.block_size ==...
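The snippet above is truncated, but the crux of the discussion can be illustrated with a hypothetical sketch (not the repo's exact code): whether the "new block needed" test compares against `0` or `1` depends on whether `len(seq)` is inspected before or after the sampled token has been appended.

```python
def needs_new_block_before_append(seq_len: int, block_size: int) -> bool:
    # Checked BEFORE appending: the next token starts a fresh block
    # exactly when all current blocks are full.
    return seq_len % block_size == 0

def needs_new_block_after_append(seq_len: int, block_size: int) -> bool:
    # Checked AFTER the token is already counted in seq_len: the token
    # just added is the first of a fresh block when the length is one
    # past a block boundary.
    return seq_len % block_size == 1
```

Both conditions describe the same event; they only differ in which side of the append they observe the length from.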
This PR introduces a new benchmark script, `serving_bench.py`, to evaluate the engine's performance under a continuous load of incoming requests, simulating a real-world serving scenario. **Note:** This PR is purely...
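As a hedged sketch of what such a serving benchmark typically does (the generator below is an assumption for illustration, not the actual `serving_bench.py`), requests arrive over time with random inter-arrival gaps, forming a Poisson process, rather than being submitted all at once:

```python
import random

def simulate_arrivals(num_requests: int, rate_per_s: float, seed: int = 0):
    """Yield (arrival_time_s, request_id) pairs with exponentially
    distributed inter-arrival gaps, i.e. a Poisson arrival process."""
    rng = random.Random(seed)
    t = 0.0
    for i in range(num_requests):
        t += rng.expovariate(rate_per_s)
        yield t, i
```

The benchmark would then feed each request to the engine at its arrival time and record per-request latency alongside overall throughput.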