Roger Wang
Similar to what @ikalista mentioned in the original discussion, imo a better way is to mount model storage into the container for model loading, unless we want to rewrite the...
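For context, here is a minimal sketch of that mounting approach, assuming the standard `vllm/vllm-openai` Docker image and the default Hugging Face cache location; the paths, image tag, and model name are illustrative assumptions, not details from the original thread:

```bash
# Sketch only: mount the host's model cache into the container so weights are
# loaded from mounted storage rather than baked into the image or re-downloaded
# on every container start. Paths and image tag are assumptions.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```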
> @ywang96 Is anybody working on direct model loading? Do we have a benchmark comparing mounting against loading directly into memory? Happy to work on this if nobody else...
@rabaja Can you share what's inside `./benchmark_serving.sh`? I cannot repro this with our benchmark script in the main branch.

My server launch command:

```
vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Benchmark launch...
It would be great if you could clone the latest main branch and confirm that the benchmark script works for you.
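For reference, a plausible benchmark invocation against the server command quoted above, assuming the `benchmarks/benchmark_serving.py` script that ships in the vLLM repo; since the actual contents of `./benchmark_serving.sh` were not shared, the flags, dataset, and prompt count below are assumptions:

```bash
# Hypothetical benchmark run; the real ./benchmark_serving.sh may differ.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000
```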
@p88h This is amazing! Have you tried running some benchmarks to see the throughput impact of this PR?
cc @youkaichao
@DarkLight1337 @WoosukKwon Here's a short repro script - let me know if this is reasonable.

```python
import time

from vllm import LLM

st = time.perf_counter()
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", enforce_eager=True)
print("Time...
```
Oops - I forgot to turn on the ready label and auto-merge. Doing it now!
We're a bit overwhelmed by things to work on, so any help/contribution is definitely welcome! Supporting this model should be straightforward since it's also LLaVA-style, like many other VLMs we...
> Is the support included in release 0.6.2?

@premg16 0.6.2 has already been released, so no, but we will make a new release when this model is supported by...