hahmad2008

43 comments of hahmad2008

@aarnphm What is the difference between the previous two cases, such that the first case can launch two processes, one for the Ray worker and the other for the BentoML service (that when using...

@winglian, I used FSDP with QLoRA and the model was still loaded as a full copy on each GPU. I tried it with and without passing an accelerate config and got the same...

Seems I need to enable `fsdp_offload_params: true`
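For reference, a minimal sketch of an accelerate-style FSDP config with that key set, plus the launch command. Only `fsdp_offload_params: true` comes from this thread; every other value (and `train.py`) is an assumption, so adjust for your setup:

```bash
# Sketch: write an accelerate FSDP config with parameter offload enabled,
# then launch with it. Key names follow accelerate's fsdp_config section;
# values other than fsdp_offload_params are assumed defaults.
cat > fsdp_config.yaml <<'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_processes: 2
fsdp_config:
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_state_dict_type: SHARDED_STATE_DICT
EOF

# train.py is a placeholder for your training entry point.
accelerate launch --config_file fsdp_config.yaml train.py
```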

@merrymercy thanks for the prompt response. It works with `--max-prefill 4096`. Btw, is the backend vLLM? What are the available backends? For the tokenizer, how should it be set if I didn't...
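For anyone landing here, a sketch of the launch command with that prefill cap applied. The model path is a placeholder, and the flag spelling varies across SGLang versions, so check `--help` for yours:

```bash
# Sketch: SGLang server launch with a capped prefill size.
# --max-prefill 4096 is the flag that worked above; newer releases
# spell it --max-prefill-tokens.
python -m sglang.launch_server \
  --model-path <your-model> \
  --port 30000 \
  --max-prefill 4096
```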

Btw, are there any factors influencing the number of concurrent requests that I should check?

I have the same issue: with V0 I can serve mistral3.1-awq with a 4k context length on a 24G GPU, but I get an OOM if I use V1. [check here.](https://github.com/vllm-project/vllm/issues/16128#issuecomment-2782811982)
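A sketch of the workaround in the meantime: pin the engine back to V0 via the environment variable and cap the context length. The model path is a placeholder; the flags are standard vLLM options:

```bash
# Sketch: force the legacy V0 engine (VLLM_USE_V1=0 on vLLM 0.8.x)
# and cap context length so the model fits on a 24G GPU.
VLLM_USE_V1=0 vllm serve <mistral3.1-awq-path> \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90
```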

@paolovic @hmellor @DarkLight1337 Could you please check this ticket related to vLLM version 0.8.3? https://github.com/vllm-project/vllm/issues/16552

@majestichou @Nietism I have the same issue. Model: `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`; my chat template is as follows: ``` "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{%...
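For context, a custom template can also be supplied at serve time instead of editing the one bundled with the model. A sketch (the template file name is a placeholder, not the truncated template above):

```bash
# Sketch: overriding the chat template at serve time.
# --chat-template accepts a path to a Jinja template file.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --chat-template ./my_template.jinja
```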

@Nietism the `reasoning_content` is null, plus sometimes the first tag in the content is missing. How did you solve it?
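One thing worth ruling out (an assumption on my side, not confirmed in this thread): `reasoning_content` stays null unless the server is started with a reasoning parser. A sketch:

```bash
# Sketch: enabling vLLM's DeepSeek-R1 reasoning parser so the API splits
# <think>-delimited output into reasoning_content vs. content.
# On newer vLLM versions --enable-reasoning is implied by --reasoning-parser.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --enable-reasoning \
  --reasoning-parser deepseek_r1
```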