cui36

8 comments by cui36

> I tried reducing max_num_batched_tokens and it worked. The default value was 8192 and I reduced it to 1024 == max_model_len. There is a reduction of speed, though.

Hi @alecngo,...
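For reference, a minimal sketch of how that cap looks with vLLM's offline `LLM` entry point (the model name and prompt here are illustrative, not from the original thread):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # illustrative; any supported model
    max_model_len=1024,                # context window cap
    max_num_batched_tokens=1024,       # down from the 8192 default; trades speed for memory
)

outputs = llm.generate(
    ["Answer yes or no: is the sky blue?"],  # illustrative prompt
    SamplingParams(max_tokens=8),
)
print(outputs[0].outputs[0].text)
```

Lowering max_num_batched_tokens shrinks the per-step scheduling budget, which reduces peak memory pressure at the cost of throughput, consistent with the slowdown mentioned above.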

> The dataset length is still 200k. It's under NDA, so I'll just quickly summarize: the prompt is around 300 English words asking Qwen to say yes or no if...

Yes. When running the experiment, everything was fine at the beginning, but this bug appeared after a while. I'm trying to reproduce the issue.

Yes, it can be reproduced. I need to take a closer look.

Thanks @alecngo for using our system! I encountered this bug before but didn’t look into it further at the time. I’ll retest and try to reproduce the error.

Great! I think the sleep and wake-up functionality has already been merged, right? @jiarong0907
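Assuming this refers to vLLM's sleep mode, a minimal sketch of that API (the model name is illustrative; sleep mode must be enabled at construction time):

```python
from vllm import LLM

# enable_sleep_mode is required for sleep()/wake_up() to be available
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)

llm.sleep(level=1)  # offload weights to CPU and discard the KV cache, freeing GPU memory
# ... GPU memory can be used by other work here ...
llm.wake_up()       # restore weights and resume serving
```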

> > > @ztang2370 @cui36 Having vllm semantic router is great, but I would suggest we add it as a feature later.
> > > For the example, we can...

I think the prompt `` can be modified to ``, like this: [vision_language.py](https://github.com/vllm-project/vllm/blob/535d80056b72443e68a96c1e4a1049cd9a85587d/examples/offline_inference/vision_language.py#L1382)