Simon Mo
I think this could be a good idea. Are you thinking of offline evaluation using the `LLM` interface, or the server? Any thoughts, @Yard1 @zhuohan123?
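For reference, a minimal sketch of what offline evaluation through the `LLM` interface could look like (the model name, prompts, and sampling settings below are placeholders, not part of the original discussion):

```python
from vllm import LLM, SamplingParams

# Load a model once and run batched, offline generation over the eval set.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

prompts = [
    "What is the capital of France?",
    "Summarize: vLLM is a fast inference engine.",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each RequestOutput carries the prompt and its generated completion(s).
    print(output.prompt, "->", output.outputs[0].text)
```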
@GeauxEric please feel free to open a PR so it's easier to get feedback.
This script can help verify that this works end to end: https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py
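As a rough sketch of the flow that example exercises (the base model and adapter path below are placeholders, not taken from the linked script):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Enable LoRA support on the engine, then attach an adapter per request.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# LoRARequest(name, int_id, local_path) selects which adapter serves this request.
lora_request = LoRARequest("sql-lora", 1, "/path/to/sql-lora-adapter")

outputs = llm.generate(
    ["Write a SQL query that counts users by country."],
    sampling_params,
    lora_request=lora_request,
)
print(outputs[0].outputs[0].text)
```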
> My previous comment is about the truncation side; for various reasons/formats we'd want to trim from either the left or the right, and since it's already a parameter...
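To illustrate the left- vs. right-truncation behaviour being discussed, here is a small sketch using a Hugging Face tokenizer's `truncation_side` attribute (this is a generic tokenizer example, not the vLLM parameter itself):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "one two three four five six seven eight"

# Right truncation keeps the beginning of the prompt and drops the end.
tokenizer.truncation_side = "right"
kept_head = tokenizer(text, truncation=True, max_length=4)

# Left truncation keeps the end of the prompt and drops the beginning.
tokenizer.truncation_side = "left"
kept_tail = tokenizer(text, truncation=True, max_length=4)

print(tokenizer.decode(kept_head["input_ids"]))
print(tokenizer.decode(kept_tail["input_ids"]))
```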
I trust @njhill to decide and merge.
Setting tp=4 (tensor parallelism across 4 GPUs) is the most effective fix.
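Concretely, that corresponds to the `tensor_parallel_size` argument (the model name here is just a placeholder):

```python
from vllm import LLM

# Shard the model across 4 GPUs with tensor parallelism.
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4)
```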
@architkulkarni
We now support the full range of constrained/guided decoding, powered by Outlines. Closing this as completed.
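A hedged sketch of how this can be used against the OpenAI-compatible server (the base URL, model name, and choices below are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Is vLLM fast? Answer yes or no."}],
    # vLLM-specific extension: constrain the output to one of the listed choices.
    extra_body={"guided_choice": ["yes", "no"]},
)
print(completion.choices[0].message.content)
```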
If anyone has bandwidth to help us implement ChatGLM support, please leave a comment and coordinate here: https://github.com/vllm-project/vllm/issues/1552