Woosuk Kwon

151 comments by Woosuk Kwon

Hi @JinuJeong, thanks for your interest and good question! vLLM does have the splitting mechanism, and never changes the semantics of the model. Our `InputMetadata` includes the metadata to identify...

@BUAADreamer @liulfy The `model` argument in `LLM` or `api_server` can also take the path to your local directory that contains the weight files.
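As a minimal sketch of what that looks like (the path below is a placeholder; the directory is assumed to contain the usual HF-format config, tokenizer, and weight files):

```python
from vllm import LLM

# Point `model` at a local directory instead of a Hugging Face repo ID.
# The path is a placeholder for illustration only.
llm = LLM(model="/path/to/local/model-dir")

# Generate as usual.
outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```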

@BUAADreamer Thanks for providing the example. It should not use the remote HF repo if the path is valid. Could you try this and let us know if it works?...

@sleepwalker2017 Thanks for trying out vLLM and reporting the performance issue! Yes, our sampler is indeed not well optimized yet. In particular, vLLM performs sampling for one request at a time,...

@emsi Thanks for reporting it! Your beam search output looks very weird. We'll investigate it, but I believe if that is really a bug then the bug should be in...

It seems there's an error in parsing the output of `nvcc -V`. Could you run `nvcc -V` and tell us the output?

@dongkuang Your output doesn't seem wrong. It might be a bug related to the aarch64 architecture, which we haven't tested vLLM on. For now, I'm afraid we don't have any aarch64...

Hi @canghongjian, thanks for trying out vLLM! vLLM runs a simple memory profiling pass and pre-allocates 90% of the total GPU memory for model weights and activations. You can configure this...
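If it helps, here is a minimal sketch of lowering that fraction through the `gpu_memory_utilization` argument (the model name is just an example):

```python
from vllm import LLM

# By default vLLM reserves ~90% of GPU memory for weights, activations,
# and the KV cache; lower the fraction to leave headroom for other
# processes on the same device.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.80)
```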

vLLM currently does not support pipeline parallelism. The `ParallelConfig.pipeline_parallel_size` attribute is for future use. When multiple GPUs are used, vLLM leverages tensor parallelism to shard the model and inputs evenly...
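As an illustration, a minimal sketch of running with tensor parallelism across two GPUs (the model name is a placeholder):

```python
from vllm import LLM

# tensor_parallel_size shards the model's weights across 2 GPUs;
# pipeline parallelism is not involved.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)
```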

Hi @nearmax-p, could you [install vLLM from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source)? Then this error should disappear. Sorry for the inconvenience. We will update our PyPI package very soon.