Woosuk Kwon
@nearmax-p thanks for reporting it. Could you share how much CPU memory you have? It seems this bug occurs when there isn't enough CPU memory. We haven't succeeded...
@nearmax-p Then it's very weird. We've tested the model on exactly the same setup. Which type of disk are you using? And if possible, could you re-install vLLM and try...
@nearmax-p Thanks! That would be very helpful.
@nearmax-p If you are using Docker, could you try increasing the shared memory size (e.g., to 64G)? NCCL and PyTorch's multiprocessing use `/dev/shm` for inter-process communication, and Docker's default shared memory size is quite small:
```bash
docker run --gpus all -it --rm --shm-size=64g nvcr.io/nvidia/pytorch:22.12-py3
```
Hi @mshumer, could you provide a reproducible example?
According to my experiments, the PR not only reduces latency but also increases throughput by ~7%. Great work!
@scv119 BTW, I think the title of the PR is misleading; the PR changes expert parallelism into tensor parallelism, which was the original implementation by Mistral AI.
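For readers unfamiliar with the distinction, here is a minimal, self-contained sketch (not vLLM's actual implementation; all sizes and names are illustrative) of how the two schemes shard a Mixtral-style MoE layer: expert parallelism assigns whole experts to each rank, while tensor parallelism gives every rank a column slice of every expert.
```python
# Conceptual sketch only — not vLLM's code. All sizes are illustrative.
import numpy as np

num_experts, hidden, ffn = 8, 16, 32   # hypothetical layer dimensions
world_size = 4                          # number of GPUs (ranks)
rng = np.random.default_rng(0)
# One up-projection weight per expert (gating and down-projection omitted).
experts = [rng.standard_normal((hidden, ffn)) for _ in range(num_experts)]

# Expert parallelism: each rank owns num_experts / world_size whole experts,
# so a token routed to expert e is processed entirely on one rank.
experts_per_rank = num_experts // world_size
for rank in range(world_size):
    owned = range(rank * experts_per_rank, (rank + 1) * experts_per_rank)
    print(f"EP rank {rank}: owns whole experts {list(owned)}")

# Tensor parallelism: every rank holds a 1/world_size column slice of every
# expert, so all ranks cooperate on every expert's matmul (followed by a
# collective such as all-gather/all-reduce in a real system).
shard = ffn // world_size
for rank in range(world_size):
    slices = [w[:, rank * shard:(rank + 1) * shard] for w in experts]
    print(f"TP rank {rank}: holds a {slices[0].shape} slice of all experts")
```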
Hi @scv119, thanks for addressing my comments! I haven't actually completed the review yet. Will add more tonight or tomorrow morning.
Hi @dhritiman, thanks for trying out vLLM. Could you try `--tensor_parallel_size 1` and see if it works?
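For example, the same setting can be passed through the vLLM Python API (a minimal sketch; `facebook/opt-125m` is only a placeholder — substitute the model you were loading):
```python
# Minimal sketch, assuming the vLLM Python API; the model is a placeholder.
from vllm import LLM, SamplingParams

# tensor_parallel_size=1 disables tensor parallelism, which helps isolate
# whether the failure comes from the multi-GPU code path.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```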
Hi @hongxiayang, thanks for submitting this PR! Please let us know when the PR is ready for review.