Woosuk Kwon

Results 151 comments of Woosuk Kwon

@nearmax-p thanks for reporting it. Could you share how large your CPU memory is? It seems this bug occurs when there is not enough CPU memory. We haven't succeeded...

@nearmax-p Then it's very weird. We've tested the model on the exact same setup. Which type of disk are you using? And if possible, could you re-install vLLM and try...

@nearmax-p If you are using Docker, could you try increasing the shared memory size (e.g., to 64G)?

```bash
docker run --gpus all -it --rm --shm-size=64g nvcr.io/nvidia/pytorch:22.12-py3
```

Hi @mshumer Could you provide a reproducible example?

According to my experiments, the PR not only reduces the latency, but also increases the throughput by ~7%. Great work!

@scv119 BTW, I think the title of the PR is misleading; the PR changes the expert parallelism into tensor parallelism, which was the original implementation by Mistral AI.

Hi @scv119, thanks for addressing my comments! I haven't actually completed the review yet. Will add more tonight or tomorrow morning.

Hi @dhritiman, thanks for trying out vLLM. Could you try `--tensor_parallel_size 1` and see if it works?
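A minimal sketch of that single-GPU check, assuming the OpenAI-compatible server entrypoint and a placeholder model name (both illustrative choices, not from the original comment):

```shell
# Hypothetical invocation: run vLLM on a single GPU to rule out
# tensor-parallel (multi-GPU) issues. Substitute your own model.
python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m \
    --tensor-parallel-size 1
```

If this works but the multi-GPU run does not, the problem likely lies in the distributed setup (e.g., NCCL or shared-memory configuration) rather than in the model itself.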

Hi @hongxiayang, thanks for submitting this PR! Please let us know when the PR is ready for review.