Woosuk Kwon
@nearmax-p thanks for reporting it. Could you share how much CPU memory you have? It seems this bug occurs when there isn't enough CPU memory. We haven't succeeded...
@nearmax-p Then it's very weird. We've tested the model on exactly the same setup. Which type of disk are you using? And if possible, could you re-install vLLM and try...
@nearmax-p Thanks! That would be very helpful.
@nearmax-p If you are using Docker, could you try increasing the shared memory size (e.g., to 64G)? NCCL and PyTorch's multiprocessing use `/dev/shm` for inter-process communication, and Docker's default shared memory size is quite small:
```bash
docker run --gpus all -it --rm --shm-size=64g nvcr.io/nvidia/pytorch:22.12-py3
```
Hi @mshumer, could you provide a reproducible example?
According to my experiments, the PR not only reduces latency but also increases throughput by ~7%. Great work!
@scv119 BTW, I think the title of the PR is misleading; the PR changes expert parallelism into tensor parallelism, which was the original implementation by Mistral AI.
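For readers unfamiliar with the distinction, here is a minimal, self-contained sketch (not vLLM's actual implementation; all sizes and names are illustrative) of how the two schemes shard a Mixtral-style MoE layer: expert parallelism assigns whole experts to each rank, while tensor parallelism gives every rank a column slice of every expert.
```python
# Conceptual sketch only — not vLLM's code. All sizes are illustrative.
import numpy as np

num_experts, hidden, ffn = 8, 16, 32   # hypothetical layer dimensions
world_size = 4                          # number of GPUs (ranks)
rng = np.random.default_rng(0)
# One up-projection weight per expert (gating and down-projection omitted).
experts = [rng.standard_normal((hidden, ffn)) for _ in range(num_experts)]

# Expert parallelism: each rank owns num_experts / world_size whole experts,
# so a token routed to expert e is processed entirely on one rank.
experts_per_rank = num_experts // world_size
for rank in range(world_size):
    owned = range(rank * experts_per_rank, (rank + 1) * experts_per_rank)
    print(f"EP rank {rank}: owns whole experts {list(owned)}")

# Tensor parallelism: every rank holds a 1/world_size column slice of every
# expert, so all ranks cooperate on every expert's matmul (followed by a
# collective such as all-gather/all-reduce in a real system).
shard = ffn // world_size
for rank in range(world_size):
    slices = [w[:, rank * shard:(rank + 1) * shard] for w in experts]
    print(f"TP rank {rank}: holds a {slices[0].shape} slice of all experts")
```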
Hi @scv119, thanks for addressing my comments! I haven't actually completed the review yet. Will add more tonight or tomorrow morning.
Hi @dhritiman, thanks for trying out vLLM. Could you try `--tensor_parallel_size 1` and see if it works?
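For example, the same setting can be passed through the vLLM Python API (a minimal sketch; `facebook/opt-125m` is only a placeholder — substitute the model you were loading):
```python
# Minimal sketch, assuming the vLLM Python API; the model is a placeholder.
from vllm import LLM, SamplingParams

# tensor_parallel_size=1 disables tensor parallelism, which helps isolate
# whether the failure comes from the multi-GPU code path.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```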
Hi @hongxiayang, thanks for submitting this PR! Please let us know when the PR is ready for review.