Salomón Mejía
Use this tutorial to run it in a Docker container. It worked for me: https://github.com/vllm-project/vllm/issues/14452
I am using an RTX 5080 and a 5090, but both show the same error.
If you're looking for a smooth way to deploy vLLM with 4-bit quantization, here's a solid base setup using Docker, NVIDIA's optimized image and bitsandbytes. We're starting from this base...
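For context, here is a minimal sketch of what the vLLM side of that setup can look like once the container is up. The model name and memory settings below are placeholders, not part of the original setup, and the exact flags depend on your vLLM version:

```python
# Minimal sketch: load a model with bitsandbytes 4-bit quantization in vLLM.
# Model name and limits are placeholders; adjust for your GPU and vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    quantization="bitsandbytes",               # on-the-fly 4-bit quantization
    load_format="bitsandbytes",                # required on some vLLM versions
    max_model_len=4096,                        # keep the KV cache modest
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Explain 4-bit quantization in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```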
I downloaded it on a PC with free internet access and transferred it to the server with STT. That might help.
I had the same issue, but it is solved in https://github.com/vllm-project/vllm/issues/14452. You will need to use Docker to run it.
Hi @thangnguyenduc1-vti, how do you manage to fit an 8B model into 16 GB of VRAM? I understand that even with a dual-GPU setup, you don't have 32 GB of VRAM...
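For a rough sense of why 4-bit quantization matters here, a back-of-envelope estimate of the weight memory alone (a sketch that ignores KV cache, activations, and CUDA overhead, which add several more GB):

```python
# Back-of-envelope weight memory for an 8B-parameter model (weights only;
# KV cache, activations, and runtime overhead come on top of this).
params = 8e9
for label, bytes_per_param in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{label}: ~{gib:.1f} GiB")
# FP16:  ~14.9 GiB -> barely leaves room on a 16 GB card
# 8-bit:  ~7.5 GiB
# 4-bit:  ~3.7 GiB -> fits with room to spare for the KV cache
```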