
Results 51 comments of paolovic

Hi @manitadayon , is it possible that you hit your OOM error while capturing the CUDA graph? `enforce_eager=True` is a way to circumvent this particular OOM during CUDA graph...
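For reference, the equivalent of `enforce_eager=True` on the CLI is the `--enforce-eager` flag, which skips CUDA graph capture entirely. A minimal sketch (the model path and other flags are illustrative, not the exact command from this thread):

```shell
# Skip CUDA graph capture to avoid the capture-time OOM
# (model path is an example, not the reporter's exact setup)
vllm serve Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/ \
  --trust-remote-code \
  --enforce-eager
```

The trade-off is lower decode throughput, since every forward pass runs in eager mode instead of replaying a captured graph.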

alright, I'm quantizing nvidia/Llama-3_3-Nemotron-Super-49B-v1 to 4-bit GPTQ right now

Hi @manitadayon , nice, I was able to reproduce the error. Same machine, 2x Nvidia L40s, `vllm 0.8.3`. 1. V0 works as follows: ```bash CUDA_VISIBLE_DEVICES=0 VLLM_USE_V1=0 vllm serve Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/ --trust-remote-code...
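A sketch of the V0/V1 engine toggle being compared here; the flags after `--trust-remote-code` were truncated above, so everything beyond the environment variables is an assumption:

```shell
# Force the legacy V0 engine (the configuration that worked) on a single GPU.
# Setting VLLM_USE_V1=1, or leaving it unset on newer vLLM releases,
# selects the V1 engine, which triggered the reported error.
CUDA_VISIBLE_DEVICES=0 VLLM_USE_V1=0 vllm serve \
  Llama-3_3-Nemotron-Super-49B-v1-4bit-GPTQ/ \
  --trust-remote-code
```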

> Thank you. Oh, you are able to run the model on 1 GPU in V0 version (only 48GB memory)? (since you set the CUDA visible device to only 0)....

Hi @casper-hansen , alright, as soon as I have time for that, I will dig into it. First, I assume I'll have to master Ray. Thank you for the quick...

nice work, thank you @sapristi!

The same holds for me, as also described in https://github.com/vllm-project/vllm/issues/4416 When trying to load a GGUF model, e.g., https://huggingface.co/bartowski/reader-lm-1.5b-GGUF , vLLM requires a `config.json` although the new (?) GGUF...

> Hey @paolovic, > > Yes, this error occurs because vLLM is currently not looking for `.gguf` files inside the folder but instead assumes you pass the `model` as the...
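Based on that explanation, the workaround would be to pass the `.gguf` file itself as `model` rather than the containing folder. A sketch, where the quant filename and the `--tokenizer` repo are assumptions on my part (GGUF files don't ship a Hugging Face tokenizer, so vLLM needs one supplied separately):

```shell
# Point vLLM at the .gguf file directly, not the download directory.
# Filename and tokenizer repo are illustrative assumptions.
vllm serve ./reader-lm-1.5b-GGUF/reader-lm-1.5b-Q8_0.gguf \
  --tokenizer jinaai/reader-lm-1.5b
```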