Ruben Orduz

Results: 53 comments by Ruben Orduz

@geekwhocodes sure. If not me, someone on the team will. We're happy to help and enable community contributions.

@geekwhocodes thanks for this and apologies for the delay. @talagluck would you mind providing Ganesh with feedback about his suggested approach?

Hey @data-han, thanks for raising this issue. Let me dig in for a bit and I'll get back to you.

@drnic we ran into several issues when trying to run in a container, so we just built our own as follows. Dockerfile:

```
FROM centos:7
RUN yum update -y
RUN...
```

Although it side-steps the underlying issue, which still needs to be looked at and addressed: would it be plausible to split the file into smaller ones, say, 4 × ~450 GB each?
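As a sketch of the splitting idea, coreutils `split(1)` can do this without custom code. The snippet below uses a small demo file so it is runnable as-is; for the real case the chunk size would be `--bytes=450G` and the input the actual file (all file names here are hypothetical):

```shell
# Create a 1000-byte stand-in file (the real input would be the huge file).
printf 'x%.0s' {1..1000} > demo.bin

# Split into fixed-size chunks: demo.bin.part00, demo.bin.part01, demo.bin.part02
split --bytes=400 --numeric-suffixes --suffix-length=2 demo.bin demo.bin.part

# Reassemble later; the numeric suffixes preserve the order.
cat demo.bin.part?? > demo.bin.rejoined
cmp demo.bin demo.bin.rejoined && echo "round-trip OK"
```

`--numeric-suffixes` is a GNU coreutils option; on other platforms the default alphabetic suffixes (`aa`, `ab`, …) work the same way with `cat demo.bin.part*`.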

@girving Oh I agree. I don't think file size is the issue. But the auth tokens expire after ~24 hours and for some reason they aren't getting refreshed at some point....

> Because the weights themselves aren't fp8 quantized yet, Scout can only run with bf16 weights. However, we do support dynamic quantization of the KV cache via `--kv-cache-dtype fp8`. Team...
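Putting the pieces from this thread together, a serve invocation with dynamic KV-cache quantization enabled might look like the following. This is a sketch, not a verified command: the model name, host/port, and `VLLM_DISABLE_COMPILE_CACHE=1` are taken from the commands quoted below, and `--kv-cache-dtype fp8` from the comment above.

```shell
# bf16 weights (Scout isn't fp8-quantized yet); KV cache dynamically quantized to fp8
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --kv-cache-dtype fp8 \
    --host 0.0.0.0 --port 5000
```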

> I am facing similar problems, even though I am disabling compile as the [blog](https://blog.vllm.ai/2025/04/05/llama4.html) suggests.
>
> ## Command used
>
> `VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tokenizer meta-llama/Llama-4-Scout-17B-16E-Instruct --host "0.0.0.0" --port 5000...`

> > Remove quantization and quant format from command.
>
> If I do this, the model will be loaded in full precision, so I will need a bigger GPU,...

@yeqcharlotte the issue has been resolved.