Model Llama-4-Scout-17B-16E-Instruct GGUF fails to load with Multimodal support
Describe the bug
Loading the stated model together with an mmproj file fails. No error message is shown; it just hangs.
You'll hit the problem with either of these Unsloth quants of Llama-4-Scout-17B-16E-Instruct:
- https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/tree/main/UD-Q6_K_XL or
- https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/tree/main/UD-Q8_K_XL
You can use either of the mmproj files:
- https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/blob/main/mmproj-BF16.gguf or
- https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/blob/main/mmproj-F16.gguf
All files were verified via SHA256 hashes.
Is there an existing issue for this?
- [x] I have searched the existing issues
Reproduction
Follow the Multimodal WIKI:
- Select the GGUF model file in the dropdown on the Model page
- Select the F16 GGUF mmproj from the picker
- Click load to load the model
The same model loads fine if no mmproj file is selected.
If you watch VRAM and RAM, you'll see the model finish loading, but llama-server then seems to stall, and the model never announces it's done loading, so you can't chat. Running FileMon to see what's happening, I can see that once the main model file is done loading, it starts to read the mmproj file, reads about 25K of it, then closes the file and hangs.
This seems to happen with both quants listed and with both mmproj files listed. I have dual GPUs, a 5060 Ti and a 4060 Ti, but setting CUDA_VISIBLE_DEVICES in the shell to limit it to one GPU before running the .bat doesn't fix it.
My guess is that whatever reads the mmproj file doesn't like it and bails out without reporting the error.
Screenshot
No response
Logs
No error is reported on the screen. Loading starts normally, but never finishes.
System Info
Windows 10, Intel 13900K, 128GB RAM, 5060 Ti 16GB + 4060 Ti 16GB
I get the same issue with Mistral Small 24B in GGUF format. Loading freezes, but from VRAM usage it appears only the Mistral 24B model itself loads and the program gets stuck on the mmproj. I've installed via the portable, one-click, and manual (Python 3.13, 3.11, 3.10) methods and the problem persists.
W10, 3090, 9950X, 96GB DDR5
Looks like this is caused by a deadlock in the error output stream from the underlying llama.cpp process.
I tried to debug the issue by making this change to llama_cpp_server.py:
```python
logger.info(f"Using gpu_layers={shared.args.gpu_layers} | ctx_size={shared.args.ctx_size} | cache_type={cache_type}")

# Start the server with pipes for output
self.process = subprocess.Popen(
    cmd,
    # stderr=subprocess.PIPE,
    text=True,
    bufsize=1,
    env=env
)

# threading.Thread(target=filter_stderr_with_progress, args=(self.process.stderr,), daemon=True).start()
```
This change did surface more logging in the terminal, and with anticipation I tried to load the affected model with multimodal support so I could see the problem. And then... it worked!
So the likely issue is that with multimodal turned on, more output is sent to the stderr buffer, and the process deadlocks before it's read. This is in Ooba's file, not llama.cpp's. One option would be to log stderr to a temp file rather than filtering it through a pipe, to prevent the deadlock.
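For what it's worth, here's a minimal sketch of that temp-file option, reusing the `cmd`/`env` variables that `llama_cpp_server.py` already builds (the temp-file handling is illustrative only, not the project's actual fix):

```python
import subprocess
import tempfile

# Send stderr to a temp file instead of a pipe, so the child can never
# block on a full pipe buffer no matter how much it logs.
stderr_log = tempfile.NamedTemporaryFile(
    mode="w", prefix="llama-server-stderr-", suffix=".log", delete=False
)

self.process = subprocess.Popen(
    cmd,
    stderr=stderr_log,  # a real file descriptor, not subprocess.PIPE
    text=True,
    bufsize=1,
    env=env
)

# The output is still available for debugging afterwards, e.g.:
# print(open(stderr_log.name).read())
```

The trade-off is that the existing progress filtering would then have to tail the file instead of reading the pipe.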
I'm just going to leave it like that in my local copy since it works now.
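For anyone who wants to see the underlying failure mode in isolation, here's a small self-contained demo (not webui code, just an illustration of the deadlock pattern described above): a child process that writes heavily to a piped stderr that nobody reads will block once the OS pipe buffer fills.

```python
import subprocess
import sys
import time

# Child that spams stderr, roughly like llama-server with verbose
# multimodal logging enabled.
child_code = (
    "import sys\n"
    "for i in range(200000):\n"
    "    sys.stderr.write('x' * 80 + '\\n')\n"
    "sys.stdout.write('done\\n')\n"
)

p = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stderr=subprocess.PIPE,  # piped, but never read below
    text=True,
)

time.sleep(5)
# With nothing draining p.stderr, the child stalls once the pipe buffer
# is full, so it never finishes writing.
print("child still running:", p.poll() is None)  # expected: True
p.kill()
```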
@CalculonPrime thanks a lot for the detailed information. That was a subtle bug. I think that it should be resolved after https://github.com/oobabooga/text-generation-webui/commit/8be798e15f48bd1f498d2c609ddf2f31cf22524b