
Model Llama-4-Scout-17B-16E-Instruct GGUF fails to load with Multimodal support

Open CalculonPrime opened this issue 4 months ago • 3 comments

Describe the bug

Loading the stated model together with an mmproj file fails. No error message is shown; it just hangs.

You'll hit the problem with either of these Unsloth quants of Llama-4-Scout-17B-16E-Instruct:

  • https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/tree/main/UD-Q6_K_XL or
  • https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/tree/main/UD-Q8_K_XL

You can use either of the mmproj files:

  • https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/blob/main/mmproj-BF16.gguf or
  • https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/blob/main/mmproj-F16.gguf

All files were verified via SHA256 hashes.

Is there an existing issue for this?

  • [x] I have searched the existing issues

Reproduction

Follow the steps in the Multimodal wiki:

  • Select the GGUF model file in the dropdown on the Model page
  • Select the F16 GGUF mmproj from the picker
  • Click Load to load the model

The same model loads fine if no mmproj file is selected.

If you watch VRAM and RAM, you'll see the model finish loading, but llama-server then seems to stall, and the model never announces that it's done loading, so you can't chat. Running FileMon to see what's happening shows that once the main model file finishes loading, it starts to load the mmproj file, reads about 25 KB of it, then closes the file and hangs.

This seems to happen with both quants listed and with both mmproj files listed. I have dual GPUs, a 5060 Ti and a 4060 Ti, but setting CUDA_VISIBLE_DEVICES in the shell before running the .bat to limit the run to one GPU doesn't fix it.

My guess is that whatever reads the mmproj file doesn't like it and bails out without reporting the error.

Screenshot

No response

Logs

No error is reported on the screen. Loading starts normally but never finishes.

System Info

Windows 10, Intel 13900K, 128GB RAM, 5060 Ti 16GB + 4060 Ti 16GB

CalculonPrime avatar Aug 16 '25 10:08 CalculonPrime

I get the same issue with Mistral Small 24B in GGUF format. Loading freezes; judging from VRAM usage, only the Mistral 24B model itself loads and the program gets stuck on the mmproj. I've installed via the portable, one-click, and manual (Python 3.13, 3.11, 3.10) methods and the problem persists.

W10, 3090, 9950X, 96GB DDR5

Holigraphidrome avatar Aug 18 '25 18:08 Holigraphidrome

Looks like this is caused by a deadlock on the error output stream of the underlying llama.cpp server process.

I tried to debug the issue by making this change to llama_cpp_server.py:

       logger.info(f"Using gpu_layers={shared.args.gpu_layers} | ctx_size={shared.args.ctx_size} | cache_type={cache_type}")
        # Start the server with pipes for output
        self.process = subprocess.Popen(
            cmd,
#            stderr=subprocess.PIPE,`
            text=True,
            bufsize=1,
            env=env
        )

#        threading.Thread(target=filter_stderr_with_progress, args=(self.process.stderr,), daemon=True).start()

This change showed more logging in the terminal, and with anticipation I tried to load the affected model with multimodal support so I could see the problem. And then ... it worked!

So the likely issue is that with multimodal turned on, more output is sent to the stderr buffer, and the child process deadlocks before it is read. This is in text-generation-webui's file, not llama.cpp's. One option would be to log stderr to a temporary file rather than filtering it through a pipe, which would prevent the deadlock (rough sketch below).
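
A minimal sketch of that option, assuming `cmd` and `env` stand in for the command list and environment that llama_cpp_server.py already builds (names here are placeholders, not the actual fix that was merged):

    import subprocess
    import tempfile

    # Sketch: send llama-server's stderr to a temp file instead of a pipe,
    # so the OS pipe buffer can never fill up and block the child process.
    stderr_log = tempfile.NamedTemporaryFile(
        mode="w", prefix="llama_server_stderr_", suffix=".log", delete=False
    )
    process = subprocess.Popen(
        cmd,
        stderr=stderr_log,  # a real file handle, not subprocess.PIPE
        text=True,
        bufsize=1,
        env=env,
    )
    # The file at stderr_log.name can be tailed or parsed for progress afterwards.

The trade-off is that progress filtering would have to poll the log file instead of streaming the pipe, but nothing can block on a full pipe buffer.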

I'm just going to leave it like that in my local copy since it works now.

CalculonPrime avatar Aug 23 '25 15:08 CalculonPrime

@CalculonPrime thanks a lot for the detailed information. That was a subtle bug. I think that it should be resolved after https://github.com/oobabooga/text-generation-webui/commit/8be798e15f48bd1f498d2c609ddf2f31cf22524b

oobabooga avatar Aug 24 '25 19:08 oobabooga