NanoLLM
Support for LLaVA v1.6?
I am trying to use the model llava-v1.6-mistral-7b-hf in the text-generation-webui demo, but I am getting errors. The last few lines of the error message look like this:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py:1119 in from_pretrained

    1118        except KeyError:
    1119            raise ValueError(
    1120                f"The checkpoint you are trying to load has model type {config_dict
ValueError: The checkpoint you are trying to load has model type llava_next but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
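In case it's relevant, here is a quick diagnostic I can run inside the container to check which Transformers version is installed and whether it registers the llava_next model type (just a sketch; CONFIG_MAPPING_NAMES is an internal mapping, so this is only for a quick check):

# diagnostic only: does this Transformers build know about llava_next?
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES
print(transformers.__version__)
print("llava_next" in CONFIG_MAPPING_NAMES)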
I suspect that's because LLaVA v1.6 is not yet supported on Jetson AGX Orin. Is that the case?
Thanks
@rsun-bdti Llava-1.6 had been working, but it looks like they updated the model_type to llava_next, and the parsing/config patching needs to be updated here:
https://github.com/dusty-nv/NanoLLM/blob/93b8af0c6dbebb4482ad5d5d4bc9d923a6702ccd/nano_llm/nano_llm.py#L304
For now, you can try manually changing the model_type in the model's config.json from llava_next to llava and see if that helps.
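For example, something along these lines (just a sketch; the config.json path below is an example, so point it at wherever the checkpoint was actually downloaded):

# sketch: switch model_type from "llava_next" back to "llava" in the checkpoint's config.json
# (example path -- adjust to the actual downloaded model directory)
import json
config_path = "/data/models/text-generation-webui/llava-hf_llava-v1.6-mistral-7b-hf/config.json"
with open(config_path) as f:
    config = json.load(f)
config["model_type"] = "llava"   # was "llava_next"
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)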
@dusty-nv Thanks. I made the suggested change. Now it's giving me "TypeError: llava isn't supported yet".
I tried two commands:
jetson-containers run.sh --workdir=/opt/text-generation-webui $(autotag text-generation-webui) python3 server.py --listen --model-dir /data/models/text-generation-webui --model llava-hf_llava-v1.6-mistral-7b-hf --multimodal-pipeline llava-v1.6-7b --loader autogptq --disable_exllama --verbose
jetson-containers run.sh --workdir=/opt/text-generation-webui $(autotag text-generation-webui) python3 server.py --listen --model-dir /data/models/text-generation-webui --model llava-hf_llava-v1.6-mistral-7b-hf --multimodal-pipeline llava-v1.5-7b --loader autogptq --disable_exllama --verbose
Note that the only difference is --multimodal-pipeline llava-v1.6-7b versus --multimodal-pipeline llava-v1.5-7b.
Before I changed config.json, if I used --multimodal-pipeline llava-v1.6-7b, I got the error message:
RuntimeError: Multimodal - ERROR: Failed to load multimodal pipeline "llava-v1.6-7b", available pipelines are: ['llava-7b', 'llava-13b', 'llava-llama-2-13b', 'llava-v1.5-13b', 'llava-v1.5-7b']. Please specify a correct pipeline, or disable the extension
Thanks
I tried the same model in the live streaming demo, with the command below:
python3 -m nano_llm.agents.video_query --api=mlc --model llava-hf/llava-v1.6-mistral-7b-hf --max-context-len 1024 --max-new-tokens 64 --video-input /dev/video0 --video-output webrtc://localhost:8550/output
The demo crashed with the last few lines of the error message like this:

...
RuntimeError: subprocess has an invalid initialization status (<class 'subprocess.CalledProcessError'>)
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/NanoLLM/nano_llm/plugins/process_proxy.py", line 132, in run_process
    raise error
  File "/opt/NanoLLM/nano_llm/plugins/process_proxy.py", line 126, in run_process
    self.plugin = ChatQuery(**kwargs)
  File "/opt/NanoLLM/nano_llm/plugins/chat_query.py", line 70, in __init__
    self.model = NanoLLM.from_pretrained(model, **kwargs)
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 73, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 59, in __init__
    quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 278, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/llava-v1.6-mistral-7b-hf --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 1024 --artifact-path /data/models/mlc/dist/llava-v1.6-mistral-7b-hf-ctx1024 --use-safetensors ' returned non-zero exit status 1.
...
Strangely, when I tried a different LLaVA v1.6 model, the live streaming demo ran as expected. Now I am not so sure if it's a LLaVA v1.6 support issue or not. The working command is below:
python3 -m nano_llm.agents.video_query --api=mlc --model liuhaotian/llava-v1.6-vicuna-7b --max-context-len 1024 --max-new-tokens 64 --video-input /dev/video0 --video-output webrtc://localhost:8550/output
My docker container has the following info:
Namespace(packages=['nano_llm'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.2.0  JETPACK_VERSION=6.0  CUDA_VERSION=12.2
-- Finding compatible container image for ['nano_llm']
dustynv/nano_llm:24.5-r36.2.0
Ok yeah, I haven't confirmed the mistral one, and unlike the vicuna-based ones I haven't listed it as tested. The actual errors from it are probably higher up in the log. As for oobabooga, I stopped keeping up with that because it was quite a pain. However, I think ollama and Open WebUI have multimodal support for llava GGUF that works.
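If it helps to isolate it, you can re-run the quantization step from your traceback by itself inside the container; that should print the actual Python error from mlc_llm.build (command copied verbatim from the log above):

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/llava-v1.6-mistral-7b-hf --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 1024 --artifact-path /data/models/mlc/dist/llava-v1.6-mistral-7b-hf-ctx1024 --use-safetensors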
Just in case it helps, the newer llama3-based lmms-lab/llama3-llava-next-8b (https://huggingface.co/lmms-lab/llama3-llava-next-8b) works out of the box — but you have to add --chat-template=llama-3 (or set it programmatically; auto-detection doesn't work for this model).
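For example, adapting the video_query command used earlier in this thread (same video input/output settings, just swapping the model and adding the chat template):

python3 -m nano_llm.agents.video_query --api=mlc --model lmms-lab/llama3-llava-next-8b --chat-template=llama-3 --max-context-len 1024 --max-new-tokens 64 --video-input /dev/video0 --video-output webrtc://localhost:8550/output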
@ai-and-i I will give that a try. Thanks!