text-generation-webui LLaVA support

OK, multimodality is here. To support LLaVA I created an extension. I could move it to a separate extension-only repo, but I still needed text-generation-webui to support overriding input_ids/input_embeds. While I was at it, I changed extension handling a bit (there should be no need to update existing extensions; the changes are mostly on the backend).

To try it:

Download my 4-bit quant from huggingface (I haven't tested the non-quantized version; it works on a 3090, and it may even fit in 12GB of VRAM)
Run the webui with my extension enabled: python3 server.py --model llava-13b-4bit-128g --wbits 4 --group 128 --chat --model_type=llama --extensions llava
Select LLaVA in instruct mode (it should also work in chat, but the template is for instruct)
Add "\n###" to custom stopping strings
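
For readers wondering what the custom stopping string does, here is a minimal sketch, assuming the reply is simply cut at the first occurrence of any stopping string. `truncate_at_stop` is a hypothetical helper written for illustration, not the webui's actual implementation.

```python
def truncate_at_stop(reply: str, stopping_strings: list[str]) -> str:
    """Cut `reply` at the earliest occurrence of any stopping string.

    Hypothetical helper: it only illustrates the idea of a stopping string.
    """
    cut = len(reply)
    for stop in stopping_strings:
        idx = reply.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return reply[:cut]


# Made-up raw output in which the model starts a new "###" turn on its own.
raw = "The man is ironing clothes on the back of a car.\n### Human: thanks"
print(truncate_at_stop(raw, ["\n###"]))
```

Without the stopping string, the text after "\n###" would be shown as part of the answer.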

Here's a video of it in action:

https://user-images.githubusercontent.com/3718215/233817203-69b57e77-0c55-4fd6-b742-3204bb13b8fc.mp4

Apr 23 '23 02:04 Wojtab

BTW: don't merge it yet!

If it should be merged as a built-in extension, I want to clean up script.py first (it's feature complete, but could use some work); if it shouldn't be merged as a built-in extension, I need to remove script.py.

Apr 23 '23 04:04 Wojtab

I tried it, and it doesn't look like you can talk with the model without an image; you have to give it one to get it going. That's a shame, because I've heard that training the Vicuna model with pictures made it smarter, and I wanted to try it out with regular chat.

Apr 23 '23 04:04 BadisG

will it work with ggml models?

Apr 23 '23 06:04 x-legion

Gradio HTTP request redirected to localhost :)
Loading llava-13b-4bit-128g...
Could not find the quantized model in .pt or .safetensors format, exiting...

Done!
Press any key to continue . . .

Probably because download-model.bat automatically named it wojtab_llava-13b-v0-4bit-128g
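
A minimal sketch of the naming mismatch, assuming the downloader derives the folder name by replacing the "/" in the Hugging Face repo id with "_". That rule is inferred only from the folder name reported here; `local_model_dir` is a hypothetical helper, not the downloader's code.

```python
def local_model_dir(repo_id: str) -> str:
    """Folder name under models/ for a repo id (assumed rule: '/' becomes '_')."""
    return repo_id.replace("/", "_")


# The value to pass to --model is the folder name, not the short name
# used in the original instructions.
print(local_model_dir("wojtab/llava-13b-v0-4bit-128g"))
```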

Apr 23 '23 09:04 CarlKenner

Using wojtab_llava-13b-v0-4bit-128g instead of llava-13b-4bit-128g, I'm now getting this error:

Gradio HTTP request redirected to localhost :)
Loading wojtab_llava-13b-v0-4bit-128g...
Found the following quantized model: models\wojtab_llava-13b-v0-4bit-128g\llava-13b-v0-4bit-128g.safetensors
Traceback (most recent call last):
  File "D:\AI\oobabooga-windows\text-generation-webui\server.py", line 921, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "D:\AI\oobabooga-windows\text-generation-webui\modules\models.py", line 148, in load_model
    model = load_quantized(model_name)
  File "D:\AI\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 176, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
  File "D:\AI\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 44, in _load_quant
    model = AutoModelForCausalLM.from_config(config)
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\auto\auto_factory.py", line 411, in from_config
    return model_class._from_config(config, **kwargs)
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 1146, in _from_config
    model = cls(config, **kwargs)
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 614, in __init__
    self.model = LlamaModel(config)
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 445, in __init__
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 445, in <listcomp>
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 255, in __init__
    self.self_attn = LlamaAttention(config=config)
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 178, in __init__
    self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\linear.py", line 96, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 52428800 bytes.

Done!
Press any key to continue . . .

Apr 23 '23 09:04 CarlKenner

Added more virtual memory in Windows' Advanced System Settings, by also using my second hard drive for virtual memory. Now I get this error instead:

Gradio HTTP request redirected to localhost :)
Loading wojtab_llava-13b-v0-4bit-128g...
Found the following quantized model: models\wojtab_llava-13b-v0-4bit-128g\llava-13b-v0-4bit-128g.safetensors
Loading model ...
Done.
Traceback (most recent call last):
  File "D:\AI\oobabooga-windows\text-generation-webui\server.py", line 921, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "D:\AI\oobabooga-windows\text-generation-webui\modules\models.py", line 148, in load_model
    model = load_quantized(model_name)
  File "D:\AI\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 197, in load_quantized
    model = model.to(torch.device('cuda:0'))
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 1896, in to
    return super().to(*args, **kwargs)
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1145, in to
    return self._apply(convert)
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 844, in _apply
    self._buffers[key] = fn(buf)
  File "D:\AI\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.42 GiB already allocated; 0 bytes free; 3.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Done!
Press any key to continue . . .

Apr 23 '23 09:04 CarlKenner

To create a public link, set share=True in launch().
Traceback (most recent call last):
  File "/home/cybertimon/miniconda3/lib/python3.10/site-packages/gradio/routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/cybertimon/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "/home/cybertimon/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 898, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/cybertimon/.local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/cybertimon/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/cybertimon/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/cybertimon/miniconda3/lib/python3.10/site-packages/gradio/utils.py", line 549, in async_iteration
    return next(iterator)
  File "/home/cybertimon/Repositorys/text-generation-webui/modules/chat.py", line 222, in cai_chatbot_wrapper
    for history in chatbot_wrapper(text, state):
  File "/home/cybertimon/Repositorys/text-generation-webui/modules/chat.py", line 154, in chatbot_wrapper
    for reply in generate_reply(f"{prompt}{' ' if len(cumulative_reply) > 0 else ''}{cumulative_reply}", state, eos_token=eos_token, stopping_strings=stopping_strings):
  File "/home/cybertimon/Repositorys/text-generation-webui/modules/text_generation.py", line 225, in generate_reply
    question, input_ids, inputs_embeds = apply_extensions('tokenizer', state, question, input_ids, None)
  File "/home/cybertimon/Repositorys/text-generation-webui/modules/extensions.py", line 91, in apply_extensions
    return EXTENSION_MAP[typ](*args, **kwargs)
  File "/home/cybertimon/Repositorys/text-generation-webui/modules/extensions.py", line 74, in _apply_tokenizer_extensions
    prompt, input_ids, input_embeds = getattr(extension, function_name)(state, prompt, input_ids, input_embeds)
  File "/home/cybertimon/Repositorys/text-generation-webui/extensions/llava/script.py", line 172, in tokenizer_modifier
    new_input_embeds.append(cur_new_input_embeds)
UnboundLocalError: local variable 'cur_new_input_embeds' referenced before assignment

I think it works except for this error. I can load the model and talk to it, but when I select an image, I get the traceback above.

Apr 23 '23 11:04 CyberTimon

I'm able to use this with 8 GB VRAM (Geforce 3060 Ti) with the following arguments:

python server.py --model llava-13b-4bit-128g --wbits 4 --group 128 --chat --model_type=llama --extensions llava --pre_layer 29

Reduce "Max prompt size in tokens" to 500 or less, otherwise you'll get OOM errors after the first response.

After further testing I found it's best to use 0 Max Prompt Size, otherwise there's a chance it will respond to multiple images at once. Although the Max Prompt Size setting is 0, the model still gets around 360 tokens of context each time. I guess this must be hard coded.

Edit: The merged version runs out of memory on my GPU. I added this to settings.json as suggested by the author, and it's working again:

"llava-clip_device": "cpu",
"llava-projector_device": "cpu"

Apr 23 '23 13:04 jparmstr

@BadisG - I added a commit ~30 mins before your last message which fixed it; you could've pulled before that
@faisalhr1997 - maybe, but you will need to convert the llama part of LLaVA to ggml, I haven't tried it
@CarlKenner - looks like both errors are OOM: the first one on the CPU, the second one on the GPU. Try vicuna-13b without the llava extension first, to see whether it is caused by my changes or by something in your setup
@CyberTimon - I think it could've happened if the prompt had a truncated image, try the most recent version
@jparmstr - nice, as for the prompt - the default template has 101 tokens, and each image takes up 258 tokens (2 for start/end, and 256 of actual image embeddings)
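
As a worked check of the token counts quoted for the default template and for each image, here is a small sketch of the arithmetic. The constant and function names are made up for illustration; only the numbers come from the thread.

```python
TEMPLATE_TOKENS = 101        # default template, as stated in the thread
IMAGE_MARKER_TOKENS = 2      # one start token and one end token per image
IMAGE_PATCH_TOKENS = 256     # image embeddings per image
TOKENS_PER_IMAGE = IMAGE_MARKER_TOKENS + IMAGE_PATCH_TOKENS  # 258


def prompt_overhead(num_images: int) -> int:
    """Tokens used before any user text is counted (illustrative helper)."""
    return TEMPLATE_TOKENS + num_images * TOKENS_PER_IMAGE


for n in range(3):
    print(n, "image(s):", prompt_overhead(n), "tokens")
```

With a single image this gives 101 + 258 = 359 tokens, which is consistent with the "around 360 tokens" of context reported earlier for a Max Prompt Size of 0.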

Apr 23 '23 14:04 Wojtab

I can chat with the model without an image but as soon as I enter an image and prompt it, it crashes:

"Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory Aborted"

WSL2 installation, ubuntu 22.04. RTX 4090 and plenty of VRAM left unused.

Apr 23 '23 14:04 jepjoo

I can chat with the model without an image but as soon as I enter an image and prompt it, it crashes:

"Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory Aborted"

WSL2 installation, ubuntu 22.04. RTX 4090 and plenty of VRAM left unused.

https://discuss.pytorch.org/t/libcudnn-cnn-infer-so-8-library-can-not-found/164661

Apr 23 '23 15:04 Wojtab

@jparmstr - nice, as for the prompt - default template has 101 tokens, and each image takes up 258 tokens (2 for start/end, and 256 of actual image embeddings)

That makes sense, I figured the image must take some number of tokens.

I did notice that even with context length 0, the model responds to my questions. For example "What is funny about this image?" it will start with "This image is funny because". I wonder how it's ingesting my prompt when I don't leave any space for it in the context.

Apr 23 '23 15:04 jparmstr

I can chat with the model without an image but as soon as I enter an image and prompt it, it crashes: "Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory Aborted" WSL2 installation, ubuntu 22.04. RTX 4090 and plenty of VRAM left unused.

https://discuss.pytorch.org/t/libcudnn-cnn-infer-so-8-library-can-not-found/164661

Thanks a lot, that fixed it! Should have googled this harder myself...

Apr 23 '23 15:04 jepjoo

OK, at this point it's cleaned up to where I wanted it, so it could maybe get merged. Also, I added the option to run CLIP/projector on the CPU (or at 32-bit on CUDA, which is now the default). To run them on the CPU, add:

    "llava-clip_device": "cpu",
    "llava-projector_device": "cpu"

to settings.json. To run at 16-bit on CUDA (the old behaviour), add:

    "llava-clip_bits": 16,
    "llava-projector_bits": 16

Clip doesn't look like it supports run_in_8bit, and I feel like the projector doesn't need it, so there is only 16/32 bit (and 32 bit only for CPU).
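
The valid combinations can be summarized in a short sketch. `llava_settings` is a hypothetical helper that only builds the settings.json fragment; the four key names are the ones given above, and the rules encoded are the ones stated in this comment (16 or 32 bit only, CPU at 32 bit only). It is not code from the extension.

```python
import json


def llava_settings(device: str = "cuda", bits: int = 32) -> dict:
    """Build the llava-related settings.json entries (illustrative only)."""
    if bits not in (16, 32):
        raise ValueError("only 16-bit and 32-bit are supported")
    if device == "cpu" and bits != 32:
        raise ValueError("CPU execution is 32-bit only")

    settings = {}
    if device == "cpu":
        settings["llava-clip_device"] = "cpu"
        settings["llava-projector_device"] = "cpu"
    if bits == 16:
        settings["llava-clip_bits"] = 16
        settings["llava-projector_bits"] = 16
    # 32-bit on CUDA is the default, so no entries are needed for it.
    return settings


print(json.dumps(llava_settings(device="cpu"), indent=4))
print(json.dumps(llava_settings(device="cuda", bits=16), indent=4))
```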

@jparmstr - you might be able to squeeze some more tokens with CPU CLIP. As for the prompt, you can add print(prompt) after print(f'Embedded {total_embedded} image(s) in {time.time()-start_ts:.2f}s') in script.py

Apr 23 '23 18:04 Wojtab

(base) cybertimon@server:~/Repositorys/text-generation-webui$ python3 server.py --model llava-13b-4bit-128g --gpu-memory 12 --wbits 4 --model_type llama --groupsize 128 --listen-host 0.0.0.0 --listen --xformers --extension llava --chat --listen-port 21129
Gradio HTTP request redirected to localhost :)
Loading settings from settings.json...
Loading llava-13b-4bit-128g...
Found the following quantized model: models/llava-13b-4bit-128g/pytorch_model.safetensors
Loading model ...
Done.
Using the following device map for the quantized model: {'': 0}
Replaced attention with xformers_attention
Loaded the model in 4.27 seconds.
Loading the extension "llava"... Ok.
Loading the extension "gallery"... Ok.
{'add_all_images_to_prompt': False, 'clip_device': None, 'clip_bits': 32, 'projector_device': None, 'projector_bits': 32}
cuda:0 torch.float32
cuda:0 torch.float32
Running on local URL: http://0.0.0.0:21129

To create a public link, set share=True in launch().
Embedded 0 image(s) in 0.99s

I get embedded 0 images in 0.99s. Maybe this is the problem from earlier. Also it answers only: 88888888....

Apr 23 '23 18:04 CyberTimon

Also, when I change the settings to use cpu:

{'add_all_images_to_prompt': False, 'clip_device': 'cpu', 'clip_bits': 32, 'projector_device': 'cpu', 'projector_bits': 32}
cpu torch.float32
cpu torch.float32

I still get only 888888 as answer.

Apr 23 '23 19:04 CyberTimon

@CyberTimon remove settings.json, then restart webui, clear the history, and try with this image: https://github.com/haotian-liu/LLaVA/blob/main/llava/serve/examples/extreme_ironing.jpg, with "What is unusual about this image?" prompt, exactly as in my video. If it still gives a garbage output, then try if this model works for you(without llava extension enabled): https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g

Apr 23 '23 19:04 Wojtab

Very impressive @Wojtab, I'll try to review and merge it soon. Quick question: I remember reading on the LLaVA README that a custom version of transformers was needed. How did you get it working with the standard transformers?

Apr 23 '23 19:04 oobabooga

@oobabooga If you load the original LLaVA on standard transformers it works, but instead of loading the entire model, it just loads the LLaMA part, so it can be used for text-based inference without any modifications. The modified transformers add the image input, then the projector, and then feed the embeddings to the standard finetuned LLaMA. As there were no modifications to the LLaMA architecture, I load it as a standard model, in standard transformers, and just use custom embeddings: I run the image->CLIP->projector pipeline myself in LLaVAEmbedder, instead of in modified transformers
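
To make the "custom embeddings" step concrete, here is a minimal, framework-free sketch of the final merge only: rows of a token-embedding sequence that sit at image placeholder positions are replaced by already-projected image embeddings. CLIP and the projector are not modeled, vectors are plain Python lists, and `splice_image_embeddings` and the placeholder id 99 are hypothetical names chosen for illustration. This is not the code in LLaVAEmbedder.

```python
from typing import List

Vector = List[float]


def splice_image_embeddings(
    token_ids: List[int],
    token_embeds: List[Vector],
    image_embeds: List[Vector],
    placeholder_id: int,
) -> List[Vector]:
    """Replace each placeholder-token row with the next image row, in order."""
    if len(token_ids) != len(token_embeds):
        raise ValueError("one embedding row is required per token")
    if token_ids.count(placeholder_id) != len(image_embeds):
        raise ValueError("placeholder count must match image embedding rows")

    image_rows = iter(image_embeds)
    return [
        next(image_rows) if tid == placeholder_id else emb
        for tid, emb in zip(token_ids, token_embeds)
    ]


# Toy data: 2-dimensional embeddings, two placeholder positions (id 99).
ids = [1, 7, 99, 99, 8]
text_rows = [[float(t), float(t)] for t in ids]   # stand-in for the embedding table
image_rows = [[0.5, 0.5], [0.25, 0.25]]           # stand-in for projector output
mixed = splice_image_embeddings(ids, text_rows, image_rows, placeholder_id=99)
print(mixed)
```

The merged sequence is what would be handed to the unmodified language model in place of embeddings looked up from token ids alone.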

Apr 23 '23 19:04 Wojtab

@CyberTimon remove settings.json, then restart webui, clear the history, and try with this image: https://github.com/haotian-liu/LLaVA/blob/main/llava/serve/examples/extreme_ironing.jpg, with "What is unusual about this image?" prompt, exactly as in my video. If it still gives a garbage output, then try if this model works for you(without llava extension enabled): https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g

You're a hero! Works perfectly now. I had to delete the settings.json

Apr 23 '23 19:04 CyberTimon

Oh I saw what the issue was. When selecting max_new_tokens over 1600 it generates only garbage.

Apr 23 '23 19:04 CyberTimon

It worked for me, but I had to use the tokenizer files that come with wojtab/llava-13b-v0-4bit-128g instead of the generic LLaMA tokenizer described here. The web UI has the option of loading the same tokenizer for all LlamaForCausalLM models from models/llama-tokenizer, as a way of ensuring that the files are up to date (many models on Hugging Face use outdated tokenizer files). Does LLaVA use a custom tokenizer?

[Image comparison: Base LLaMa tokenizer vs. wojtab/llava-13b-v0-4bit-128g tokenizer]

The modifications to the extensions framework look good to me and are highly appreciated, thanks for taking the time to read the existing code base in detail.
About merging/not merging script.py itself into the repository: I think that this is a good example that future extensions can use as a starting point, so I vote for merging it.

Apr 23 '23 20:04 oobabooga

Regarding tokenizer: there are 4 new tokens, so I don't think the generic one will work:

{
  "<im_end>": 32002,
  "<im_patch>": 32000,
  "<im_start>": 32001,
  "[PAD]": 32003
}

IMO we can merge it here now; give me about 30 minutes and I'll add a description of the extension. I just fixed the issue @CyberTimon had: the image could be truncated in the middle. It is still broken in the vast majority of cases, unless the prompt is like that: but at least there is a warning in the logs, so maybe there won't be 20 issues about it. (BTW, you can set the image placement inside the prompt by adding <image>.)
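
A minimal sketch of how a truncated image could be detected in a list of token ids, using the ids of the new tokens listed above. It assumes an image is encoded as `<im_start>`, a run of `<im_patch>` tokens, then `<im_end>`; `has_truncated_image` is a hypothetical check written for illustration, not the one in script.py.

```python
IM_PATCH, IM_START, IM_END = 32000, 32001, 32002  # ids from the tokenizer listing


def has_truncated_image(input_ids) -> bool:
    """True if any image span is cut off at either end of the sequence."""
    inside = False
    for tid in input_ids:
        if tid == IM_START:
            if inside:            # a second start before the previous end
                return True
            inside = True
        elif tid == IM_END:
            if not inside:        # end marker whose start was cut away
                return True
            inside = False
        elif tid == IM_PATCH and not inside:
            return True           # patch tokens outside any image span
    return inside                 # a start marker whose end was cut away


full = [1, IM_START] + [IM_PATCH] * 256 + [IM_END, 5]
print(has_truncated_image(full))         # complete image span
print(has_truncated_image(full[-100:]))  # prompt truncated from the left
```

A check like this is enough to emit a warning; deciding what to do with the partial image is a separate question.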

Apr 23 '23 21:04 Wojtab

A comment: the Extensions doc page says

Additionally, the extension can set value to be a callback, in the form of def cb(text: str, visible_text: str) -> [str, str]. See the send_pictures extension above for an example.

But the send_pictures extension does not use a callback.
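
For illustration, a callback of the quoted shape could look like the sketch below. It only demonstrates the signature (internal text and visible text in, a pair out); `make_suffix_callback` is a hypothetical factory, the return annotation uses `Tuple[str, str]` in place of the doc's `[str, str]`, and how the web UI invokes the callback is not shown here.

```python
from typing import Callable, Tuple


def make_suffix_callback(hidden_suffix: str) -> Callable[[str, str], Tuple[str, str]]:
    """Return a callback that changes only the internal text."""

    def cb(text: str, visible_text: str) -> Tuple[str, str]:
        # The internal text gets extra content; the visible text stays as typed.
        return text + hidden_suffix, visible_text

    return cb


cb = make_suffix_callback(" [hidden context]")
print(cb("What is in this picture?", "What is in this picture?"))
```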

Apr 23 '23 21:04 oobabooga

@oobabooga ok, I added the docs, and also reworded it in Extensions. One more change: I set it to auto-recognize LLaVA as a llama-based model

Apr 23 '23 22:04 Wojtab

I have removed this addition because I found it unnecessary, as the chatbot_wrapper function already updates the history (if that was a mistake, please let me know and I'll revert it).

         yield chat_html_wrapper(shared.history['visible'], state['name1'], state['name2'], state['mode'])
     else:
         # Yield ' ...'
-        last_visible_user = shared.history['visible'][-1][0]
         yield chat_html_wrapper(shared.history['visible'][:-1] + [[shared.history['visible'][-1][0], shared.history['visible'][-1][1] + ' ...']], state['name1'], state['name2'], state['mode'])
         for history in chatbot_wrapper(shared.history['internal'][-1][0], state, _continue=True):
-            shared.history['visible'][-1] = [last_visible_user, history[-1][1]]
             yield chat_html_wrapper(shared.history['visible'], state['name1'], state['name2'], state['mode'])

Also made some minor changes and improvements. Thanks for submitting this PR, I would never have come up with the LLaVA adaptation on my own and the reworked extensions framework is a huge improvement to this project.

Apr 23 '23 23:04 oobabooga

For reference, these are the commands to download and run the model:

python download-model.py wojtab/llava-13b-v0-4bit-128g
python3 server.py --model wojtab_llava-13b-v0-4bit-128g --chat  --extensions llava

VRAM usage peaked at 11106MiB for a single generation.

Apr 23 '23 23:04 oobabooga

@oobabooga thanks for the review and merge. This addition was necessary, though: for some reason, continue replaces visible_text with the internal text on the message from the user, so instead of being <img src="data:image/jpeg;base64,{base64string}">, the visible text becomes the internal representation: <image:{base64string}>. Thinking about it now, it might've been a stupid idea to separate them, as I can parse both of them just as easily; but as it stands, if you stop the prompt and then click continue, the image will disappear for the user
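
The two representations can be converted into each other with a single regular expression, which is what "I can parse both of them as easily" amounts to. The sketch below goes from the visible form to the internal one. It assumes the visible tag is closed with `">` and that the payload is plain base64; `visible_to_internal` is a hypothetical helper, not code from the extension.

```python
import re

# Visible form: <img src="data:image/jpeg;base64,PAYLOAD">
VISIBLE_IMG = re.compile(r'<img src="data:image/jpeg;base64,([A-Za-z0-9+/=]+)">')


def visible_to_internal(visible_text: str) -> str:
    """Rewrite every visible image tag as the internal <image:PAYLOAD> form."""
    return VISIBLE_IMG.sub(r"<image:\1>", visible_text)


message = 'What is unusual here? <img src="data:image/jpeg;base64,QUJD">'
print(visible_to_internal(message))
```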

Apr 23 '23 23:04 Wojtab

@oobabooga actually, instead of reverting it, I'll open a separate PR where both of the representations are the same

Apr 23 '23 23:04 Wojtab

I'll wait for your PR then. I might have used ['internal'] instead of ['visible'] somewhere.

Apr 24 '23 00:04 oobabooga