
Request: LLaVA (Large Language and Vision Assistant): visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

Gitterman69 opened this issue (6 comments)

https://github.com/haotian-liu/LLaVA

Gitterman69 (Apr 18 '23)

Hell yes, this will be great for image sharing in chat!

OrphBean (Apr 19 '23)

Now the question is whether LLaVA or MiniGPT-4 (#1312) is better.

mcmonkey4eva (Apr 19 '23)

LLaVA uses the original LLaMA weights, so it's less censored. However, it uses the full FP16 13B model; otherwise you would have to retrain it.

So I think the interesting thing is going to be finding out if this works in 4-bit on either repo.

Re-training seems like it will take some serious hardware if you don't use what they pre-made.
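As a rough sanity check on why 4-bit matters here, the weight memory of a 13B-parameter model can be estimated with simple arithmetic. This covers weights only; activations, the KV cache, and the CLIP encoder all add more on top:

```python
# Back-of-envelope VRAM needed just for the weights of a 13B-parameter model,
# at FP16 versus 4-bit quantization. Real usage is higher (activations,
# KV cache, vision encoder), so treat these as lower bounds.

def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return n_params * bits_per_weight / 8 / 2**30

fp16 = weight_memory_gib(13e9, 16)  # needs a 32 GB-class card
int4 = weight_memory_gib(13e9, 4)   # fits a consumer 8-12 GB card
print(f"FP16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB")
```

which prints roughly 24.2 GiB for FP16 versus 6.1 GiB for 4-bit, explaining the interest in a 4-bit quant.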

Ph0rk0z (Apr 21 '23)

I managed to quantize it to 4-bit (using ooba's fork); text generation works. I'm yet to try it with images, but I feel it has a high chance of working, as we can run the CLIP and translation layers separately and just input the image tokens from the translation layer. I'll probably continue tomorrow, and if it works I'll upload the quant plus some hacky way to run it. The additional tokens are there: when I ran it using the original LLaMA config I got

    size mismatch for lm_head.weight: copying a param with shape torch.Size([32003, 5120]) from checkpoint, the shape in current model is torch.Size([32000, 5120])

and there are 3 new tokens, so it matches.

EDIT: I copied the input embeds from their example (screenshots omitted), so it's going to work in 4 bits.
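The idea of running the vision side separately and "just inputting the image tokens" can be sketched as follows. This is a hypothetical illustration, not LLaVA's actual API: the token id, hidden size assignment of the 3 extra tokens, and function names are all assumptions. The CLIP encoder plus projection ("translation") layer produce image patch embeddings, which get spliced into the text embedding sequence at the position of a placeholder image token before the (quantized) language model sees them:

```python
import numpy as np

HIDDEN = 5120          # LLaMA-13B hidden size
IMG_TOKEN_ID = 32001   # hypothetical id of an <image> placeholder token
                       # (one of the 3 tokens added on top of the 32000 vocab)

def splice_image_embeds(token_ids, embed_table, image_embeds):
    """Replace the <image> placeholder with projected image patch embeddings."""
    text_embeds = embed_table[token_ids]              # (seq, HIDDEN)
    pos = int(np.argmax(token_ids == IMG_TOKEN_ID))   # first placeholder
    return np.concatenate(
        [text_embeds[:pos], image_embeds, text_embeds[pos + 1:]], axis=0
    )

embed_table = np.zeros((32003, HIDDEN))   # resized vocab: 32000 + 3 new tokens
image_embeds = np.ones((256, HIDDEN))     # e.g. 256 projected patch embeddings
ids = np.array([1, 42, IMG_TOKEN_ID, 99])
out = splice_image_embeds(ids, embed_table, image_embeds)
print(out.shape)  # (259, 5120): 3 text tokens + 256 image "tokens"
```

The language model then only ever sees a sequence of embeddings, which is why the 4-bit quantized weights don't need to know anything about the vision side.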

Wojtab (Apr 21 '23)

https://huggingface.co/wojtab/llava-13b-v0-4bit-128g if someone wants to use it. The embeds can be generated with the changes here: https://github.com/haotian-liu/transformers_llava/commit/4398c2b96b00b98bf684ce6c4ad620bd3938a6c5 (for the test from the previous message I just added torch.save(input_embeds) after line 642).

Wojtab (Apr 21 '23)

https://github.com/oobabooga/text-generation-webui/pull/1487

Wojtab (Apr 23 '23)

OK, LLaVA support is merged into main; see https://github.com/oobabooga/text-generation-webui/tree/main/extensions/llava

Wojtab (Apr 24 '23)

Edit: the model .bin was truncated in a crash. My bad.

Hi, I had a bit of trouble. This probably doesn't belong here, per se.

I'm on the 04b98a8..c86e9a3 main from the ooba repo, GPTQ-for-LLaMa CUDA, transformers 4.29dev, torch 2, and ROCm 5.4.3. I get the following error. GPTQ-for-LLaMa CUDA does compile for ROCm and runs other models. Yes, I was surprised too.

    Traceback (most recent call last):
      File "~/text-generation-webui/server.py", line 102, in load_model_wrapper
        shared.model, shared.tokenizer = load_model(shared.model_name)
      File "~/text-generation-webui/modules/models.py", line 150, in load_model
        model = load_quantized(model_name)
      File "~/text-generation-webui/modules/GPTQ_loader.py", line 176, in load_quantized
        model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
      File "~/text-generation-webui/modules/GPTQ_loader.py", line 77, in _load_quant
        model.load_state_dict(safe_load(checkpoint), strict=False)
      File "~/anaconda3/envs/text/lib/python3.10/site-packages/safetensors/torch.py", line 101, in load_file
        result[k] = f.get_tensor(k)
    RuntimeError: shape '[32003, 5120]' is invalid for input of size 0
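As the edit above notes, the "invalid for input of size 0" error came from a checkpoint truncated by a crash. A safetensors file starts with an 8-byte little-endian header length followed by a JSON header that records every tensor's byte offsets, so truncation can be detected up front by comparing the declared payload size against the actual file size. This is a minimal sketch based on the published file format, not a function from the safetensors library:

```python
import json
import struct

def check_safetensors(path: str) -> bool:
    """Return True if the file is at least as large as its header declares."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # u64, little-endian
        header = json.loads(f.read(header_len))
        # Largest declared end offset among all tensors (skip "__metadata__").
        expected = max(
            (t["data_offsets"][1] for name, t in header.items()
             if name != "__metadata__"),
            default=0,
        )
        f.seek(0, 2)                                     # seek to end of file
        actual_payload = f.tell() - 8 - header_len
    return actual_payload >= expected
```

A check like this before load_state_dict would turn the confusing reshape error into a clear "file is truncated" message.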

cornpo (Apr 24 '23)

This is lit when you use it wrong.


Ph0rk0z (Apr 24 '23)