text-generation-webui
Add lora support?
https://github.com/tloen/alpaca-lora
This repo got LLaMA-7B working with a LoRA trained on the Alpaca JSON file. There is also a notebook with code.
https://huggingface.co/tloen/alpaca-lora-7b
This would be amazing!
I think GPTQ would be where lora support gets added, no?
Given this looks like the key addition from the alpaca lora code -
```python
model = LLaMAForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
```
This should be the next step.
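For reference, here is that addition extended into a minimal end-to-end sketch (not webui code). It assumes the llama-patched transformers fork and peft version that tloen/alpaca-lora targeted at the time; the prompt is the standard Alpaca template and the generation settings are illustrative only:

```python
# Minimal sketch, assuming the llama fork of transformers (LLaMAForCausalLM /
# LLaMATokenizer) and the peft version tloen/alpaca-lora targets; not webui code.
from peft import PeftModel
from transformers import LLaMAForCausalLM, LLaMATokenizer

tokenizer = LLaMATokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LLaMAForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,       # int8 weights via bitsandbytes
    device_map="auto",       # let accelerate place the layers
)
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")  # attach the LoRA

# Standard Alpaca prompt template
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWrite a poem about the transformers Python library.\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```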
- [x] Add a tab where you can load pre-trained LoRAs ~~and train your own~~
After that we will need someone to come up with the textgen version of civitai :^)
WIP here: https://github.com/oobabooga/text-generation-webui/pull/366
My device is a GTX 1650 4 GB, i5-12400, 40 GB RAM.
I have set up llama-7b according to the wiki.
I can run it with python server.py --listen --auto-devices --model llama-7b
and everything goes well!
But I can't run with --load-in-8bit, which according to https://github.com/oobabooga/text-generation-webui/pull/366 I should be using.
When I start with python server.py --listen --auto-devices --model llama-7b --load-in-8bit
there is no error and everything seems good, BUT once I click the 'Generate' button in the web UI,
this error comes up in the terminal:
(textgen) wk:text-generation-webui$ python server.py --listen --auto-devices --model llama-7b --load-in-8bit
Loading llama-7b...
Auto-assiging --gpu-memory 3 for your GPU to try to prevent out-of-memory errors.
You can manually set other values.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00, 4.81it/s]
Loaded the model in 7.58 seconds.
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
cuBLAS API failed with status 15
A: torch.Size([16, 4096]), B: torch.Size([4096, 4096]), C: (16, 4096); (lda, ldb, ldc): (c_int(512), c_int(131072), c_int(512)); (m, n, k): (c_int(16), c_int(4096), c_int(4096))
Exception in thread Thread-4 (gentask):
Traceback (most recent call last):
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/wk/data/text-generation-webui/modules/callbacks.py", line 64, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/wk/data/text-generation-webui/modules/text_generation.py", line 196, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
return self.sample(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
outputs = self(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
outputs = self.model(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
layer_outputs = decoder_layer(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
@wk-mike I also have a GTX 1650 on my laptop and this error also happens to me when I try to use --load-in-8bit with it.
I have never been able to figure out the cause. You can start a new issue for this with the error message that you just posted, maybe someone else can help.
OK!
It works on CPU:
python server.py --listen --cpu --model llama-7b --load-in-8bit
I tested it and it's fine.
Merged now
pip install -r requirements.txt
python download-model.py tloen/alpaca-lora-7b
python server.py --model llama-7b --load-in-8bit
Then select the LoRA in the parameters tab. Alternatively, start the web UI with
python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b
I can run it on CPU, but I still get an error on GPU:
python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b --cpu → good
python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b --auto-devices → not good
with:
(textgen) wk:text-generation-webui$ python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b --auto-devices
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading llama-7b...
Auto-assiging --gpu-memory 3 for your GPU to try to prevent out-of-memory errors.
You can manually set other values.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00, 4.83it/s]
Loaded the model in 6.97 seconds.
alpaca-lora-7b
Adding the LoRA alpaca-lora-7b to the model...
Traceback (most recent call last):
File "/home/wk/data/text-generation-webui/server.py", line 240, in <module>
add_lora_to_model(shared.lora_name)
File "/home/wk/data/text-generation-webui/modules/LoRA.py", line 17, in add_lora_to_model
shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"))
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 143, in from_pretrained
model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 514, in __init__
super().__init__(model, peft_config)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 79, in __init__
self.base_model = LoraModel(peft_config, model)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 118, in __init__
self._find_and_replace()
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 163, in _find_and_replace
new_module = Linear(target.in_features, target.out_features, bias=bias, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 293, in __init__
nn.Linear.__init__(self, in_features, out_features, **kwargs)
TypeError: Linear.__init__() got an unexpected keyword argument 'has_fp16_weights'
It's impressive that this works in CPU mode at all, given that it doesn't seem to work in GPU mode without --load-in-8bit at the moment.
I can run it on CPU, but still get an error on GPU: `python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b`
Hi, did you find any solution for this? I'm having the same issue.
Merged now
pip install -r requirements.txt
python download-model.py tloen/alpaca-lora-7b
python server.py --model llama-7b --load-in-8bit
Then select the LoRA in the parameters tab. Alternatively, start the web UI with
python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b
Hm, I did exactly this and I get
server.py: error: unrecognized arguments: --lora alpaca-lora-7b
EDIT: I'm stupid, I forgot to update with git pull. But now I get this error and can't start the web UI even without --lora:
Traceback (most recent call last):
  File "J:\LLaMA\text-generation-webui\server.py", line 13, in <module>
    import modules.chat as chat
  File "J:\LLaMA\text-generation-webui\modules\chat.py", line 14, in <module>
    from modules.html_generator import fix_newlines, generate_chat_html
  File "J:\LLaMA\text-generation-webui\modules\html_generator.py", line 11, in <module>
    import markdown
ModuleNotFoundError: No module named 'markdown'
Run pip install -r requirements.txt
Run
pip install -r requirements.txt
I did that. Had to do the 8-bit fix all over again after that, then something else broke, and I was so frustrated that I deleted everything and am trying a fresh installation now...
Try this, it worked for me:
https://github.com/oobabooga/text-generation-webui/issues/400#issuecomment-1474876859
Hey!
I made the LoRA work in 4-bit: python server.py --model llama-7b --gptq-bits 4 --cai-chat
I changed the lora.py from this package: C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\peft\tuners\lora.py
Here's the modified version (I don't know how to put files on GitHub, so here's a link): https://pastebin.com/eUWZsirk
I added these 2 instructions to the _find_and_replace() method (illustrated in the sketch below):
- `new_module = None` (add this line to initialize the new_module variable)
- `if new_module is None: continue`
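To make the effect of those two lines concrete, here is a small self-contained toy of the pattern: modules that don't get a LoRA replacement are simply skipped instead of falling through to a constructor call that blows up. Everything below (class names, the toy model) is made up for illustration; it is not peft's actual code.

```python
import torch.nn as nn

class LoRALinearStub(nn.Linear):
    """Stand-in for peft's LoRA Linear replacement (hypothetical name)."""

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 8)   # plain Linear: gets wrapped
        self.act = nn.ReLU()            # anything else: must be skipped, not crash

def find_and_replace(model: nn.Module) -> None:
    for name, target in list(model.named_children()):
        new_module = None                                    # first added line: always defined
        if type(target) is nn.Linear:
            new_module = LoRALinearStub(target.in_features, target.out_features)
        if new_module is None:                               # second added line: skip unreplaced modules
            continue
        setattr(model, name, new_module)

m = TinyModel()
find_and_replace(m)
print(type(m.q_proj).__name__, type(m.act).__name__)  # LoRALinearStub ReLU
```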
@BadisG I am not sure if this is really working. Here is a test
Prompt
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library.
### Response:
Preset
Debug-deterministic
LoRA
https://huggingface.co/chansung/alpaca-lora-13b
8-bit mode results
python server.py --load-in-8bit --model llama-13b-hf --listen --lora alpaca-lora-13b
Transformers, the Python library,
Can help you with your data science.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.
4-bit mode results
python server.py --gptq-bits 4 --model llama-13b-hf --listen --lora alpaca-lora-13b
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library.
### Response:
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library.
### Response:
Write a poem about the transformers Python library.
4-bit mode results without any LoRA
python server.py --gptq-bits 4 --model llama-13b-hf --listen
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library.
### Response:
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library.
### Response:
Write a poem about the transformers Python library.
@BadisG I am not sure if this is really working. Here is a test
Are you sure this is the right way to test it? Tbh I'm not a specialist on this at all, but in llama.cpp you have a seed you can reuse to get the same result every time, no matter the generation parameters preset.
If you have something like this in your code, maybe you could consider testing it that way. Either the "Debug-deterministic" preset is way too restrictive and a simple LoRA can't change anything, or my fix wasn't good enough...
EDIT: The LoRA works with a random generation parameters preset. When I use NovelAI-Sphinx Moth and disable "do_sample", it gives the same answer every time:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library.
### Response::
The transformer is a robot that can change from one vehicle to another. It has a red body, blue head and yellow arms. The transformer's name is Optimus Prime. He is a leader of the Autobots. His main weapon is his sword. He also has a gun called "the power". He can fly in space or on land. He can go...
When I add the Lora I got this:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library.
### Response::
The transformer is a machine learning algorithm that can be used to classify data into different categories, such as cars and trucks. The transformer is based on the idea of neural networks. Neural Networks are a type of artificial intelligence (AI) that uses deep learning to learn from examples. Deep Learning is a branch of AI that learns...
This is what I got from ChatGPT about do_sample = False:
"if you use do_sample=False, the model uses greedy decoding to generate text, consistently choosing the word with the highest probability. In this case, the text generation process is deterministic, and the use of a seed does not have a significant effect on the results."
In summary, if you want reproducible results, just use do_sample = False and you can choose any generation parameters preset you want.
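A quick way to check that claim outside the web UI (a minimal sketch; gpt2 is only a small placeholder here, any causal LM behaves the same way under greedy decoding):

```python
# Sketch: with do_sample=False, generate() uses greedy decoding, so repeated
# calls on the same prompt return identical token ids. gpt2 is a placeholder model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Write a poem about the transformers Python library.", return_tensors="pt")
out1 = model.generate(**inputs, do_sample=False, max_new_tokens=40)
out2 = model.generate(**inputs, do_sample=False, max_new_tokens=40)
assert (out1 == out2).all()  # greedy decoding is deterministic
```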
Boss, there is this comment for the 4-bit, don't know if you saw it already: https://github.com/oobabooga/text-generation-webui/issues/332#issuecomment-1474883977 I am in the process of trying it myself.
Lora 100% is supposed to make it deterministic: https://github.com/oobabooga/text-generation-webui/issues/419
If it is not then the lora isn't working.
@Ph0rk0z does that make sense? Why would there be no sampling when a LoRA is in use?
Lora 100% is supposed to make it deterministic: #419
If it is not then the lora isn't working.
The presence of Lora does not alter the deterministic aspect of your model. Regardless of whether you have Lora or not, you can always modify the reproducibility of your outcomes by adjusting the seed or enabling/disabling the "do_sample" feature.
Well, 4-bit by itself is deterministic. 8-bit/fp16 was not, unless you count producing a stream of unending garbage every time as deterministic. Turning off do_sample lets 8-bit generate without the int8 threshold parameter for me... but text never appeared. So I think the 4-bit LoRA is going to be suspect, especially without do_sample.
about greedy decoding: https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d In short it is :(
Well, 4-bit by itself is deterministic. 8-bit/fp16 was not, unless you count producing a stream of unending garbage every time as deterministic. Turning off do_sample lets 8-bit generate without the int8 threshold parameter for me... but text never appeared. So I think the 4-bit LoRA is going to be suspect, especially without do_sample.
about greedy decoding: https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d In short it is :(
when I put "do_sample = False" and I generate 10 times the text with Lora, I got 10 times the same result ("Text LORA" 10 times). The result is exactly the same when I generate 10 times the text without Lora ("Text NO LORA" 10 times)
But of course "Text LORA" and "Text NO LORA" are different to each other, that's the point of a Lora, to give you something different compared to the raw model
Yes... but do_sample = False generations are repetitive garbage, and you used NovelAI-Sphinx Moth in your example. With randomness-enabled generation parameters you can avoid problems like the ones I experienced, for a while, too. I really saw what that debug preset means once I started using it.
The point of that preset is to be restrictive. Nobody is saying you can't keep using it like this, but it still looks broken if it can't use anything but greedy decoding.
Also, another question, because I have only 1.5 brain cells: do things like top_p and temperature even do anything without do_sample?
Do things like top_p and temperature even do anything without do_sample?
No they don't; with do_sample=False, generation is just greedy decoding and those sampling parameters are ignored.
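A tiny check of that (placeholder model again; with do_sample=False the sampling knobs are simply never applied, so changing them doesn't change the output):

```python
# Sketch: with do_sample=False, temperature/top_p are ignored, so wildly different
# values still produce the same greedy output. gpt2 is a placeholder model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tok("The transformers library", return_tensors="pt")

hot = lm.generate(**ids, do_sample=False, temperature=2.0, top_p=0.1, max_new_tokens=20)
cold = lm.generate(**ids, do_sample=False, temperature=0.2, top_p=0.9, max_new_tokens=20)
assert (hot == cold).all()  # identical: the knobs were never used
```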
Back to the original point: I see people claiming to use this 30b LoRA. How? https://huggingface.co/chansung/alpaca-lora-30b
Yes... but do_sample = False generations are repetitive garbage, and you used NovelAI-Sphinx Moth in your example. With randomness-enabled generation parameters you can avoid problems like the ones I experienced, for a while, too. I really saw what that debug preset means once I started using it. The point of that preset is to be restrictive. Nobody is saying you can't keep using it like this, but it still looks broken if it can't use anything but greedy decoding.
But your "debug preset" also has do_sample = False; that's exactly what makes it a debug preset, actually.
The best way to see the reproducibility of an output is to just fix the seed.
On llama.cpp we can do that:
SEED = 1 (Always the same output for a fixed seed)

SEED = 2 (Always the same output for a fixed seed)

That way you can have do_sample = True + a fixed seed = a good result that will always be the same = perfect reproducibility.
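For reference, a sketch of the same fixed-seed idea with transformers' own helper (placeholder model; set_seed reseeds Python, NumPy and Torch before each call):

```python
# Sketch: keep do_sample=True but reseed before each call; the sampled output
# is then reproducible. gpt2 is a placeholder model.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Write a poem about the transformers Python library.", return_tensors="pt")

set_seed(1)
out_a = model.generate(**inputs, do_sample=True, temperature=0.7, top_p=0.9, max_new_tokens=40)
set_seed(1)
out_b = model.generate(**inputs, do_sample=True, temperature=0.7, top_p=0.9, max_new_tokens=40)
assert (out_a == out_b).all()  # same seed, same sampled output
```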
Do things like top_p and temperature even do anything without do_sample?
No they don't; with do_sample=False, generation is just greedy decoding and those sampling parameters are ignored.
Back to the original point: I see people claiming to use this 30b LoRA. How? https://huggingface.co/chansung/alpaca-lora-30b
An A6000 48 GB? Running it in 4-bit like he did? Gotta test it all and see.
Is there something I need to do to get LoRA working in a multi-GPU configuration?

I think I'm running into this bug https://github.com/huggingface/peft/issues/115#issuecomment-1460706852
Looks like I may need to modify PeftModel.from_pretrained or PeftModelForCausalLM but I'm not sure where...
I think something is broken for int8 split-model lora right now... but not sure where to fix... I think this guy did it... https://github.com/huggingface/peft/issues/115#issuecomment-1441016348
I found a really hacky fix...
I kept running OOM as the model loads lopsided... so I made the following changes to the modules/LoRA.py file (see the sketch after this list):
- replace `params['device_map'] = {'': 0}` with `#params['device_map'] = {'': 0}`
- add `params['max_memory'] = {0: "16GiB", 1: "25GiB"}` just below it
Note: replace 16GiB and 25GiB with whatever launch parameter you're sending to server.py as the --gpu-memory value.
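Roughly, the touched part of modules/LoRA.py reads like this afterwards (only the affected lines are shown; the from_pretrained call is the one visible in the traceback below, and the GiB values are the example ones above):

```python
# Sketch of the edit described above, inside add_lora_to_model() in modules/LoRA.py.
# Only the touched lines are shown; match the GiB values to your --gpu-memory setting.
#params['device_map'] = {'': 0}                      # original line, commented out
params['max_memory'] = {0: "16GiB", 1: "25GiB"}      # added just below it

shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)
```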

I've got a new error somehow during the loading of the 13b LoRA:
CUDA SETUP: Loading binary C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
Adding the LoRA alpaca-lora-13b to the model...
Traceback (most recent call last):
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\routes.py", line 374, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\blocks.py", line 1017, in process_api
    result = await self.call_function(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\blocks.py", line 835, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "D:\Large Language Models\text-generation-webui\server.py", line 73, in load_lora_wrapper
    add_lora_to_model(selected_lora)
  File "D:\Large Language Models\text-generation-webui\modules\LoRA.py", line 22, in add_lora_to_model
    shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\peft\peft_model.py", line 167, in from_pretrained
    max_memory = get_balanced_memory(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\accelerate\utils\modeling.py", line 452, in get_balanced_memory
    per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
ZeroDivisionError: integer division or modulo by zero
I fixed it by changing the modeling.py file in this package: C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\accelerate\utils\modeling.py
On line 452, replace this: per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
with this: per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices) if num_devices != 0 else 0
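A self-contained way to see what that guard does (not accelerate's code, just the same expression isolated so the num_devices == 0 case is visible):

```python
# Toy illustration of the guarded division above; num_devices == 0 is the case
# that raised ZeroDivisionError before the change.
def per_gpu_share(module_size: int, num_devices: int, low_zero: bool = False) -> int:
    if num_devices == 0:
        return 0  # nothing to balance across
    return module_size // (num_devices - 1 if low_zero else num_devices)

print(per_gpu_share(13_000_000_000, 2))  # 6500000000
print(per_gpu_share(13_000_000_000, 0))  # 0 instead of a crash
```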