text-generation-webui

Add lora support?

Open 7Tenku opened this issue 2 years ago • 110 comments

https://github.com/tloen/alpaca-lora

This repo got LLaMA-7B working with a LoRA trained on the Alpaca JSON file. There is also a notebook with code.

7Tenku avatar Mar 15 '23 10:03 7Tenku

https://huggingface.co/tloen/alpaca-lora-7b

7Tenku avatar Mar 15 '23 11:03 7Tenku

This would be amazing!

lolxdmainkaisemaanlu avatar Mar 15 '23 13:03 lolxdmainkaisemaanlu

I think GPTQ would be where lora support gets added, no?

Given this looks like the key addition from the alpaca lora code -

model = LLaMAForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
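
For context, a minimal usage sketch of what generation with the LoRA-patched model could look like (not from the linked repo; the AutoTokenizer call and the prompt are just illustrative assumptions):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
prompt = "### Instruction:\nWrite a haiku about llamas.\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # model from the snippet above
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))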

fblissjr avatar Mar 15 '23 15:03 fblissjr

This should be the next step.

  • [x] Add a tab where you can load pre-trained LoRAs ~and train your own~

After that we will need someone to come up with the textgen version of civitai :^)

oobabooga avatar Mar 16 '23 03:03 oobabooga

WIP here: https://github.com/oobabooga/text-generation-webui/pull/366

oobabooga avatar Mar 17 '23 00:03 oobabooga

My device is a GTX 1650 4GB, i5-12400, 40GB RAM.

I have set up llama-7b according to the wiki.

I can run it with python server.py --listen --auto-devices --model llama-7b and everything goes well!

But I can't run it with --load-in-8bit, which https://github.com/oobabooga/text-generation-webui/pull/366 says I should use. When I start with python server.py --listen --auto-devices --model llama-7b --load-in-8bit there is no error and everything seems fine, but once I click the 'Generate' button in the web UI,

this error comes up in the terminal:

(textgen) wk:text-generation-webui$ python server.py --listen --auto-devices --model llama-7b --load-in-8bit
Loading llama-7b...
Auto-assiging --gpu-memory 3 for your GPU to try to prevent out-of-memory errors.
You can manually set other values.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00,  4.81it/s]
Loaded the model in 7.58 seconds.
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
cuBLAS API failed with status 15
A: torch.Size([16, 4096]), B: torch.Size([4096, 4096]), C: (16, 4096); (lda, ldb, ldc): (c_int(512), c_int(131072), c_int(512)); (m, n, k): (c_int(16), c_int(4096), c_int(4096))
Exception in thread Thread-4 (gentask):
error detected
Traceback (most recent call last):
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/wk/data/text-generation-webui/modules/callbacks.py", line 64, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/wk/data/text-generation-webui/modules/text_generation.py", line 196, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
    outputs = self.model(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
    layer_outputs = decoder_layer(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

wk-mike avatar Mar 17 '23 13:03 wk-mike

@wk-mike I also have a GTX 1650 on my laptop and this error also happens to me when I try to use --load-in-8bit with it.

I have never been able to figure out the cause. You can start a new issue for this with the error message that you just posted, maybe someone else can help.

oobabooga avatar Mar 17 '23 13:03 oobabooga

OK!

It works with the CPU: I tested python server.py --listen --cpu --model llama-7b --load-in-8bit and it's OK.

wk-mike avatar Mar 17 '23 13:03 wk-mike

Merged now

pip install -r requirements.txt
python download-model.py tloen/alpaca-lora-7b
python server.py --model llama-7b --load-in-8bit

Then select the LoRA in the parameters tab. Alternatively, start the web UI with

python server.py --listen --model llama-7b --load-in-8bit  --lora alpaca-lora-7b
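
For anyone curious, the --lora flag essentially wraps the already-loaded model with PEFT. A simplified sketch of what modules/LoRA.py does (the real file may pass extra device_map/memory parameters):

from pathlib import Path
from peft import PeftModel

import modules.shared as shared

def add_lora_to_model(lora_name):
    print(f"Adding the LoRA {lora_name} to the model...")
    # Wrap the base model that server.py has already loaded (e.g. with --load-in-8bit)
    shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"))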

oobabooga avatar Mar 17 '23 14:03 oobabooga

I can run it with the CPU, but I still get an error with the GPU.

python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b --cpu works.

python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b --auto-devices does not, failing with:

(textgen) wk:text-generation-webui$ python server.py --listen --model llama-7b --load-in-8bit  --lora alpaca-lora-7b --auto-devices

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading llama-7b...
Auto-assiging --gpu-memory 3 for your GPU to try to prevent out-of-memory errors.
You can manually set other values.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00,  4.83it/s]
Loaded the model in 6.97 seconds.
alpaca-lora-7b
Adding the LoRA alpaca-lora-7b to the model...
Traceback (most recent call last):
  File "/home/wk/data/text-generation-webui/server.py", line 240, in <module>
    add_lora_to_model(shared.lora_name)
  File "/home/wk/data/text-generation-webui/modules/LoRA.py", line 17, in add_lora_to_model
    shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"))
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 143, in from_pretrained
    model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 514, in __init__
    super().__init__(model, peft_config)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 79, in __init__
    self.base_model = LoraModel(peft_config, model)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 118, in __init__
    self._find_and_replace()
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 163, in _find_and_replace
    new_module = Linear(target.in_features, target.out_features, bias=bias, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 293, in __init__
    nn.Linear.__init__(self, in_features, out_features, **kwargs)
TypeError: Linear.__init__() got an unexpected keyword argument 'has_fp16_weights'

wk-mike avatar Mar 17 '23 15:03 wk-mike

It's impressive that this works in CPU mode at all, given that it doesn't seem to work in GPU mode without --load-in-8bit at the moment.

oobabooga avatar Mar 17 '23 15:03 oobabooga

I can run it with the CPU, but still get an error with the GPU: `python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b`

Hi, did you find any solution for this? I'm having the same issue.

athu16 avatar Mar 18 '23 14:03 athu16

Merged now

pip install -r requirements.txt
python download-model.py tloen/alpaca-lora-7b
python server.py --model llama-7b --load-in-8bit

Then select the LoRA in the parameters tab. Alternatively, start the web UI with

python server.py --listen --model llama-7b --load-in-8bit  --lora alpaca-lora-7b

Hm, I did exactly this and I get

server.py: error: unrecognized arguments: --lora alpaca-lora-7b

EDIT: I'm stupid. Forgot to update with git pull. But now I get this error and can't start the web UI even without --lora:

Traceback (most recent call last):
  File "J:\LLaMA\text-generation-webui\server.py", line 13, in <module>
    import modules.chat as chat
  File "J:\LLaMA\text-generation-webui\modules\chat.py", line 14, in <module>
    from modules.html_generator import fix_newlines, generate_chat_html
  File "J:\LLaMA\text-generation-webui\modules\html_generator.py", line 11, in <module>
    import markdown
ModuleNotFoundError: No module named 'markdown'

patrickmros avatar Mar 18 '23 15:03 patrickmros

Run pip install -r requirements.txt

oobabooga avatar Mar 18 '23 15:03 oobabooga

Run pip install -r requirements.txt

I did that. Had to do the 8-bit fix all over again after that, then something else broke, and I was so frustrated that I deleted everything and I'm trying a fresh installation now...

patrickmros avatar Mar 18 '23 15:03 patrickmros

Try this, it worked for me:

https://github.com/oobabooga/text-generation-webui/issues/400#issuecomment-1474876859

oobabooga avatar Mar 18 '23 15:03 oobabooga

Hey!

I made the LoRA work in 4-bit: python server.py --model llama-7b --gptq-bits 4 --cai-chat

I changed lora.py in this package: C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\peft\tuners\lora.py

Here's the modified version (I don't know how to put files on GitHub, so I'll put a link): https://pastebin.com/eUWZsirk

I added these two statements to the _find_and_replace() method:

  1. new_module = None # Add this line to initialize the new_module variable

  2. if new_module is None: continue

BadisG avatar Mar 18 '23 15:03 BadisG

@BadisG I am not sure if this is really working. Here is a test

Prompt

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library. 
### Response:

Preset

Debug-deterministic

LoRA

https://huggingface.co/chansung/alpaca-lora-13b

8-bit mode results

python server.py --load-in-8bit --model llama-13b-hf --listen --lora alpaca-lora-13b
Transformers, the Python library,
Can help you with your data science.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.

4-bit mode results

python server.py --gptq-bits 4 --model llama-13b-hf --listen --lora alpaca-lora-13b
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library. 
### Response:
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library. 
### Response:
Write a poem about the transformers Python library.

4-bit mode results without any LoRA

python server.py --gptq-bits 4 --model llama-13b-hf --listen
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library. 
### Response:
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library. 
### Response:
Write a poem about the transformers Python library.

oobabooga avatar Mar 19 '23 15:03 oobabooga

@BadisG I am not sure if this is really working. Here is a test

Are you sure this is the right way to do it? Tbh I'm not a specialist on this at all, but on llama.cpp you have a seed you can reuse to get the same result every time, no matter the generation parameters preset.

If you have something like this in your code, maybe you could test it that way. Either the "Debug-deterministic" preset is so restrictive that a simple LoRA can't change anything, or my fix wasn't good enough...

EDIT: The LoRA works with other generation parameter presets. When I use NovelAI-Sphinx Moth and disable "do_sample", it gives the same answer every time:

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library. 
### Response::
The transformer is a robot that can change from one vehicle to another. It has a red body, blue head and yellow arms. The transformer's name is Optimus Prime. He is a leader of the Autobots. His main weapon is his sword. He also has a gun called "the power". He can fly in space or on land. He can go...

When I add the LoRA I get this:

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library. 
### Response::
The transformer is a machine learning algorithm that can be used to classify data into different categories, such as cars and trucks. The transformer is based on the idea of neural networks. Neural Networks are a type of artificial intelligence (AI) that uses deep learning to learn from examples. Deep Learning is a branch of AI that learns...

This is what I got from ChatGPT about do_sample = False:

"if you use do_sample=False, the model uses greedy decoding to generate text, consistently choosing the word with the highest probability. In this case, the text generation process is deterministic, and the use of a seed does not have a significant effect on the results."

In summary, if you want reproducible results, just use do_sample = False and you can choose any generation parameters preset you want.
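
For reference, this is roughly what greedy decoding looks like with transformers (a minimal sketch; model, tokenizer and prompt are assumed to already be set up as in the commands above):

# Greedy decoding (do_sample=False) is deterministic: repeated calls give identical text.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = [
    tokenizer.decode(model.generate(**inputs, do_sample=False, max_new_tokens=64)[0])
    for _ in range(3)
]
assert outputs[0] == outputs[1] == outputs[2]  # same output every time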

BadisG avatar Mar 19 '23 15:03 BadisG

Boss, there is this comment about 4-bit; I don't know if you saw it already: https://github.com/oobabooga/text-generation-webui/issues/332#issuecomment-1474883977. I am in the process of trying it myself.

if-ai avatar Mar 19 '23 15:03 if-ai

The LoRA is 100% supposed to make it deterministic: https://github.com/oobabooga/text-generation-webui/issues/419

If it is not, then the LoRA isn't working.

Ph0rk0z avatar Mar 19 '23 16:03 Ph0rk0z

@Ph0rk0z does that make sense? Why would there be no sampling when a LoRA is in use?

oobabooga avatar Mar 19 '23 16:03 oobabooga

The LoRA is 100% supposed to make it deterministic: #419

If it is not, then the LoRA isn't working.

The presence of a LoRA does not alter how deterministic your model is. Regardless of whether you have a LoRA or not, you can always control the reproducibility of your outputs by adjusting the seed or by enabling/disabling the do_sample feature.

BadisG avatar Mar 19 '23 16:03 BadisG

Well, 4-bit by itself is deterministic. 8-bit/fp16 was not, unless you count producing a stream of unending garbage every time as deterministic. Turning off do_sample lets 8-bit generate without the int8 threshold parameter for me, but text never appeared. So I think the 4-bit LoRA is going to be suspect, especially without do_sample.

About greedy decoding: https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d In short, it is :(

Ph0rk0z avatar Mar 19 '23 16:03 Ph0rk0z

Well, 4-bit by itself is deterministic. 8-bit/fp16 was not, unless you count producing a stream of unending garbage every time as deterministic. Turning off do_sample lets 8-bit generate without the int8 threshold parameter for me, but text never appeared. So I think the 4-bit LoRA is going to be suspect, especially without do_sample.

About greedy decoding: https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d In short, it is :(

when I put "do_sample = False" and I generate 10 times the text with Lora, I got 10 times the same result ("Text LORA" 10 times). The result is exactly the same when I generate 10 times the text without Lora ("Text NO LORA" 10 times)

But of course "Text LORA" and "Text NO LORA" are different to each other, that's the point of a Lora, to give you something different compared to the raw model

BadisG avatar Mar 19 '23 16:03 BadisG

Yes, but do_sample = False generations are repetitive garbage, and you use NovelAI-Sphinx Moth in your example. With randomness enabled in the generation parameters you can avoid problems like the ones I experienced, for a while, too. I only really saw what that debug preset means once I started using it.

The point of that preset is to be restrictive. Nobody is saying you can't keep using it like this, but it still looks broken if it can't use anything but greedy decoding.

Also, another question, because I have only 1.5 brain cells: do things like top_p and temperature even do anything without do_sample?

Ph0rk0z avatar Mar 19 '23 17:03 Ph0rk0z

Do things like top_p and temperature even do anything without do_sample?

No they don't; with do_sample disabled, generation is just greedy decoding.

Back to the original point: I see people claiming to use this 30b LoRA. How? https://huggingface.co/chansung/alpaca-lora-30b

oobabooga avatar Mar 19 '23 17:03 oobabooga

Yes, but do_sample = False generations are repetitive garbage, and you use NovelAI-Sphinx Moth in your example. With randomness enabled in the generation parameters you can avoid problems like the ones I experienced, for a while, too. I only really saw what that debug preset means once I started using it.

The point of that preset is to be restrictive. Nobody is saying you can't keep using it like this, but it still looks broken if it can't use anything but greedy decoding.

But your "debug preset" also has do_sample = False, that's exactly why it that makes it as a debug preset actually.

The best way to see the reproducibility of an output is to just fix the seed.

On llama.cpp we can do that:

SEED = 1 (always the same output for that fixed seed)

SEED = 2 (always the same output for that fixed seed)

That way you can have do_sample = True + a fixed seed = a good result that will always be the same = perfect reproducibility.
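
With transformers, the equivalent would be fixing the torch seed before each sampled generation (a sketch under the same assumptions as above; bit-exact reproducibility also depends on the CUDA kernels being deterministic):

import torch

def generate_with_seed(seed):
    torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    out = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=64)
    return tokenizer.decode(out[0])

# Same seed -> same text; a different seed will usually give different text.
assert generate_with_seed(1) == generate_with_seed(1)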

BadisG avatar Mar 19 '23 17:03 BadisG

Do things like top_p and temperature even do anything without do_sample?

No they don't; with do_sample disabled, generation is just greedy decoding.

Back to the original point: I see people claiming to use this 30b LoRA. How? https://huggingface.co/chansung/alpaca-lora-30b

An A6000 48GB? Running it in 4-bit like he did? Gotta test it all and see.

Ph0rk0z avatar Mar 19 '23 17:03 Ph0rk0z

Is there something I need to do to support LoRA in a multi-GPU configuration?


generic-username0718 avatar Mar 19 '23 18:03 generic-username0718

I think I'm running into this bug https://github.com/huggingface/peft/issues/115#issuecomment-1460706852

Looks like I may need to modify PeftModel.from_pretrained or PeftModelForCausalLM but I'm not sure where...

generic-username0718 avatar Mar 19 '23 19:03 generic-username0718

I think something is broken for int8 split-model LoRA right now, but I'm not sure where to fix it. I think this guy did it: https://github.com/huggingface/peft/issues/115#issuecomment-1441016348

generic-username0718 avatar Mar 19 '23 19:03 generic-username0718

I found a really hacky fix...

I kept running out of memory because the model loads lopsided, so I made the following changes to the modules/LoRA.py file:

  1. replace params['device_map'] = {'': 0} with #params['device_map'] = {'': 0}
  2. add params['max_memory'] = {0: "16GiB", 1: "25GiB"} just below it.

note: replace 16GiB and 25GiB with whatever values you're passing to server.py via the --gpu-memory launch parameter (the resulting lines are sketched below).
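
Concretely, the relevant part of modules/LoRA.py ends up looking roughly like this (a sketch, not the exact file; params is the kwargs dict the module passes to PeftModel.from_pretrained, and the memory values are examples):

# params['device_map'] = {'': 0}                  # 1. original line, commented out
params['max_memory'] = {0: "16GiB", 1: "25GiB"}   # 2. match your --gpu-memory values
shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)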


generic-username0718 avatar Mar 19 '23 21:03 generic-username0718

I've somehow got a new error during the loading of the 13b LoRA:

CUDA SETUP: Loading binary C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
Adding the LoRA alpaca-lora-13b to the model...
Traceback (most recent call last):
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\routes.py", line 374, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\blocks.py", line 1017, in process_api
    result = await self.call_function(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\blocks.py", line 835, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "D:\Large Language Models\text-generation-webui\server.py", line 73, in load_lora_wrapper
    add_lora_to_model(selected_lora)
  File "D:\Large Language Models\text-generation-webui\modules\LoRA.py", line 22, in add_lora_to_model
    shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\peft\peft_model.py", line 167, in from_pretrained
    max_memory = get_balanced_memory(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\accelerate\utils\modeling.py", line 452, in get_balanced_memory
    per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
ZeroDivisionError: integer division or modulo by zero

I fixed it by changing the modeling.py file in this package: C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\accelerate\utils\modeling.py

On line 452, replace this: per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)

with this: per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices) if num_devices != 0 else 0

BadisG avatar Mar 19 '23 22:03 BadisG