text-generation-webui
LoRA training fails to save checkpoint
Describe the bug
I am able to train, but as soon as it tries to save the checkpoint I get the following error. This only occurs on the new installer version with webui.py; the previous version saves checkpoints fine.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Install the new Windows-installer version of text-generation-webui, then run a LoRA training session until it tries to save a checkpoint.
Screenshot
No response
Logs
Exception in thread Thread-7 (threaded_run):
Traceback (most recent call last):
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\text-generation-webui\modules\training.py", line 416, in threaded_run
trainer.train()
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\transformers\trainer.py", line 1662, in train
return inner_training_loop(
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\transformers\trainer.py", line 1918, in _inner_training_loop
self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\transformers\trainer_callback.py", line 369, in on_step_begin
return self.call_event("on_step_begin", args, state, control)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\transformers\trainer_callback.py", line 397, in call_event
result = getattr(callback, event)(
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\text-generation-webui\modules\training.py", line 363, in on_step_begin
lora_model.save_pretrained(f"{lora_file_path}/checkpoint-{tracked.current_steps}/")
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\peft\peft_model.py", line 125, in save_pretrained
output_state_dict = get_peft_model_state_dict(
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\peft\utils\save_and_load.py", line 32, in get_peft_model_state_dict
state_dict = model.state_dict()
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\bitsandbytes\nn\modules.py", line 268, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\bitsandbytes\autograd\_functions.py", line 100, in undo_layout
return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 12.00 GiB total capacity; 10.64 GiB already allocated; 0 bytes free; 11.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Training complete, saving...
Traceback (most recent call last):
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\gradio\routes.py", line 395, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\gradio\blocks.py", line 1193, in process_api
result = await self.call_function(
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\gradio\blocks.py", line 930, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
result = context.run(func, *args)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\gradio\utils.py", line 491, in async_iteration
return next(iterator)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\text-generation-webui\modules\training.py", line 452, in do_train
lora_model.save_pretrained(lora_file_path)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\peft\peft_model.py", line 125, in save_pretrained
output_state_dict = get_peft_model_state_dict(
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\peft\utils\save_and_load.py", line 32, in get_peft_model_state_dict
state_dict = model.state_dict()
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\bitsandbytes\nn\modules.py", line 268, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\bitsandbytes\autograd\_functions.py", line 100, in undo_layout
return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 12.00 GiB total capacity; 10.62 GiB already allocated; 0 bytes free; 11.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
System Info
3080 Ti, running locally
Also getting this using a 4090.
@mcmonkey4eva
See error report @ https://github.com/TimDettmers/bitsandbytes/issues/324
Users previously reported that `pip install bitsandbytes==0.37.2` avoids the OOM issue, though it's a pain to install on Windows.
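(Separately, the allocator hint in the OOM message itself can be worth a try before downgrading. A minimal sketch, assuming the variable is set before torch initializes CUDA; the 512 MiB split size is an arbitrary example value, not a recommendation from this thread:)

```python
import os

# Must be set before torch allocates any CUDA memory, so do it before
# importing torch (or export it in the shell / set it in the Windows
# environment variables before launching the webui).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # the allocator now honors the setting for this process
```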
> See error report @ TimDettmers/bitsandbytes#324
> Users previously reported that `pip install bitsandbytes==0.37.2` avoids the OOM issue, though it's a pain to install on Windows.
On native Windows this made things worse for me. On WSL it seems to have resolved the issue.
I also confirm that `pip install bitsandbytes==0.37.2` messes things up for me on Windows, but it fixed the problem in WSL. I also had to do an additional step in WSL: replacing `bitsandbytes_cpu.so` with `bitsandbytes_cuda117.so` in `\my_user_name\miniconda3\envs\my_env_name\lib\python3.10\site-packages\bitsandbytes`. Without this step, I cannot load models in 8-bit in order to train a LoRA.
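(For reference, that swap can be scripted. A minimal sketch, assuming the miniconda layout described above; the env path, user name, and the exact library filenames are placeholders to adjust for your install:)

```python
import shutil
from pathlib import Path

# Placeholder path: substitute your own user and env names.
# Filenames are as given in the comment above; on some installs they
# carry a "lib" prefix (libbitsandbytes_*.so), so check the directory first.
pkg = Path.home() / "miniconda3/envs/my_env_name/lib/python3.10/site-packages/bitsandbytes"

cpu_lib = pkg / "bitsandbytes_cpu.so"
cuda_lib = pkg / "bitsandbytes_cuda117.so"

shutil.copyfile(cpu_lib, pkg / "bitsandbytes_cpu.so.bak")  # keep a backup of the CPU build
shutil.copyfile(cuda_lib, cpu_lib)                         # CUDA 11.7 build takes the CPU slot
```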
> See error report @ TimDettmers/bitsandbytes#324
> Users previously reported that `pip install bitsandbytes==0.37.2` avoids the OOM issue, though it's a pain to install on Windows.
This helped me, thank you very much!
LoRA training fails with a Torch out-of-memory error at the end of training (10 seconds left; see the error message at the end), but if I pin bitsandbytes to 0.37.2, even text generation fails (on Linux, RTX 3080, 10 GB).
Error when the LoRA training result is supposed to be saved:
warnings.warn(
cuBLAS API failed with status 15
A: torch.Size([33, 2560]), B: torch.Size([7680, 2560]), C: (33, 7680); (lda, ldb, ldc): (c_int(1056), c_int(245760), c_int(1056)); (m, n, k): (c_int(33), c_int(7680), c_int(2560))
Traceback (most recent call last):
File "/app/modules/callbacks.py", line 73, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/app/modules/text_generation.py", line 251, in generate_with_callback
shared.model.generate(**kwargs)
File "/app/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/app/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
return self.sample(
File "/app/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
outputs = self(
File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/app/venv/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 662, in forward
outputs = self.gpt_neox(
File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/app/venv/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 545, in forward
outputs = torch.utils.checkpoint.checkpoint(
File "/app/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/app/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/app/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/app/venv/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 541, in custom_forward
return module(*inputs, use_cache, None, output_attentions)
File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/app/venv/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 320, in forward
attention_layer_outputs = self.attention(
File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/app/venv/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 116, in forward
qkv = self.query_key_value(hidden_states)
File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/app/venv/lib/python3.10/site-packages/peft/tuners/lora.py", line 698, in forward
result = super().forward(x)
File "/app/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/app/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/app/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/app/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/app/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
error detected
Output generated in 0.24 seconds (0.00 tokens/s, 0 tokens, context 33, seed 478163328)
Training needs some attention on Windows; LoRA adapters are also very buggy (adding, removing, stacking). Please give this some priority. Thank you.
@hypersniper05 I wish I could do more, but the problem isn't that text-gen-webui doesn't work on Windows (it works fine in itself); it's that the upstream libraries we depend on for the internals aren't well tested on Windows. bitsandbytes in particular is a bit infamous at this point for how unstable its Windows compatibility is, and it is the core of the issue here. I'll test whether it's possible to bypass bitsandbytes entirely. It might be?
Test-ran training on Windows. Just ran the latest one-click installer and followed the monkeypatch install guide (the only thing that was weird was having to shove `run_cmd("python -m pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit")` into `webui.py` to do the install, rather than figuring out the 'proper' way to run miniconda pip installs, lol). Ran perfectly, and didn't even get a VRAM spike from the save for that matter. Everything 'just worked' for me.
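(For anyone reproducing this, a rough sketch of the hack described above, assuming the one-click installer's webui.py and its existing run_cmd helper; the exact placement is a guess, anywhere that runs after the conda environment is activated should do:)

```python
# Inside the one-click installer's webui.py, next to its other
# run_cmd(...) install steps. run_cmd is the helper webui.py already
# defines for running shell commands inside the installer's conda env.
run_cmd("python -m pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit")
```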
I uninstalled bitsandbytes, and training with the 4-bit monkeypatch on Windows just works anyway. As long as you're not actually using 8-bit mode, I think you can just get rid of it and be good?
As for Linux: replace `bitsandbytes` in `requirements.txt` with `bitsandbytes==0.37.0` to make it work! 0.37.2 seems to be buggy.
@sammyf
> As for Linux: replace `bitsandbytes` in `requirements.txt` with `bitsandbytes==0.37.0` to make it work! 0.37.2 seems to be buggy.
Thank you, I've tried `bitsandbytes==0.37.2`, and the OOM issue is gone. But may I ask what sort of buggy behavior you meant? (It runs alright for me, but I wonder if I should switch to `0.37.0` as you suggested.)
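(If you're unsure which version you actually ended up with after these installs and downgrades, a quick check that doesn't assume bitsandbytes exports a version attribute:)

```python
# Report the installed bitsandbytes version via package metadata
# (stdlib only; works whether or not the module itself imports cleanly).
from importlib.metadata import version

print(version("bitsandbytes"))
```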
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.