
LoRA training fails to save checkpoint

Open hypersniper05 opened this issue 1 year ago • 11 comments

Describe the bug

I am able to train, but as soon as it tries to save a checkpoint I get the following error. This only occurs on the new installer version with webui.py; the previous version saved checkpoints fine.

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Install the new Windows-installer version of text-generation-webui and start a LoRA training run.

Screenshot

No response

Logs

Exception in thread Thread-7 (threaded_run):
Traceback (most recent call last):
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\text-generation-webui\modules\training.py", line 416, in threaded_run
    trainer.train()
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\transformers\trainer.py", line 1662, in train
    return inner_training_loop(
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\transformers\trainer.py", line 1918, in _inner_training_loop
    self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\transformers\trainer_callback.py", line 369, in on_step_begin
    return self.call_event("on_step_begin", args, state, control)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\transformers\trainer_callback.py", line 397, in call_event
    result = getattr(callback, event)(
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\text-generation-webui\modules\training.py", line 363, in on_step_begin
    lora_model.save_pretrained(f"{lora_file_path}/checkpoint-{tracked.current_steps}/")
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\peft\peft_model.py", line 125, in save_pretrained
    output_state_dict = get_peft_model_state_dict(
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\peft\utils\save_and_load.py", line 32, in get_peft_model_state_dict
    state_dict = model.state_dict()
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\bitsandbytes\nn\modules.py", line 268, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\bitsandbytes\autograd\_functions.py", line 100, in undo_layout
    return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 12.00 GiB total capacity; 10.64 GiB already allocated; 0 bytes free; 11.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Training complete, saving...
Traceback (most recent call last):
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\gradio\routes.py", line 395, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\gradio\blocks.py", line 1193, in process_api
    result = await self.call_function(
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\gradio\blocks.py", line 930, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\gradio\utils.py", line 491, in async_iteration
    return next(iterator)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\text-generation-webui\modules\training.py", line 452, in do_train
    lora_model.save_pretrained(lora_file_path)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\peft\peft_model.py", line 125, in save_pretrained
    output_state_dict = get_peft_model_state_dict(
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\peft\utils\save_and_load.py", line 32, in get_peft_model_state_dict
    state_dict = model.state_dict()
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\bitsandbytes\nn\modules.py", line 268, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "C:\Users\ermar\OneDrive\Desktop\LLM_Folder\installer_files\env\lib\site-packages\bitsandbytes\autograd\_functions.py", line 100, in undo_layout
    return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 12.00 GiB total capacity; 10.62 GiB already allocated; 0 bytes free; 11.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

System Info

RTX 3080 Ti, running locally
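
For what it's worth, the tail of both tracebacks suggests trying max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of that workaround (the 128 value is an arbitrary example, not a tested setting, and it may not help here since the failing allocation is tiny):

import os

# Must be set before torch is imported anywhere in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the env var is in place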

hypersniper05 avatar Apr 30 '23 04:04 hypersniper05

Also getting this using 4090

RedNax67 avatar May 02 '23 10:05 RedNax67

@mcmonkey4eva

oobabooga avatar May 02 '23 22:05 oobabooga

See error report @ https://github.com/TimDettmers/bitsandbytes/issues/324

Users previously reported that pip install bitsandbytes==0.37.2 avoids the OOM issue, though it's a pain to install on Windows.
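
A quick, generic way to confirm which version actually ended up in the env after pinning (not specific to this repo):

from importlib.metadata import version

# Prints the installed bitsandbytes version, e.g. 0.37.2 after the pin above.
print(version("bitsandbytes"))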

mcmonkey4eva avatar May 03 '23 00:05 mcmonkey4eva

See error report @ TimDettmers/bitsandbytes#324

Users previously reported that pip install bitsandbytes==0.37.2 avoids the OOM issue, though it's a pain to install on Windows.

On native Windows this made things worse for me; on WSL it seems to have resolved the issue.

RedNax67 avatar May 03 '23 06:05 RedNax67

I can also confirm that pip install bitsandbytes==0.37.2 breaks things for me on Windows but fixes the problem in WSL.

But I also have to do an additional step in WSL: replacing bitsandbytes_cpu.so with bitsandbytes_cuda117.so in \my_user_name\miniconda3\envs\my_env_name\lib\python3.10\site-packages\bitsandbytes. Without this step, I cannot load models in 8-bit in order to train a LoRA.
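
A minimal sketch of that swap in Python, assuming the path above (my_user_name and my_env_name are placeholders; depending on the bitsandbytes build the files may carry a lib prefix, e.g. libbitsandbytes_cpu.so):

import shutil

# Placeholder path; adjust to your own WSL environment.
pkg = "/home/my_user_name/miniconda3/envs/my_env_name/lib/python3.10/site-packages/bitsandbytes"

# Back up the CPU library, then overwrite it with the CUDA 11.7 build,
# as described above, so 8-bit model loading works for LoRA training.
shutil.copyfile(f"{pkg}/bitsandbytes_cpu.so", f"{pkg}/bitsandbytes_cpu.so.bak")
shutil.copyfile(f"{pkg}/bitsandbytes_cuda117.so", f"{pkg}/bitsandbytes_cpu.so")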

chrisadas avatar May 05 '23 10:05 chrisadas

See error report @ TimDettmers/bitsandbytes#324

Users previously reported that pip install bitsandbytes==0.37.2 avoids the OOM issue, though it's a pain to install on Windows.

This helped me, thank you very much!

disarmyouwitha avatar May 06 '23 14:05 disarmyouwitha

LoRA training fails with a Torch out-of-memory error at the end of training (about 10 seconds left; see the error message below), but if I pin bitsandbytes to 0.37.2, even text generation fails (on Linux, on an RTX 3080 10GB).

Error when the LoRA training result is supposed to be saved:

warnings.warn(
cuBLAS API failed with status 15
A: torch.Size([33, 2560]), B: torch.Size([7680, 2560]), C: (33, 7680); (lda, ldb, ldc): (c_int(1056), c_int(245760), c_int(1056)); (m, n, k): (c_int(33), c_int(7680), c_int(2560))
Traceback (most recent call last):
  File "/app/modules/callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/app/modules/text_generation.py", line 251, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/app/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/app/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/app/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
    outputs = self(
  File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/app/venv/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 662, in forward
    outputs = self.gpt_neox(
  File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/app/venv/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 545, in forward
    outputs = torch.utils.checkpoint.checkpoint(
  File "/app/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/app/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/app/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/app/venv/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 541, in custom_forward
    return module(*inputs, use_cache, None, output_attentions)
  File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/app/venv/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 320, in forward
    attention_layer_outputs = self.attention(
  File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/app/venv/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 116, in forward
    qkv = self.query_key_value(hidden_states)
  File "/app/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/venv/lib/python3.10/site-packages/peft/tuners/lora.py", line 698, in forward
    result = super().forward(x)
  File "/app/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/app/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/app/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/app/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/app/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
error detected
Output generated in 0.24 seconds (0.00 tokens/s, 0 tokens, context 33, seed 478163328)

sammyf avatar May 07 '23 07:05 sammyf

Training needs some attention on Windows, and LoRA adapters are also very buggy (adding, removing, stacking). Please give this some priority. Thank you.

hypersniper05 avatar May 07 '23 15:05 hypersniper05

@hypersniper05 I wish I could do more, but the problem isn't that text-gen-webui doesn't work on Windows (it works fine in itself); it's that the upstream libraries we depend on for the internal parts aren't well tested on Windows. BitsAndBytes in particular is a bit infamous at this point for how unstable its Windows compatibility is, and it is the core of the issue here. I'll test whether it's possible to bypass b&b entirely. It might be?
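
For reference, a minimal sketch of what bypassing bitsandbytes could look like: load the model in fp16 instead of 8-bit, so the bnb quantized layers never enter the picture (the model name is a placeholder, and this trades the VRAM savings of 8-bit for stability):

import torch
from transformers import AutoModelForCausalLM

# Hypothetical model name; an fp16 load avoids bitsandbytes entirely,
# at the cost of roughly double the 8-bit VRAM footprint.
model = AutoModelForCausalLM.from_pretrained(
    "my-model",
    torch_dtype=torch.float16,
    device_map="auto",
)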


Test-ran training on Windows. Just ran the latest one-click installer and followed the monkeypatch install guide (the only weird part was having to shove run_cmd("python -m pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit") into webui.py to do the install, rather than figuring out the 'proper' way to run miniconda pip installs lol). It ran perfectly; I didn't even get a VRAM spike from the save, for that matter. Everything 'just worked' for me.


I uninstalled bitsandbytes, and training with the 4-bit monkeypatch on Windows just works anyway. As long as you're not actually using 8-bit mode, I think you can just get rid of it and be fine.
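
Concretely, the temporary hack described above amounts to one line added to the installer's webui.py (which already defines run_cmd); remove it once the install has run:

# One-off install of the 4-bit monkeypatch fork inside the one-click env;
# delete this line from webui.py after it has executed once.
run_cmd("python -m pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit")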

mcmonkey4eva avatar May 07 '23 22:05 mcmonkey4eva

As for Linux: replace bitsandbytes in requirements.txt with bitsandbytes==0.37.0 to make it work! 0.37.2 seems to be buggy.
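
i.e., the relevant line in requirements.txt becomes:

bitsandbytes==0.37.0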

sammyf avatar May 08 '23 04:05 sammyf

@sammyf

As for Linux: replace bitsandbytes in requirements.txt with bitsandbytes==0.37.0 to make it work! 0.37.2 seems to be buggy.

Thank you, I've tried bitsandbytes 0.37.2 and the OOM issue is gone. But may I ask what sort of buggy behavior you meant? (It runs alright for me, but I wonder if I should switch to 0.37.0 as you suggested.)

the-unsoul avatar May 22 '23 10:05 the-unsoul

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions[bot] avatar Aug 26 '23 23:08 github-actions[bot]