text-generation-webui
"Save every n steps" in training cause an CUDA out of memory
Describe the bug
# Environment
GPU: RTX 4090 24GB VRAM
Memory: 32GB RAM
CPU: i7-13700K
# Command
python server.py --auto-devices --chat --model-menu --gpu-memory 21GiB 21GiB --cpu-memory 24000MiB --load-in-8bit
I had no issue training LLaMA 7B with "Save every 100 steps", but I hit an OOM error with 13B using the same setting. Evidently, this feature allocates CUDA memory when it creates a checkpoint.
The save interval was originally 1000 steps; I reduced it to 100 steps to verify the bug. I believe you can reproduce it with 5 steps.
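For context, the traceback below shows that the periodic save is driven by a trainer callback: `modules/training.py` calls `lora_model.save_pretrained(...)` from `on_step_begin`. A minimal sketch of that pattern (the class and variable names here are illustrative, not the webui's actual code):

```python
# Rough sketch of a "save every N steps" callback, based on the traceback below.
# Names (SaveEveryNSteps, save_steps) are illustrative; the webui tracks steps
# in its own `tracked` object rather than using TrainerState directly.
from transformers import TrainerCallback

class SaveEveryNSteps(TrainerCallback):
    def __init__(self, lora_model, lora_file_path, save_steps):
        self.lora_model = lora_model
        self.lora_file_path = lora_file_path
        self.save_steps = save_steps

    def on_step_begin(self, args, state, control, **kwargs):
        if state.global_step > 0 and state.global_step % self.save_steps == 0:
            # save_pretrained() builds a full state_dict; with --load-in-8bit,
            # bitsandbytes has to undo its int8 weight layout on the GPU at this
            # point, which is where the extra VRAM gets allocated.
            self.lora_model.save_pretrained(
                f"{self.lora_file_path}/checkpoint-{state.global_step}/"
            )
```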
# Log on UI
Running… 128 / 7296 … 1.10 s/it, 2 minutes / 2 hours … 2 hours remaining
# Log on Console
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 23.99 GiB total capacity; 22.07 GiB already allocated; 0 bytes free; 22.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
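For what it's worth, the allocator hint from the error message can be tried as sketched below; it only mitigates fragmentation, so it is unlikely to rescue a card whose 22+ GiB are genuinely in use (128 MiB is an arbitrary example value).

```python
# Sketch: set the allocator option suggested by the error message before
# PyTorch initializes CUDA (e.g. at the very top of server.py).
# This is purely a fragmentation mitigation, not a fix for real exhaustion.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```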
# Files in the "checkpoint-128" folder
None.
Could you optimize the "Save every n steps" feature? So far the only workaround is to train the 13B model on the dataset without saving checkpoints. (Oh wait... I think the final result cannot be saved either, because the final save hits the same issue.)
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Refer to https://github.com/bublint/ue5-llama-lora
- Try 7B and then 13B on an RTX 4090
- Set the "Save every n steps" value to 5 steps
- Run the training and wait for 10 minutes
- Check the OOM message and the checkpoint folder
Screenshot
No response
Logs
To create a public link, set `share=True` in `launch()`.
Loading raw text file dataset...
Getting model ready...
Prepping for training...
Creating LoRA model...
Starting training...
Exception in thread Thread-4 (threaded_run):
Traceback (most recent call last):
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\text-generation-webui\modules\training.py", line 414, in threaded_run
trainer.train()
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\transformers\trainer.py", line 1662, in train
return inner_training_loop(
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\transformers\trainer.py", line 1918, in _inner_training_loop
self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\transformers\trainer_callback.py", line 369, in on_step_begin
return self.call_event("on_step_begin", args, state, control)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\transformers\trainer_callback.py", line 397, in call_event
result = getattr(callback, event)(
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\text-generation-webui\modules\training.py", line 361, in on_step_begin
lora_model.save_pretrained(f"{lora_file_path}/checkpoint-{tracked.current_steps}/")
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\peft\peft_model.py", line 116, in save_pretrained
output_state_dict = get_peft_model_state_dict(
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\peft\utils\save_and_load.py", line 32, in get_peft_model_state_dict
state_dict = model.state_dict()
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\nn\modules.py", line 268, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\autograd\_functions.py", line 100, in undo_layout
return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 23.99 GiB total capacity; 22.07 GiB already allocated; 0 bytes free; 22.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Training complete, saving...
Traceback (most recent call last):
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\gradio\routes.py", line 395, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\gradio\blocks.py", line 1193, in process_api
result = await self.call_function(
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\gradio\blocks.py", line 930, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
result = context.run(func, *args)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\gradio\utils.py", line 491, in async_iteration
return next(iterator)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\text-generation-webui\modules\training.py", line 450, in do_train
lora_model.save_pretrained(lora_file_path)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\peft\peft_model.py", line 116, in save_pretrained
output_state_dict = get_peft_model_state_dict(
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\peft\utils\save_and_load.py", line 32, in get_peft_model_state_dict
state_dict = model.state_dict()
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\nn\modules.py", line 268, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\autograd\_functions.py", line 100, in undo_layout
return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 23.99 GiB total capacity; 22.07 GiB already allocated; 0 bytes free; 22.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
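A note on the traceback above: the failure happens inside bitsandbytes' `_save_to_state_dict`, where `undo_layout(...)` ends in `.contiguous()`. On a non-contiguous CUDA tensor that call materializes a brand-new GPU copy, so saving a checkpoint temporarily needs extra VRAM on top of the ~22 GiB the 13B model already occupies. A toy illustration of that allocation (not the webui code):

```python
# .contiguous() on a non-contiguous CUDA view allocates a fresh GPU buffer,
# which is the kind of extra allocation the checkpoint save trips over.
import torch

x = torch.empty(4096, 4096, device="cuda").t()   # transposed view, no copy yet
before = torch.cuda.memory_allocated()
y = x.contiguous()                                # forces a new ~64 MiB allocation
print((torch.cuda.memory_allocated() - before) / 2**20, "MiB newly allocated")
```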
System Info
GPU: RTX 4090 24GB VRAM
Memory: 32GB RAM
CPU: i7-13700K
In my case the problem came from bitsandbytes.
With bitsandbytes==0.37.2 there was no problem.
See the issue below.
https://github.com/TimDettmers/bitsandbytes/issues/324
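(If anyone else lands here: the downgrade is just `pip install bitsandbytes==0.37.2` inside the webui's environment, and you can confirm which version actually got picked up with a quick check like the one below.)

```python
# Print the bitsandbytes version installed in the webui environment;
# commenters here report the save-time VRAM spike is gone on 0.37.2.
from importlib.metadata import version
print(version("bitsandbytes"))
```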
Easy fix for me on Ubuntu, thanks!
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
Hey guys, I'm facing the exact same issue with a Llama-2-70B checkpoint on A100 GPUs. In my case I don't think it has to do with bitsandbytes, because I'm not using quantization, but despite that I downgraded to 0.37 as above and it did not solve the issue. Any thoughts?