
"Save every n steps" in training cause an CUDA out of memory

Open yunghoy opened this issue 1 year ago • 2 comments

Describe the bug

# Environment
GPU: RTX 4090 24GB VRAM
Memory: 32GB RAM
CPU: i-13700K

# Command
python server.py --auto-devices --chat --model-menu --gpu-memory 21GiB 21GiB --cpu-memory 24000MiB --load-in-8bit

I had no issue training llama 7B with "Save every 100 steps", but I hit an OOM error with 13B and the same setting. Evidently this feature allocates additional CUDA memory when it creates a checkpoint.

The value was originally 1000 steps; I reduced it to 100 to verify the bug. I believe you can reproduce it with 5 steps.

# Log on UI
Running… 128 / 7296 … 1.10 s/it, 2 minutes / 2 hours … 2 hours remaining

# Log on Console
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 23.99 GiB total capacity; 22.07 GiB already allocated; 0 bytes free; 22.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
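The message above suggests trying the allocator's max_split_size_mb option. A minimal sketch of one way to set it before torch is imported; the 512 MiB value is only an assumed starting point, not a tested recommendation:

```python
# Sketch: apply the allocator hint from the OOM message.
# The 512 MiB split size is an assumption, not a tuned value.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # must be imported only after the variable is set
print(torch.cuda.get_device_name(0))
```

The same variable can also be exported in the shell before launching server.py.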

# Files in the "checkpoint-128" folder
None.

Could you optimize the "Save every n steps" feature? So far the only workaround is to train the 13B model without saving checkpoints. (Oh wait... I think the final result cannot be saved either, for the same reason.)
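For reference, the traceback below shows the OOM happens while bitsandbytes rebuilds the 8-bit base weights inside model.state_dict(). A hedged workaround sketch that collects only the adapter tensors by name, so the 8-bit base layers are never touched; the save_lora_only name and the "lora_" filter are my assumptions based on PEFT's parameter naming, not part of the webui:

```python
import os
import torch

def save_lora_only(lora_model, out_dir):
    """Save only the LoRA adapter tensors, skipping model.state_dict(),
    which would force bitsandbytes to un-quantize the 8-bit base layers."""
    os.makedirs(out_dir, exist_ok=True)
    adapter_state = {
        name: param.detach().cpu()
        for name, param in lora_model.named_parameters()
        if "lora_" in name  # assumption: PEFT names adapter params with "lora_"
    }
    torch.save(adapter_state, os.path.join(out_dir, "adapter_model.bin"))
```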

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Refer to https://github.com/bublint/ue5-llama-lora

  1. Try 7B and then 13B on an RTX 4090.
  2. Set the "Save every n steps" value to 5.
  3. Run the training and wait for about 10 minutes.
  4. Check for the OOM message on the console and the empty checkpoint folder.

Screenshot

No response

Logs

To create a public link, set `share=True` in `launch()`.
Loading raw text file dataset...
Getting model ready...
Prepping for training...
Creating LoRA model...
Starting training...
Exception in thread Thread-4 (threaded_run):
Traceback (most recent call last):
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\text-generation-webui\modules\training.py", line 414, in threaded_run
    trainer.train()
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\transformers\trainer.py", line 1662, in train
    return inner_training_loop(
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\transformers\trainer.py", line 1918, in _inner_training_loop
    self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\transformers\trainer_callback.py", line 369, in on_step_begin
    return self.call_event("on_step_begin", args, state, control)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\transformers\trainer_callback.py", line 397, in call_event
    result = getattr(callback, event)(
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\text-generation-webui\modules\training.py", line 361, in on_step_begin
    lora_model.save_pretrained(f"{lora_file_path}/checkpoint-{tracked.current_steps}/")
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\peft\peft_model.py", line 116, in save_pretrained
    output_state_dict = get_peft_model_state_dict(
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\peft\utils\save_and_load.py", line 32, in get_peft_model_state_dict
    state_dict = model.state_dict()
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\nn\modules.py", line 268, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\autograd\_functions.py", line 100, in undo_layout
    return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 23.99 GiB total capacity; 22.07 GiB already allocated; 0 bytes free; 22.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Training complete, saving...
Traceback (most recent call last):
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\gradio\routes.py", line 395, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\gradio\blocks.py", line 1193, in process_api
    result = await self.call_function(
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\gradio\blocks.py", line 930, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\gradio\utils.py", line 491, in async_iteration
    return next(iterator)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\text-generation-webui\modules\training.py", line 450, in do_train
    lora_model.save_pretrained(lora_file_path)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\peft\peft_model.py", line 116, in save_pretrained
    output_state_dict = get_peft_model_state_dict(
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\peft\utils\save_and_load.py", line 32, in get_peft_model_state_dict
    state_dict = model.state_dict()
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\nn\modules.py", line 268, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "C:\Users\MYUSER_NAME\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\autograd\_functions.py", line 100, in undo_layout
    return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 23.99 GiB total capacity; 22.07 GiB already allocated; 0 bytes free; 22.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

System Info

GPU: RTX 4090 24GB VRAM
Memory: 32GB RAM
CPU: i-13700K

yunghoy avatar Apr 23 '23 05:04 yunghoy

In my case the problem came from bitsandbytes.

When I used bitsandbytes==0.37.2, there was no problem.

See the issue below:

https://github.com/TimDettmers/bitsandbytes/issues/324
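A quick sketch to confirm which version is installed before training; the 0.37.2 pin comes from the issue linked above:

```python
# Check the installed bitsandbytes version; per the linked issue,
# versions newer than 0.37.2 reportedly trigger this OOM during state_dict().
from importlib.metadata import version
print(version("bitsandbytes"))
```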

ark1st avatar Apr 24 '23 05:04 ark1st

> In my case the problem came from bitsandbytes.
>
> When I used bitsandbytes==0.37.2, there was no problem.
>
> See the issue below:
>
> TimDettmers/bitsandbytes#324

Easy fix for me on Ubuntu, thanks!

disarmyouwitha avatar May 06 '23 14:05 disarmyouwitha

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions[bot] avatar Aug 29 '23 23:08 github-actions[bot]

Hey guys, I'm facing the exact same issue with a Llama-2-70B checkpoint on A100 GPUs. In my case I don't think it has to do with bitsandbytes, because I'm not using quantization, but I downgraded to 0.37 as suggested above anyway and that did not solve the issue. Any thoughts?

hassanzadeh avatar Oct 16 '23 17:10 hassanzadeh