simple-llm-finetuner

Getting OOM

Open · alior101 opened this issue 1 year ago · 2 comments

Training on T4:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 14.56 GiB total capacity; 13.25 GiB already allocated; 10.44 MiB free; 13.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I suspect a change of versions in peft or transformers... Does that make sense?
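
One low-effort thing worth trying first is the allocator hint the error message itself points at, set before PyTorch touches the GPU. A minimal sketch; the 128 MiB split size below is an assumption, not a value taken from this repo:

# Hedged sketch: set the allocator hint from the OOM message before any CUDA
# allocation happens. 128 is an assumed split size; smaller values reduce
# fragmentation at some cost in allocation speed.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import torch (and load the model) only after the variable is set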

alior101 · Apr 12 '23, 19:04

Same. This didn't use to happen.

{'loss': 1.1006, 'learning_rate': 2.748091603053435e-05, 'epoch': 0.92}
{'train_runtime': 350.2814, 'train_samples_per_second': 0.374, 'train_steps_per_second': 0.374, 'train_loss': 1.0609159615203625, 'epoch': 1.0}
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/gradio/routes.py", line 395, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 1193, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 916, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.9/dist-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.9/dist-packages/gradio/helpers.py", line 588, in tracked_fn
    response = fn(*args)
  File "/content/simple-llama-finetuner/main.py", line 253, in tokenize_and_train
    model.save_pretrained(output_dir)
  File "/usr/local/lib/python3.9/dist-packages/peft/peft_model.py", line 116, in save_pretrained
    output_state_dict = get_peft_model_state_dict(
  File "/usr/local/lib/python3.9/dist-packages/peft/utils/save_and_load.py", line 32, in get_peft_model_state_dict
    state_dict = model.state_dict()
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
    outputs = torch.empty_like(tensor)  # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 39.56 GiB total capacity; 35.96 GiB already allocated; 4.56 MiB free; 37.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Keyboard interruption in main thread... closing server.
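
The allocation that fails here is the temporary copy bitsandbytes makes of each 8-bit weight inside save_pretrained (undo_layout calls torch.empty_like). A sketch of one mitigation to try, releasing whatever the caching allocator is still holding before the save; whether that frees enough headroom is GPU-dependent, and the helper name below is made up for illustration:

# Hedged sketch: free cached CUDA blocks before saving the PEFT adapter, so the
# temporary copies made in bitsandbytes' undo_layout have room to allocate.
# save_adapter_with_headroom is a hypothetical helper, not code from this repo.
import gc
import torch

def save_adapter_with_headroom(model, output_dir: str) -> None:
    gc.collect()                  # drop Python-side tensor references that are no longer held
    torch.cuda.empty_cache()      # return cached allocator blocks to the driver
    model.save_pretrained(output_dir)  # the PEFT save that OOMs in the traceback above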

MillionthOdin16 · Apr 13 '23, 01:04

In my case it was the bitsandbytes error.

As described in the issue below, the problem does not occur with bitsandbytes==0.37.2.

https://github.com/TimDettmers/bitsandbytes/issues/324
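
A small startup guard makes the pin visible immediately instead of failing late inside save_pretrained. A sketch, assuming 0.37.2 is the last known-good version per the linked issue:

# Hedged sketch: fail fast if the installed bitsandbytes is not the version that
# avoids the save_pretrained OOM. 0.37.2 comes from this thread; treating newer
# releases as affected is an assumption based on the linked upstream issue.
from importlib.metadata import version

KNOWN_GOOD_BNB = "0.37.2"

installed = version("bitsandbytes")
if installed != KNOWN_GOOD_BNB:
    raise RuntimeError(
        f"bitsandbytes {installed} is installed; "
        f"run `pip install bitsandbytes=={KNOWN_GOOD_BNB}` to avoid the OOM on save."
    )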

ark1st · Apr 24 '23, 05:04