Mistral Nemo 12B training: CUDA out of memory only when eval is enabled (2x 3090 Ti, FSDP)
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Enabling eval shouldn't change memory usage; training should not fail just because eval is turned on.
Current behaviour
I can start training Mistral Nemo 12B just fine, but it crashes with CUDA out of memory when returning to training after an eval. If I disable eval entirely, training works fine.
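One mitigation worth trying (untested here, not part of the original report) is to release cached CUDA allocator blocks right after each eval pass, before the next training step allocates its backward buffers. A minimal sketch using the Hugging Face `TrainerCallback` API; the callback class name is made up:

```python
# Untested sketch: free cached CUDA memory after each eval pass.
# Assumes the HF Trainer callback API that axolotl's trainer is built on.
import gc

import torch
from transformers import TrainerCallback


class ClearCudaCacheAfterEval(TrainerCallback):  # hypothetical name
    def on_evaluate(self, args, state, control, **kwargs):
        gc.collect()              # drop lingering Python references to eval tensors
        torch.cuda.empty_cache()  # return reserved-but-unallocated blocks to the driver
        return control
```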
Below is the output from training Mistral Nemo 12B Instruct with eval enabled. The error appears once training resumes after the first eval finishes:
packing_efficiency_estimate: 0.89 total_num_tokens per device: 277808004
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 2.4432, 'grad_norm': 1.6465606689453125, 'learning_rate': 5.000000000000001e-07, 'epoch': 0.0}
0%| | 1/4715 [01:50<145:19:23, 110.98s/it]
[2024-08-06 07:02:40,567] [INFO] [accelerate.accelerator.gather_for_metrics:2406] [PID:14784] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
{'eval_loss': 1.7877388000488281, 'eval_runtime': 482.9937, 'eval_samples_per_second': 1.035, 'eval_steps_per_second': 0.518, 'epoch': 0.0}
0%| | 1/4715 [09:53<145:19:23, 110.98s/it]
Traceback (most recent call last):
[rank1]: Traceback (most recent call last):
[rank1]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank1]: File "<frozen runpy>", line 88, in _run_code
[rank1]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 72, in <module>
[rank1]: fire.Fire(do_cli)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank1]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank1]: component, remaining_args = _CallAndUpdateTrace(
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank1]: component = fn(*varargs, **kwargs)
[rank1]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
[rank1]: return do_train(parsed_cfg, parsed_cli_args)
[rank1]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 67, in do_train
[rank1]: return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank1]: File "/home/owen/axolotl/src/axolotl/train.py", line 187, in train
[rank1]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank1]: return inner_training_loop(
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2151, in backward
[rank1]: loss.backward(**kwargs)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/_tensor.py", line 525, in backward
[rank1]: torch.autograd.backward(
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU has a total capacity of 23.67 GiB of which 1.59 GiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 19.75 GiB is allocated by PyTorch, and 1.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/owen/axolotl/src/axolotl/cli/train.py", line 72, in <module>
fire.Fire(do_cli)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in
Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in
_Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in
_CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/owen/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
return do_train(parsed_cfg, parsed_cli_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/owen/axolotl/src/axolotl/cli/train.py", line 67, in do_train
return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/owen/axolotl/src/axolotl/train.py", line 187, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", l
ine 1938, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", l
ine 2279, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", l
ine 3349, in training_step
self.accelerator.backward(loss, **kwargs)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py",
line 2151, in backward
loss.backward(**kwargs)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/_tensor.py", line 525
, in backward
torch.autograd.backward(
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/__init__.py"
, line 267, in backward
_engine_run_backward(
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/graph.py", l
ine 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backwar
d pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU
[rank0]: Traceback (most recent call last):
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 72, in <module>
[rank0]: fire.Fire(do_cli)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank0]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank0]: component, remaining_args = _CallAndUpdateTrace(
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]: component = fn(*varargs, **kwargs)
[rank0]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
[rank0]: return do_train(parsed_cfg, parsed_cli_args)
[rank0]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 67, in do_train
[rank0]: return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]: File "/home/owen/axolotl/src/axolotl/train.py", line 187, in train
[rank0]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step
[rank0]: self.accelerator.backward(loss, **kwargs)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2151, in backward
[rank0]: loss.backward(**kwargs)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/_tensor.py", line 525, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU
W0806 07:02:51.202000 127806143026240 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 14784 closing signal SIGTERM
E0806 07:02:52.019000 127806143026240 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 14785) of binary: /home/owen/miniconda3/envs/axolotl/bin/python
Traceback (most recent call last):
File "/home/owen/miniconda3/envs/axolotl/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
multi_gpu_launcher(args)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
distrib_run.run(args)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-06_07:02:51
host : owen-train-pc
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 14785)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
wandb: - 0.042 MB of 0.042 MB uploaded
wandb: - 0.010 MB of 0.010 MB uploaded
wandb: Run history:
wandb: eval/loss ▁
wandb: eval/runtime ▁
wandb: eval/samples_per_second ▁
wandb: eval/steps_per_second ▁
wandb: train/epoch ▁▁
wandb: train/global_step ▁▁
wandb: train/grad_norm ▁
wandb: train/learning_rate ▁
wandb: train/loss ▁
wandb:
wandb: Run summary:
wandb: eval/loss 1.78774
wandb: eval/runtime 482.9937
wandb: eval/samples_per_second 1.035
wandb: eval/steps_per_second 0.518
wandb: train/epoch 0.00021
wandb: train/global_step 1
wandb: train/grad_norm 1.64656
wandb: train/learning_rate 0.0
wandb: train/loss 2.4432
wandb:
wandb: 🚀 View run indo-formax-v1.0-lora-4096 at: https://wandb.ai/owenarliawan/indo-nemo-12b/runs/cltbvjh7
wandb: ⭐️ View project at: https://wandb.ai/owenarliawan/indo-nemo-12b
wandb: Synced 6 W&B file(s), 0 media file(s), 1 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20240806_065242-cltbvjh7/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
Steps to reproduce
Train Mistral Nemo 12B Instruct using FSDP and LoRA with 8192 context, then enable evals; training will fail after the first eval.
I am training Mistral Nemo 12B Instruct with the tokenizer replaced by the one from https://huggingface.co/axolotl-ai-co/Mistral-Nemo-Base-2407-chatml so that the ChatML tokens are available for training.
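For reference, the tokenizer swap amounts to saving the ChatML tokenizer over the local model's tokenizer files; a rough sketch (illustrative, not the exact steps I ran):

```python
# Rough sketch of the tokenizer swap (illustrative, not the exact steps used):
# save the ChatML-token tokenizer over the local Instruct model's tokenizer files.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("axolotl-ai-co/Mistral-Nemo-Base-2407-chatml")
tok.save_pretrained("/home/user/models/Mistral-Nemo-Instruct-2407")
```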
Config yaml
base_model: /home/user/models/Mistral-Nemo-Instruct-2407
model_type: AutoModelForCausalLM
train_on_inputs: false
group_by_length: false
load_in_8bit: false
load_in_4bit: false
strict: false
sequence_len: 4096
bf16: auto
fp16:
tf32: false
flash_attention: true
shuffle_merged_datasets: false
# Data
datasets:
  - path: /home/user/datasets/pre_train.jsonl
    type: completion
  - path: /home/user/datasets/instruct.jsonl
    type: sharegpt
    conversation: chatml
test_datasets:
  - path: /home/user/datasets_eval.jsonl
    # You need to specify a split. For "json" datasets the default split is called "train".
    split: train
    type: sharegpt
    conversation: chatml
warmup_steps: 10
dataset_prepared_path: ./lora_last_run_prepared
# Iterations
num_epochs: 1
saves_per_epoch: 16
saves_total_limit: 16
# Evaluation
# val_set_size: 0.0025
eval_max_new_tokens: 128
eval_sample_packing: false
evals_per_epoch: 16
eval_table_size:
# LoRA
output_dir: ./lora_out
adapter: lora
lora_model_dir:
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true
save_safetensors: true
loraplus_lr_ratio: 16
loraplus_lr_embedding: 1e-6
# Sampling
sample_packing: true
pad_to_sequence_len: true
# Batching
gradient_accumulation_steps: 16
micro_batch_size: 1
gradient_checkpointing: unsloth
# wandb
wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
wandb_project: indo-nemo-12b
wandb_entity: # A wandb Team name if using a Team
wandb_watch:
wandb_name: indo-formax-v1.0-lora-4096
wandb_run_id: # Set the ID of your wandb run
wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training
# Optimizer
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.000005
# Misc
early_stopping_patience:
auto_resume_from_checkpoints: true
logging_steps: 1
debug:
weight_decay: 0.1
special_tokens:
  eos_token: "<|im_end|>"
chat_template: chatml
# Multi-GPU
deepspeed:
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: false
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.11
axolotl branch-commit
78b42a3fe13c49e317bc116b9999c30e070322cc
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
You need `load_in_8bit: true` with a LoRA adapter, I think.
You'll need to use 4-bit QLoRA for this if you intend to use FSDP. IIRC 8-bit quantization doesn't play well with FSDP.
> You'll need to use 4-bit QLoRA for this if you intend to use FSDP. IIRC 8-bit quantization doesn't play well with FSDP.

Does that mean using LoRA while setting `load_in_8bit: false` does not work with FSDP? Doesn't LoRA load in 8-bit by default?
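If switching to 4-bit QLoRA as suggested, the relevant config changes would look roughly like this (a sketch of commonly used axolotl settings, not a verified fix for this issue):

```yaml
# Sketch only: 4-bit QLoRA settings as suggested above, not a verified fix.
load_in_8bit: false
load_in_4bit: true
adapter: qlora
# the existing fsdp/fsdp_config block (with fsdp_cpu_ram_efficient_loading: true
# and fsdp_use_orig_params: false) stays as in the config above
```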
@Nero10578 Can we mix continued pretraining and fine-tuning at the same time, as your datasets indicate?
- path: /home/user/datasets/pre_train.jsonl
  type: completion
- path: /home/user/datasets/instruct.jsonl
  type: sharegpt
  conversation: chatml