Mistral Nemo 12B training: CUDA out of memory only when eval is enabled (2x 3090 Ti, FSDP)
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Enabling eval shouldn't change memory usage; training should not fail just because eval is turned on.
Current behaviour
I can start training Mistral Nemo 12B just fine, but it crashes with CUDA out of memory when returning to training after an eval. If I disable eval entirely, training works fine.
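One mitigation worth trying (untested here, not part of the original report) is to release cached CUDA allocator blocks right after each eval pass, before the next training step allocates its backward buffers. A minimal sketch using the Hugging Face `TrainerCallback` API; the callback class name is made up:

```python
# Untested sketch: free cached CUDA memory after each eval pass.
# Assumes the HF Trainer callback API that axolotl's trainer is built on.
import gc

import torch
from transformers import TrainerCallback


class ClearCudaCacheAfterEval(TrainerCallback):  # hypothetical name
    def on_evaluate(self, args, state, control, **kwargs):
        gc.collect()              # drop lingering Python references to eval tensors
        torch.cuda.empty_cache()  # return reserved-but-unallocated blocks to the driver
        return control
```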
Below is the output from training Mistral Nemo 12B Instruct with eval enabled. The error appears once training resumes after the first eval finishes:
packing_efficiency_estimate: 0.89 total_num_tokens per device: 277808004
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 2.4432, 'grad_norm': 1.6465606689453125, 'learning_rate': 5.000000000000001e-07, 'epoch': 0.0}
0%| | 1/4715 [01:50<145:19:23, 110.98s/it]
[2024-08-06 07:02:40,567] [INFO] [accelerate.accelerator.gather_for_metrics:2406] [PID:14784] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
{'eval_loss': 1.7877388000488281, 'eval_runtime': 482.9937, 'eval_samples_per_second': 1.035, 'eval_steps_per_second': 0.518, 'epoch': 0.0}
0%| | 1/4715 [09:53<145:19:23, 110.98s/it]
Traceback (most recent call last):
[rank1]: Traceback (most recent call last):
[rank1]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank1]: File "<frozen runpy>", line 88, in _run_code
[rank1]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 72, in <module>
[rank1]: fire.Fire(do_cli)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank1]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank1]: component, remaining_args = _CallAndUpdateTrace(
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank1]: component = fn(*varargs, **kwargs)
[rank1]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
[rank1]: return do_train(parsed_cfg, parsed_cli_args)
[rank1]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 67, in do_train
[rank1]: return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank1]: File "/home/owen/axolotl/src/axolotl/train.py", line 187, in train
[rank1]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank1]: return inner_training_loop(
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2151, in backward
[rank1]: loss.backward(**kwargs)
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/_tensor.py", line 525, in backward
[rank1]: torch.autograd.backward(
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU has a total capacity of 23.67 GiB of which 1.59 GiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 19.75 GiB is allocated by PyTorch, and 1.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/owen/axolotl/src/axolotl/cli/train.py", line 72, in <module>
fire.Fire(do_cli)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in
Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in
_Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in
_CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/owen/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
return do_train(parsed_cfg, parsed_cli_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/owen/axolotl/src/axolotl/cli/train.py", line 67, in do_train
return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/owen/axolotl/src/axolotl/train.py", line 187, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", l
ine 1938, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", l
ine 2279, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", l
ine 3349, in training_step
self.accelerator.backward(loss, **kwargs)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py",
line 2151, in backward
loss.backward(**kwargs)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/_tensor.py", line 525
, in backward
torch.autograd.backward(
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/__init__.py"
, line 267, in backward
_engine_run_backward(
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/graph.py", l
ine 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backwar
d pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU
[rank0]: Traceback (most recent call last):
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 72, in <module>
[rank0]: fire.Fire(do_cli)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank0]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank0]: component, remaining_args = _CallAndUpdateTrace(
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]: component = fn(*varargs, **kwargs)
[rank0]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
[rank0]: return do_train(parsed_cfg, parsed_cli_args)
[rank0]: File "/home/owen/axolotl/src/axolotl/cli/train.py", line 67, in do_train
[rank0]: return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]: File "/home/owen/axolotl/src/axolotl/train.py", line 187, in train
[rank0]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step
[rank0]: self.accelerator.backward(loss, **kwargs)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2151, in backward
[rank0]: loss.backward(**kwargs)
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/_tensor.py", line 525, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU
W0806 07:02:51.202000 127806143026240 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 14784 closing signal SIGTERM
E0806 07:02:52.019000 127806143026240 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 14785) of binary: /home/owen/miniconda3/envs/axolotl/bin/python
Traceback (most recent call last):
File "/home/owen/miniconda3/envs/axolotl/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
multi_gpu_launcher(args)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
distrib_run.run(args)
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-06_07:02:51
host : owen-train-pc
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 14785)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
wandb: - 0.042 MB of 0.042 MB uploaded
wandb: - 0.010 MB of 0.010 MB uploaded
wandb: Run history:
wandb: eval/loss ▁
wandb: eval/runtime ▁
wandb: eval/samples_per_second ▁
wandb: eval/steps_per_second ▁
wandb: train/epoch ▁▁
wandb: train/global_step ▁▁
wandb: train/grad_norm ▁
wandb: train/learning_rate ▁
wandb: train/loss ▁
wandb:
wandb: Run summary:
wandb: eval/loss 1.78774
wandb: eval/runtime 482.9937
wandb: eval/samples_per_second 1.035
wandb: eval/steps_per_second 0.518
wandb: train/epoch 0.00021
wandb: train/global_step 1
wandb: train/grad_norm 1.64656
wandb: train/learning_rate 0.0
wandb: train/loss 2.4432
wandb:
wandb: 🚀 View run indo-formax-v1.0-lora-4096 at: https://wandb.ai/owenarliawan/indo-nemo-12b/runs/cltbvjh7
wandb: ⭐️ View project at: https://wandb.ai/owenarliawan/indo-nemo-12b
wandb: Synced 6 W&B file(s), 0 media file(s), 1 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20240806_065242-cltbvjh7/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
Steps to reproduce
Train Mistral Nemo 12B Instruct using FSDP and LoRA with 8192 context, then enable evals; training will fail after the first eval.
I am training Mistral Nemo 12B Instruct with the tokenizer replaced by the one from https://huggingface.co/axolotl-ai-co/Mistral-Nemo-Base-2407-chatml so that the ChatML tokens are available for training.
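For reference, the tokenizer swap amounts to saving the ChatML tokenizer over the local model's tokenizer files; a rough sketch (illustrative, not the exact steps I ran):

```python
# Rough sketch of the tokenizer swap (illustrative, not the exact steps used):
# save the ChatML-token tokenizer over the local Instruct model's tokenizer files.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("axolotl-ai-co/Mistral-Nemo-Base-2407-chatml")
tok.save_pretrained("/home/user/models/Mistral-Nemo-Instruct-2407")
```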
Config yaml
base_model: /home/user/models/Mistral-Nemo-Instruct-2407
model_type: AutoModelForCausalLM
train_on_inputs: false
group_by_length: false
load_in_8bit: false
load_in_4bit: false
strict: false
sequence_len: 4096
bf16: auto
fp16:
tf32: false
flash_attention: true
shuffle_merged_datasets: false
# Data
datasets:
  - path: /home/user/datasets/pre_train.jsonl
    type: completion
  - path: /home/user/datasets/instruct.jsonl
    type: sharegpt
    conversation: chatml
test_datasets:
  - path: /home/user/datasets_eval.jsonl
    # You need to specify a split. For "json" datasets the default split is called "train".
    split: train
    type: sharegpt
    conversation: chatml
warmup_steps: 10
dataset_prepared_path: ./lora_last_run_prepared
# Iterations
num_epochs: 1
saves_per_epoch: 16
saves_total_limit: 16
# Evaluation
# val_set_size: 0.0025
eval_max_new_tokens: 128
eval_sample_packing: false
evals_per_epoch: 16
eval_table_size:
# LoRA
output_dir: ./lora_out
adapter: lora
lora_model_dir:
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true
save_safetensors: true
loraplus_lr_ratio: 16
loraplus_lr_embedding: 1e-6
# Sampling
sample_packing: true
pad_to_sequence_len: true
# Batching
gradient_accumulation_steps: 16
micro_batch_size: 1
gradient_checkpointing: unsloth
# wandb
wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
wandb_project: indo-nemo-12b
wandb_entity: # A wandb Team name if using a Team
wandb_watch:
wandb_name: indo-formax-v1.0-lora-4096
wandb_run_id: # Set the ID of your wandb run
wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training
# Optimizer
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.000005
# Misc
early_stopping_patience:
auto_resume_from_checkpoints: true
logging_steps: 1
debug:
weight_decay: 0.1
special_tokens:
  eos_token: "<|im_end|>"
chat_template: chatml
# Multi-GPU
deepspeed:
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: false
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.11
axolotl branch-commit
78b42a3fe13c49e317bc116b9999c30e070322cc
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
You need `load_in_8bit: true` with a LoRA adapter, I think.
You'll need to use 4-bit QLoRA for this if you intend to use FSDP. IIRC 8-bit quantization doesn't play well with FSDP.
> You'll need to use 4-bit QLoRA for this if you intend to use FSDP. IIRC 8-bit quantization doesn't play well with FSDP.

Does that mean using LoRA while setting `load_in_8bit: false` does not work with FSDP? Doesn't LoRA load in 8-bit by default?
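If switching to 4-bit QLoRA as suggested, the relevant config changes would look roughly like this (a sketch of commonly used axolotl settings, not a verified fix for this issue):

```yaml
# Sketch only: 4-bit QLoRA settings as suggested above, not a verified fix.
load_in_8bit: false
load_in_4bit: true
adapter: qlora
# the existing fsdp/fsdp_config block (with fsdp_cpu_ram_efficient_loading: true
# and fsdp_use_orig_params: false) stays as in the config above
```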
@Nero10578 Can we mix continued pretraining and fine-tuning at the same time, as your datasets indicate?
- path: /home/user/datasets/pre_train.jsonl
  type: completion
- path: /home/user/datasets/instruct.jsonl
  type: sharegpt
  conversation: chatml