Unable to fine-tune Mistral-7B with DeepSpeed ZeRO-3
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I'm currently testing whether I can fit this sequence_len, so at the very least I would expect an OOM error. I have tried setting sample_packing to false, but it just returns a different error.
My setup is not ideal, since I'm using PyTorch built for CUDA 11.7 and bitsandbytes built for CUDA 12.2 (the CUDA version of my GPU driver).
Here is my pip list (filtered):
deepspeed 0.12.6
flash-attn 2.3.3
optimum 1.13.2
safetensors 0.4.1
tokenizers 0.15.0
torch 2.0.1+cu117
xformers 0.0.22
Current behaviour
I obtain the following errors:
sample_packing=true
```
...
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [58,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [59,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [60,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [61,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [62,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 175956 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 175957 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 175958 closing signal SIGTERM
wandb: WARNING No program path found, not creating job artifact. See https://docs.wandb.ai/guides/launch/create-job
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 175959) of binary: /.../.venv/bin/python
Traceback (most recent call last):
  File "/..../.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/.../.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/.../.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    deepspeed_launcher(args)
  File "/.../.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher
    distrib_run.run(args)
  File "/.../.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/.../.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/.../.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
```
sample_packing=false
```
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/.../axolotl/src/axolotl/cli/train.py", line 42, in <module>
    fire.Fire(do_cli)
  File "/.../.venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/.../.venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/.../.venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/.../axolotl/src/axolotl/cli/train.py", line 38, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/.../axolotl/src/axolotl/train.py", line 142, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/.../.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/.../.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/.../.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2746, in training_step
    self.accelerator.backward(loss)
  File "/.../.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/.../.venv/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/.../.venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/.../.venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/.../.venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/.../.venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2135, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/.../.venv/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/.../.venv/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/.../.venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: The size of tensor a (0) must match the size of tensor b (14336) at non-singleton dimension 1
```
Steps to reproduce
I have created an accelerate config file:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3_bf16.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
And I'm running the following command:
accelerate launch --config_file accelerate_config/multi_gpu_config.yaml -m axolotl.cli.train mistral-7b-instruct-v0.2-mullti-gpu.yaml
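For context, deepspeed/zero3_bf16.json is the ZeRO-3 + bf16 config that ships with axolotl. I'm not pasting the exact file, but a standard DeepSpeed ZeRO-3 bf16 config of that kind looks roughly like the sketch below. Treat it as illustrative only (JSON doesn't allow comments, so noting it here): the exact keys and values in the shipped file may differ, and the "auto" fields are filled in by the Hugging Face Trainer's DeepSpeed integration at launch time.

```json
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "wall_clock_breakdown": false
}
```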
### Config yaml
```yaml
base_model: mistralai/Mistral-7B-Instruct-v0.2
model_type: MistralForCausalLM
model_config:
  sliding_window: 4096
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
  - path: dataset.parquet
    type: alpaca
dataset_prepared_path:
val_set_size: 0.05
output_dir: out
sequence_len: 1024
sample_packing: false
pad_to_sequence_len: true
eval_sample_packing: false
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.000005
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_table_max_new_tokens: 128
saves_per_epoch: 0.5
debug:
deepspeed: deepspeed/zero3_bf16.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
```
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.10
axolotl branch-commit
main/d69ba2b0b76fad112acecd5a1fbb339e6244ff7b
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Update: Increasing micro_batch_size to 8 seems to do the trick; however, I'm now wondering whether it should be possible to use smaller batch sizes.
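For reference, with the posted config the effective global batch size works out to micro_batch_size × gradient_accumulation_steps × num_processes = 1 × 4 × 4 = 16, and 8 × 4 × 4 = 128 after the change (assuming DeepSpeed's "auto" batch fields take these values from the trainer).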
I have the same issue when I'm using sample_packing: false and micro_batch_size: 1.
This looks like a dupe of https://github.com/OpenAccess-AI-Collective/axolotl/issues/1092
What GPU are you using?
I'm using 4x A100 80GB.