axolotl
Mixtral 8x7B full finetune with DS zero3: Assertion error
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
That the model can start training after the DeepSpeed fix on main.
Current behaviour
The model loads and does not OOM, but DeepSpeed raises the same assertion error on every rank when checking that the dtype is identical for all tensors:
assert len(set(t.dtype for t in tensors)) == 1
Traceback
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/axolotl/src/axolotl/train.py", line 129, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1540, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1678, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1284, in prepare
    result = self._prepare_deepspeed(*args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1666, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1225, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1552, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 314, in __init__
    self._create_fp16_partitions_with_defragmentation(self.trainable_param_groups)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 687, in _create_fp16_partitions_with_defragmentation
    device_buffer = __class__.defragment(parameter_partitions)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 522, in defragment
    assert len(set(t.dtype for t in tensors)) == 1
AssertionError
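For what it's worth, a quick way to check whether mixed parameter dtypes are what trips this assertion is to inspect the model's parameters before they are handed to DeepSpeed. This is only a diagnostic sketch (the model id matches the config below; everything else is illustrative), not axolotl code:

# Diagnostic sketch: list the dtypes present among model parameters before
# DeepSpeed ZeRO-3 partitions them. The stage3 defragment() assertion above
# fails whenever more than one dtype shows up here.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16
)

print(Counter(p.dtype for p in model.parameters()))  # expect a single dtype

# Print any parameter that does not match the expected training dtype.
for name, param in model.named_parameters():
    if param.dtype != torch.bfloat16:
        print(f"mixed-dtype parameter: {name} -> {param.dtype}")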
Steps to reproduce
Reuse the config provided below and start training on 8x A100 GPUs.
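A typical multi-GPU launch for a config like this looks roughly as follows (the config filename is a placeholder; the DeepSpeed JSON is picked up from the deepspeed key in the YAML):

accelerate launch -m axolotl.cli.train mixtral-fft.yml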
Config yaml
base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true
# loss is high without this
model_config:
output_router_logits: true
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
- path: <your_data>
dataset_prepared_path:
val_set_size: 0.1
output_dir: /workspace
adapter:
lora_model_dir:
sequence_len: 32768
sample_packing: true
pad_to_sequence_len: true
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0005
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3
warmup_ratio: 0.1
evals_per_epoch: 4
eval_table_size:
eval_table_max_new_tokens:
saves_per_epoch: 1
debug:
deepspeed: zero3.json
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "<|im_end|>"
unk_token: "<unk>"
tokens:
- "<|im_start|>"
- "<|im_end|>"
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.8
axolotl branch-commit
main
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
I have faced hang issues after about 1.5 hours of training time with full finetuning and ZeRO-3.
With the same config I get an OOM while training on 5 nodes with 8x H100 each.
Any config other than the example 4-bit QLoRA that I have tried results in an OOM or some other error.
[2023-12-18 00:52:30,840] [ERROR] [axolotl.load_model:453] [PID:99] [RANK:7] CUDA out of memory. Tried to allocate 112.00 MiB (GPU 7; 79.11 GiB total capacity; 78.12 GiB already allocated; 40.62 MiB free; 78.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
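(For reference, the max_split_size_mb workaround that the error message mentions is set through an environment variable before launching; the value below is just an example, not something I have verified helps here.)

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128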
I have faced hang issues after about 1.5 hours of training time with full finetuning and ZeRO-3.
Same question.
You can try updating NCCL to 2.19.3.
Any updates on this error? I am seeing the same thing with Llama-v2 full finetune using zero3.
I think this was solved by setting bf16 to true instead of auto in your DeepSpeed config.
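For reference, that change would make the bf16 block of the ZeRO-3 JSON look roughly like this (only the relevant key shown; all other keys unchanged):

{
  "bf16": {
    "enabled": true
  }
}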
Does anyone still have this issue after trying casper's suggestion?