
Adafactor is not a supported DeepSpeed Optimizer

Open zhihao-chen opened this issue 2 years ago • 2 comments

System Info

I followed the finetune huggingface t5_11b example from the examples folder and ran it on a different dataset, but got the following error.

/root/work2/work2/chenzhihao/DeepSpeed/deepspeed/runtime/engine.py:1008 in _do_sanity_check

  1005
  1006         if not self.client_optimizer:
  1007             if self.optimizer_name() is not None:
❱ 1008                 assert self._is_supported_optimizer(
  1009                     self.optimizer_name()), "{} is not a supported DeepSpeed Optimizer".
  1010
  1011         if (self.optimizer_name() == LAMB_OPTIMIZER or self.optimizer_name() == ONEBIT_L

AssertionError: Adafactor is not a supported DeepSpeed Optimizer

Do I need to modify the DeepSpeed/deepspeed/runtime/engine.py file?
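For reference, the sanity check above only fires when no client optimizer was handed to the engine (the "if not self.client_optimizer:" branch), so plain DeepSpeed can run an optimizer it does not know by name if the already-constructed instance is passed to deepspeed.initialize, without patching engine.py. Whether the FlagAI Trainer exposes that path is a separate question (see the maintainers' reply below). A minimal sketch of the plain-DeepSpeed pattern, using a stand-in model and transformers.Adafactor; the config values are illustrative, not recommendations:

import deepspeed
import torch
from transformers import Adafactor  # Adafactor lives in transformers, not in DeepSpeed

# Stand-in module; in the real run this would be the T5-11B model.
model = torch.nn.Linear(512, 512)

# Illustrative config: no "optimizer" block, so DeepSpeed does not try to build
# an optimizer by name, and ZeRO is told to accept an untested client optimizer.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 32,
    "zero_optimization": {"stage": 2},
    "zero_allow_untested_optimizer": True,
    "fp16": {"enabled": True},
}

# Constructing the optimizer ourselves and passing the instance sets
# client_optimizer, so the "is not a supported DeepSpeed Optimizer" assert is skipped.
optimizer = Adafactor(model.parameters(), lr=1e-4,
                      scale_parameter=False, relative_step=False)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config)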

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as T5/AltCLIP, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

trainer = MyTrainer(
    env_type='deepspeed',
    tokenizer=tokenizer,
    epochs=config['num_train_epochs'],
    batch_size=config['train_batch_size'],
    gradient_accumulation_steps=config['gradient_accumulation_steps'],
    eval_interval=1000,
    log_interval=1000,
    save_interval=1000,
    save_dir=config['output_dir'],
    experiment_name='t5_11b',
    load_dir=None,
    lr=1e-4,
    fp16=True,
    master_ip='127.0.0.1',
    master_port=17755,
    num_nodes=1,
    num_gpus=3,
    hostfile='./hostfile',
    model_parallel_size=1,
    deepspeed_config='../config/deepspeed.json',
    training_script=file)

trainer.train(
    model,
    train_dataset=train_dataset,
    valid_dataset=eval_dataset,
    collate_fn=t5_seq2seq_collate_fn,
    metric_methods=[bleu_metric, rouge_metric])

Expected behavior

zhihao-chen · Apr 03 '23 06:04

Sorry, FlagAI does not currently support self-defined optimizers with DeepSpeed.
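The practical workaround inside FlagAI, then, is to let DeepSpeed build one of its own optimizers by naming it in the "optimizer" block of deepspeed.json, which is what the reporter switches to in the follow-up below. An illustrative fragment of that block (the hyperparameter values are placeholders, not recommendations):

"optimizer": {
  "type": "Adam",
  "params": {
    "lr": 1e-4,
    "betas": [0.9, 0.98],
    "eps": 1e-6,
    "weight_decay": 0.01
  }
}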

BAAI-OpenPlatform · Apr 04 '23 02:04

OK, but now I've run into another problem:

/root/work2/work2/chenzhihao/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py:673 in __init__

  670         # If we are provided an already-allocated module to prepare.
  671         if module is not None:
  672             assert isinstance(module, torch.nn.Module)
❱ 673             self._convert_to_zero_parameters(module.parameters(recurse=True))
  674
  675         self.use_all_gather_into_tensor = dist.has_all_gather_into_tensor()
  676         if not self.use_all_gather_into_tensor:

TypeError: parameters() got an unexpected keyword argument 'recurse'

deepspeed==0.8.3

deepspeed_config:

{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 32,
  "steps_per_print": 500,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": false,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e7,
    "allgather_bucket_size": 5e7,
    "cpu_offload": true
  },
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0004,
      "weight_decay": 0.01,
      "betas": [0.9, 0.98],
      "eps": 1e-6
    }
  },
  "activation_checkpointing": {
    "partition_activations": false,
    "contiguous_memory_optimization": false
  },
  "wall_clock_breakdown": false
}
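Regarding the TypeError above: torch.nn.Module.parameters() does accept a recurse=True keyword, so the exception suggests that the object reaching DeepSpeed's ZeRO-3 Init overrides parameters() with a different signature, or is not the plain torch.nn.Module that partition_parameters.py expects. A small diagnostic sketch, assuming the FlagAI model instance is available as model (the name is a placeholder, not FlagAI API):

import inspect
import torch

def check_parameters_signature(model):
    # Report whether model.parameters() accepts the recurse kwarg that
    # DeepSpeed ZeRO-3 passes in partition_parameters.py.
    sig = inspect.signature(model.parameters)
    if "recurse" in sig.parameters:
        print("parameters() accepts recurse; the ZeRO-3 call should work.")
    else:
        print("parameters() does not accept recurse; signature is", sig)
        print("defined in:", inspect.getsourcefile(type(model).parameters))

# A stock module accepts recurse:
check_parameters_signature(torch.nn.Linear(4, 4))

If the signature turns out to be overridden, the fix belongs in the model class rather than in DeepSpeed itself.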

zhihao-chen · Apr 06 '23 03:04

Closing for now; please reopen the issue if the problem comes up again. Thanks.

ftgreat · Jun 22 '23 11:06