FlagAI
Adafactor is not a supported DeepSpeed Optimizer
System Info
I followed the finetune huggingface t5_11b example from the examples folder and ran it on a different dataset, but got the following error.
```
/root/work2/work2/chenzhihao/DeepSpeed/deepspeed/runtime/engine.py:1008 in _do_sanity_check

  1005
  1006         if not self.client_optimizer:
  1007             if self.optimizer_name() is not None:
❱ 1008                 assert self._is_supported_optimizer(
  1009                     self.optimizer_name()), "{} is not a supported DeepSpeed Optimizer".
  1010
  1011         if (self.optimizer_name() == LAMB_OPTIMIZER or self.optimizer_name() == ONEBIT_L

AssertionError: Adafactor is not a supported DeepSpeed Optimizer
```

Do I need to modify the DeepSpeed/deepspeed/runtime/engine.py file?
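For context: in plain DeepSpeed this assertion only fires when the optimizer is selected by name in the JSON config; an optimizer object passed directly to `deepspeed.initialize` goes through the `self.client_optimizer` branch instead, so engine.py itself does not need to be modified. A minimal sketch, assuming standalone DeepSpeed usage rather than FlagAI's Trainer (the toy model and `ds_config` below are illustrative placeholders, not taken from this issue):

```python
import torch
import deepspeed
from transformers.optimization import Adafactor

model = torch.nn.Linear(8, 8)          # toy stand-in for the real T5 model

optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,
    relative_step=False,               # required when an explicit lr is given
    scale_parameter=False,
    warmup_init=False,
)

ds_config = {                          # note: no "optimizer" section here
    "train_micro_batch_size_per_gpu": 8,
    "zero_allow_untested_optimizer": True,   # expected when ZeRO is enabled
    "fp16": {"enabled": True},
}

# Passing the optimizer object sets engine.client_optimizer, so the
# _is_supported_optimizer() assertion above is never evaluated.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
)
```

Whether FlagAI's Trainer exposes this client-optimizer path is a separate question, as the reply below notes.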
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as T5/AltCLIP, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
trainer = MyTrainer(
    env_type='deepspeed',
    tokenizer=tokenizer,
    epochs=config['num_train_epochs'],
    batch_size=config['train_batch_size'],
    gradient_accumulation_steps=config['gradient_accumulation_steps'],
    eval_interval=1000,
    log_interval=1000,
    save_interval=1000,
    save_dir=config['output_dir'],
    experiment_name='t5_11b',
    load_dir=None,
    lr=1e-4,
    fp16=True,
    master_ip='127.0.0.1',
    master_port=17755,
    num_nodes=1,
    num_gpus=3,
    hostfile='./hostfile',
    model_parallel_size=1,
    deepspeed_config='../config/deepspeed.json',
    training_script=file)

trainer.train(
    model,
    train_dataset=train_dataset,
    valid_dataset=eval_dataset,
    collate_fn=t5_seq2seq_collate_fn,
    metric_methods=[bleu_metric, rouge_metric])
```
Expected behavior
None
Sorry, self-defined optimizers are currently not supported with DeepSpeed in FlagAI.
OK, but now I'm running into another problem:

```
/root/work2/work2/chenzhihao/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py:673 in __init__

  670         # If we are provided an already-allocated module to prepare.
  671         if module is not None:
  672             assert isinstance(module, torch.nn.Module)
❱ 673             self._convert_to_zero_parameters(module.parameters(recurse=True))
  674
  675         self.use_all_gather_into_tensor = dist.has_all_gather_into_tensor()
  676         if not self.use_all_gather_into_tensor:

TypeError: parameters() got an unexpected keyword argument 'recurse'
```
deepspeed=0.8.3

deepspeed_config:

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 32,
  "steps_per_print": 500,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": false,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e7,
    "allgather_bucket_size": 5e7,
    "cpu_offload": true
  },
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0004,
      "weight_decay": 0.01,
      "betas": [0.9, 0.98],
      "eps": 1e-6
    }
  },
  "activation_checkpointing": {
    "partition_activations": false,
    "contiguous_memory_optimization": false
  },
  "wall_clock_breakdown": false
}
```
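For context on the TypeError: ZeRO stage-3's `Init` calls `module.parameters(recurse=True)`, which every stock `torch.nn.Module` accepts, so the error suggests the model class being wrapped overrides `parameters()` with a narrower signature. A minimal check sketch (the `MyModelClass` name is a hypothetical placeholder for whatever class is actually being trained):

```python
import torch

# Stock torch.nn.Module supports the keyword that partition_parameters.py
# passes at line 673, so this call succeeds on a plain module:
m = torch.nn.Linear(4, 4)
print(len(list(m.parameters(recurse=True))))   # -> 2 (weight and bias)

# If the trained model's class overrides parameters() without accepting
# `recurse`, the same call raises the TypeError shown above. Inspecting the
# override makes this visible (MyModelClass is a hypothetical placeholder):
# import inspect
# print(inspect.signature(MyModelClass.parameters))
```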
Closing this for now; please reopen the issue if the problem comes up again. Thanks.