
When `eval_packing=False`, training fails with `KeyError: 'eval_loss'`


Thanks for providing the wonderful tool TRL. I have a question: when `packing=True` and `eval_packing=False`, training fails with `KeyError: 'eval_loss'`. However, when I remove `eval_packing=False` (so that `eval_packing` falls back to the value of `packing`, i.e. `True`), the training procedure finishes successfully. Why? Is this a bug? Could you please look into it? By the way, in a pre-training scenario, I think `packing=True, eval_packing=False` may be more suitable than `packing=True, eval_packing=True`, right?
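
For clarity, the fallback behaviour I am describing is roughly the following (my own paraphrase of how I understand the option, not the actual TRL source):

```python
# Hypothetical illustration: when eval_packing is left unset (None),
# the eval dataset is prepared with the same packing flag as training.
eval_packing = packing if eval_packing is None else eval_packing
```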

The error log is as follows:

```
File "/mnt/moe-nfs/home/xz/mg/code/train_mg_4diff_cpt_full_para.py", line 164, in <module>
    #data_collator=data_collator,
File "/opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 451, in train
    output = super().train(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1637, in train
    return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2045, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2440, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2521, in _save_checkpoint
    metric_value = metrics[metric_to_check]
KeyError: 'eval_loss'
```
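
From the last frame, the failure seems to come from the best-model metric lookup during checkpoint saving. Here is a minimal sketch of what I believe happens there (paraphrased from the traceback, assuming `metric_to_check` is derived from my `metric_for_best_model` setting; this is not the exact transformers source):

```python
# Rough reconstruction of the failing lookup in Trainer._save_checkpoint
# (assumption: metric_to_check comes from args.metric_for_best_model).
metric_for_best_model = "eval_loss"

def save_checkpoint(metrics: dict) -> float:
    metric_to_check = metric_for_best_model
    if not metric_to_check.startswith("eval_"):
        metric_to_check = f"eval_{metric_to_check}"
    # If the evaluation loop returned no loss entry, this raises KeyError.
    return metrics[metric_to_check]

save_checkpoint({"eval_runtime": 1.0, "eval_samples_per_second": 10.0})
# -> KeyError: 'eval_loss'
```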

These are my training parameters:

```python
trainer_args = TrainingArguments(
    output_dir=model_out_path,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    #warmup_steps=500,
    warmup_ratio=0.1,
    #save_total_limit=10,
    num_train_epochs=2,
    save_steps=250,
    eval_steps=250,
    logging_strategy="steps",
    evaluation_strategy="steps",
    logging_steps=25,
    gradient_accumulation_steps=4,
    #gradient_checkpointing=True,
    #gradient_checkpointing_kwargs={'use_reentrant': False},
    bf16=True,
    do_train=True,
    do_eval=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    load_best_model_at_end=False,
    seed=SEED,
    report_to="wandb",
)

trainer = SFTTrainer(
    model,
    args=trainer_args,
    train_dataset=dataset_train,
    eval_dataset=dataset_test,
    dataset_text_field="text",
    max_seq_length=1024,
    #formatting_func=formatting_prompts_func,
    #data_collator=data_collator,
    tokenizer=tokenizer,
    packing=True,
    #eval_packing=False
)
```

AllenShow · Aug 07 '24