
Trainer is attempting to log a torch.Tensor but MLflow only accepts float.

Open etiennebonnafoux opened this issue 11 months ago • 2 comments

System Info

When I fine-tune an LLM (Mistral-7B), I get a very explicit error when the Trainer logs to MLflow:

Trainer is attempting to log a value of "0.528405487537384" of type <class 'torch.Tensor'> for key "grad_norm" as a metric. MLflow's log_metric() only accepts float and int types so we dropped this attribute.

I do not know how to tell the MLflow callback to convert a torch.Tensor to float before logging.

Who can help?

@sanchit-gandhi @muellerz @pacman100

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Unfortunately, I cannot share either the model or the dataset, since they do not belong to me.

Expected behavior

Convert torch.Tensor to float before logging.

etiennebonnafoux, Mar 21 '24 17:03

Adding

    elif isinstance(v, torch.Tensor) and v.numel() == 1:
        metrics[k] = float(v)

to the on_log function of the MLflowCallback code solves it.
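For illustration, the same check can be sketched as a standalone helper. This is only a minimal sketch, not the code from the callback or from the merged PR: the name `to_loggable_metrics` is hypothetical, and it detects single-element tensors by their `numel()` and `item()` methods (which `torch.Tensor` provides) instead of `isinstance(v, torch.Tensor)`, so that the snippet runs without importing torch.

```python
from typing import Any, Dict, Union

Number = Union[int, float]


def to_loggable_metrics(logs: Dict[str, Any]) -> Dict[str, Number]:
    """Keep only values that a float/int-only metric logger can accept.

    Hypothetical helper for illustration; not part of transformers or MLflow.
    """
    metrics: Dict[str, Number] = {}
    for k, v in logs.items():
        if isinstance(v, (int, float)):
            # Plain Python numbers pass through unchanged.
            metrics[k] = v
        elif callable(getattr(v, "numel", None)) and v.numel() == 1:
            # Single-element tensor-like object (e.g. a torch.Tensor holding
            # grad_norm): unwrap it into a plain Python float.
            metrics[k] = float(v.item())
        # Anything else (strings, multi-element tensors, ...) is skipped.
    return metrics
```

With torch installed, calling `to_loggable_metrics({"grad_norm": torch.tensor(0.5), "loss": 1.2})` would return a dictionary in which both values are plain Python numbers.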

etiennebonnafoux, Mar 21 '24 17:03

Hi @etiennebonnafoux, thanks for reporting!

Integrations like MLflow aren't actively maintained by us, but rather by the contributors who added them. We do want them to work, however! Would you like to open a PR with this fix? That way you get the GitHub contribution for your suggestion.

amyeroberts, Mar 22 '24 12:03

This issue was closed by https://github.com/huggingface/transformers/pull/29932

etiennebonnafoux, Apr 10 '24 19:04