datasets icon indicating copy to clipboard operation
datasets copied to clipboard

Fix label renaming and add a battery of tests

Open Rocketknight1 opened this issue 2 years ago • 1 comments

This PR makes some changes to label renaming in to_tf_dataset(), both to fix some issues when users input something we weren't expecting, and also to make it easier to deprecate label renaming in future, if/when we want to move this special-casing logic to a function in transformers.

The main changes are:

  • Label renaming now only happens when the auto_rename_labels argument is set. For backward compatibility, this defaults to True for now.
  • If the user requests "label" but the data collator renames that column to "labels", the label renaming logic will now handle that case correctly.
  • Added a battery of tests to make this more reliable in future.
  • Adds an optimization to loading in to_tf_dataset() for unshuffled datasets (uses slicing instead of a list of indices)

Fixes #4772

Rocketknight1 avatar Aug 02 '22 16:08 Rocketknight1

The documentation is not available anymore as the PR was closed or merged.

Why don't we deprecate label renaming already instead ?

lhoestq avatar Aug 17 '22 14:08 lhoestq

I think it'll break a lot of workflows if we deprecate it now! There isn't really a non-deprecated workflow yet - once we've added the auto_rename_labels option, then we can have prepare_tf_dataset on the transformers side use that, and then we can consider setting the default option to False, or beginning to deprecate it somehow.

Rocketknight1 avatar Aug 17 '22 16:08 Rocketknight1

I'm worried it's a bit of a waste of time to continue working on this behavior that shouldn't be here in the first place. Do you have a plan in mind ?

lhoestq avatar Aug 18 '22 15:08 lhoestq

@lhoestq Broadly! The plan is:

  1. Create the auto_rename_labels flag with this PR and skip label renaming if it isn't set. Leave it as True for backward compatibility.
  2. Add the label renaming logic to model.prepare_tf_dataset in transformers. That method calls to_tf_dataset() right now. Once the label renaming logic is moved there, model.prepare_tf_dataset will set auto_rename_labels=False when calling to_tf_dataset(), and do label renaming itself.

After step 2, auto_rename_labels is now only necessary for backward compatibility when users use to_tf_dataset directly. I want to leave it alone for a while because the model.prepare_tf_dataset workflow is very new. However, once it is established, we can deprecate auto_rename_labels and then finally remove it from the datasets code and keep it in transformers where it belongs.

Rocketknight1 avatar Aug 18 '22 16:08 Rocketknight1

I see ! Could it be possible to not add auto_rename_labels at all, since you want to remove it at the end ? Something roughly like this:

  1. show a warning in to_tf_dataset whevener a label is renamed automatically, saying that in the next major release this will be removed
  2. add the label renaming logic in transformers (to not have the warning)
  3. after some time, do a major release 3.0.0 and remove label renaming completely in to_tf_dataset

What do you think ? cc @LysandreJik in case you have an opinion on this process.

lhoestq avatar Aug 19 '22 11:08 lhoestq

@lhoestq I think that plan is mostly good, but if we make the change to datasets first then all users will keep getting deprecation warnings until we update the method in transformers and release a new version.

I think we can follow your plan, but make the change to transformers first and wait for a new release before changing datasets - that way there are no visible warnings or API changes for users using prepare_tf_dataset. It also gives us more time to update the docs and try to move people to prepare_tf_dataset so they aren't confused by this!

Rocketknight1 avatar Sep 05 '22 16:09 Rocketknight1

Sounds good to me ! To summarize:

  1. add the label renaming logic in transformers + release
  2. show a warning in to_tf_dataset whevener a label is renamed automatically, saying that in the next major release this will be removed + minor release
  3. after some time, do a major release 3.0.0 and remove label renaming completely in to_tf_dataset

lhoestq avatar Sep 05 '22 18:09 lhoestq

Yep, that's the plan!

Rocketknight1 avatar Sep 06 '22 11:09 Rocketknight1

@lhoestq Are you okay with me merging this for now?

Rocketknight1 avatar Sep 07 '22 14:09 Rocketknight1

Can you remove auto_rename_labels ? I don't think it's a good idea to add it if the plan is to remove it later

lhoestq avatar Sep 08 '22 16:09 lhoestq

Right now, the auto_rename_labels behaviour happens in all cases! Making it an option is the first step in the process of disabling it (and moving the functionality to transformers) and then finally deprecating it.

Rocketknight1 avatar Sep 08 '22 16:09 Rocketknight1