datasets
datasets copied to clipboard
Fix label renaming and add a battery of tests
This PR makes some changes to label renaming in to_tf_dataset()
, both to fix some issues when users input something we weren't expecting, and also to make it easier to deprecate label renaming in future, if/when we want to move this special-casing logic to a function in transformers
.
The main changes are:
- Label renaming now only happens when the
auto_rename_labels
argument is set. For backward compatibility, this defaults toTrue
for now. - If the user requests "label" but the data collator renames that column to "labels", the label renaming logic will now handle that case correctly.
- Added a battery of tests to make this more reliable in future.
- Adds an optimization to loading in
to_tf_dataset()
for unshuffled datasets (uses slicing instead of a list of indices)
Fixes #4772
The documentation is not available anymore as the PR was closed or merged.
Why don't we deprecate label renaming already instead ?
I think it'll break a lot of workflows if we deprecate it now! There isn't really a non-deprecated workflow yet - once we've added the auto_rename_labels
option, then we can have prepare_tf_dataset
on the transformers
side use that, and then we can consider setting the default option to False
, or beginning to deprecate it somehow.
I'm worried it's a bit of a waste of time to continue working on this behavior that shouldn't be here in the first place. Do you have a plan in mind ?
@lhoestq Broadly! The plan is:
- Create the
auto_rename_labels
flag with this PR and skip label renaming if it isn't set. Leave it asTrue
for backward compatibility. - Add the label renaming logic to
model.prepare_tf_dataset
intransformers
. That method callsto_tf_dataset()
right now. Once the label renaming logic is moved there,model.prepare_tf_dataset
will setauto_rename_labels=False
when callingto_tf_dataset()
, and do label renaming itself.
After step 2, auto_rename_labels
is now only necessary for backward compatibility when users use to_tf_dataset
directly. I want to leave it alone for a while because the model.prepare_tf_dataset
workflow is very new. However, once it is established, we can deprecate auto_rename_labels
and then finally remove it from the datasets
code and keep it in transformers
where it belongs.
I see ! Could it be possible to not add auto_rename_labels
at all, since you want to remove it at the end ? Something roughly like this:
- show a warning in
to_tf_dataset
whevener a label is renamed automatically, saying that in the next major release this will be removed - add the label renaming logic in
transformers
(to not have the warning) - after some time, do a major release 3.0.0 and remove label renaming completely in
to_tf_dataset
What do you think ? cc @LysandreJik in case you have an opinion on this process.
@lhoestq I think that plan is mostly good, but if we make the change to datasets
first then all users will keep getting deprecation warnings until we update the method in transformers
and release a new version.
I think we can follow your plan, but make the change to transformers
first and wait for a new release before changing datasets
- that way there are no visible warnings or API changes for users using prepare_tf_dataset
. It also gives us more time to update the docs and try to move people to prepare_tf_dataset
so they aren't confused by this!
Sounds good to me ! To summarize:
- add the label renaming logic in
transformers
+ release - show a warning in
to_tf_dataset
whevener a label is renamed automatically, saying that in the next major release this will be removed + minor release - after some time, do a major release 3.0.0 and remove label renaming completely in
to_tf_dataset
Yep, that's the plan!
@lhoestq Are you okay with me merging this for now?
Can you remove auto_rename_labels
? I don't think it's a good idea to add it if the plan is to remove it later
Right now, the auto_rename_labels
behaviour happens in all cases! Making it an option is the first step in the process of disabling it (and moving the functionality to transformers
) and then finally deprecating it.