transformers
Trainer removes columns before transform is called
System Info
- transformers version: 4.26.1
- Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
- Python version: 3.10.8
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 1.13.1+cu116 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
@sgugger
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I have a text dataset and am attempting to apply a transform to tokenize the contents. I'm using with_transform() for this, and on its own it works fine: the transform removes the text column and adds the input_ids and attention_mask columns.
The problem arises when combining this with the Trainer: it runs _remove_unused_columns() before the transform is called, which has the effect of removing the whole dataset, and I get an error as it tries to read the first batch:
IndexError: Invalid key: 664 is out of bounds for size 0
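To illustrate why nothing survives the pruning step, here is a simplified simulation in plain Python. It is not the Trainer's actual code; it only assumes that unused columns are determined by comparing the dataset's stored column names against the parameter names of the model's forward(). The names `forward` and `unused_columns` are hypothetical stand-ins.

```python
import inspect


def forward(input_ids=None, attention_mask=None, labels=None):
    """Stand-in for a model's forward(); only its parameter names matter here."""


def unused_columns(column_names, forward_fn):
    """Return the stored columns that forward_fn does not accept as arguments."""
    accepted = set(inspect.signature(forward_fn).parameters)
    return sorted(set(column_names) - accepted)


# Columns stored in the dataset. The lazy transform has not run yet, so
# input_ids and attention_mask do not exist as columns at this point.
stored_columns = ["text"]

ignored = unused_columns(stored_columns, forward)
remaining = [c for c in stored_columns if c not in ignored]

print(ignored)    # ['text']
print(remaining)  # []
```

Because the only stored column is text, and text is not a forward() argument, every column is treated as unused before the transform ever gets a chance to produce input_ids and attention_mask.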
Expected behavior
I should be able to combine Dataset.with_transform() and Trainer.
Ah, I've re-read the parameter lists for everything and found remove_unused_columns=False in TrainingArguments. Setting this resolves the issue, so I guess this won't be considered a bug. I think there's room for improvement in the UX though, perhaps a warning along the lines of "After removing unused columns, there were no columns left, this is probably not what you meant to do, right?"
Like if set(dataset.column_names) == set(ignored_columns)...
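A minimal sketch of what such a check might look like, written as a standalone helper so it runs without transformers installed. The function name `warn_if_all_columns_removed` and the message wording are hypothetical; where exactly this would hook into _remove_unused_columns() is left open.

```python
import warnings


def warn_if_all_columns_removed(column_names, ignored_columns):
    """Warn and return True when pruning would leave the dataset without columns."""
    # Guard against an already-empty dataset, where the two sets are trivially equal.
    if column_names and set(column_names) == set(ignored_columns):
        warnings.warn(
            "After removing unused columns there are no columns left. "
            "If the dataset relies on a lazy transform, consider setting "
            "remove_unused_columns=False in TrainingArguments.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Calling warn_if_all_columns_removed(["text"], ["text"]) emits the warning and returns True, while warn_if_all_columns_removed(["text", "input_ids"], ["text"]) stays silent and returns False.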
We could add such a warning yes. Do you want to take a stab at a PR?
Sorry, I've got a full plate at the moment.
I just ran into this issue and would like to open a PR that adds a warning when no columns are left after the _remove_unused_columns() call.