Trainer removes columns before transform is called

Open davidgilbertson opened this issue 2 years ago • 3 comments

System Info

  • transformers version: 4.26.1
  • Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
  • Python version: 3.10.8
  • Huggingface_hub version: 0.12.1
  • PyTorch version (GPU?): 1.13.1+cu116 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@sgugger

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I have a text dataset and am attempting to apply a transform to tokenize the contents. I'm using with_transform() for this and it works fine on its own: the transform removes the text column and adds the input_ids and attention_mask columns.

The problem is that when I combine this with the Trainer, it runs _remove_unused_columns() before calling the transform. This removes every column from the dataset, and I get an error when it tries to read the first batch:

IndexError: Invalid key: 664 is out of bounds for size 0
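
To illustrate why the dataset ends up empty, here is a minimal sketch of the column filtering in plain Python. It assumes the Trainer keeps only columns whose names match the parameters of the model's forward() (plus label names); the `forward` stub and the `split_columns` helper are hypothetical stand-ins, not the library's actual code.

```python
import inspect


def forward(input_ids=None, attention_mask=None, labels=None):
    """Hypothetical stand-in for model.forward()."""
    return None


def split_columns(column_names, forward_fn, label_names=()):
    """Split columns into those the model accepts and those that would be dropped."""
    accepted = set(inspect.signature(forward_fn).parameters)
    accepted |= {"label", "label_ids", *label_names}
    kept = [c for c in column_names if c in accepted]
    ignored = [c for c in column_names if c not in accepted]
    return kept, ignored


# Before the transform runs, the dataset only has the raw "text" column,
# so nothing matches the forward() signature and every column is dropped.
kept, ignored = split_columns(["text"], forward)
print(kept, ignored)
```

Because with_transform() is applied lazily when rows are read, the columns it would produce (input_ids, attention_mask) do not exist yet when this filtering happens.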

Expected behavior

I should be able to combine Dataset.with_transform() and Trainer.

davidgilbertson avatar Mar 13 '23 00:03 davidgilbertson

Ah, I've re-read through the parameter lists for everything and found remove_unused_columns=False in TrainingArguments. Setting this resolves the issue, so I guess this won't be considered a bug. I think there's room for improvement in the UX though, perhaps a warning "After removing unused columns, there were no columns left, this is probably not what you meant to do, right?"

Like if set(dataset.column_names) == set(ignored_columns)...
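
For reference, the workaround described above might be configured roughly as follows. This is a sketch, not a complete training script: the output_dir value is a placeholder that does not come from the issue.

```python
from transformers import TrainingArguments

# "out" is a placeholder output directory, not taken from the issue.
training_args = TrainingArguments(
    output_dir="out",
    # Keep the raw columns so the with_transform() function can still read them.
    remove_unused_columns=False,
)
```

The resulting training_args would then be passed to Trainer via its args parameter, together with the dataset that has the transform attached.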

davidgilbertson avatar Mar 13 '23 01:03 davidgilbertson

We could add such a warning yes. Do you want to take a stab at a PR?

sgugger avatar Mar 13 '23 13:03 sgugger

Sorry, I've got a full plate at the moment.

davidgilbertson avatar Mar 13 '23 20:03 davidgilbertson

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 12 '23 15:04 github-actions[bot]

> We could add such a warning yes. Do you want to take a stab at a PR?

Just ran into this issue. I would like to create a PR that adds a warning when no columns are left after the _remove_unused_columns() call.

kshitijkumbar avatar Nov 25 '23 01:11 kshitijkumbar
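
The check discussed in this thread could look roughly like the sketch below. It is written in plain Python under stated assumptions: `warn_if_all_columns_removed` is a hypothetical helper, not an existing transformers function, and the message wording is only illustrative.

```python
import warnings


def warn_if_all_columns_removed(column_names, ignored_columns):
    """Warn when removing the ignored columns would leave no columns at all.

    Returns True if the warning was emitted, False otherwise.
    """
    if column_names and set(column_names) == set(ignored_columns):
        warnings.warn(
            "After removing unused columns, no columns are left. If the dataset "
            "uses with_transform(), consider remove_unused_columns=False.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# Only the raw "text" column exists and it would be ignored: warn.
all_removed = warn_if_all_columns_removed(["text"], ["text"])
# At least one column survives: stay quiet.
some_kept = warn_if_all_columns_removed(["text", "input_ids"], ["text"])
```

The comparison mirrors the set(dataset.column_names) == set(ignored_columns) condition suggested earlier in the thread.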