
Raise error if `stride` is too high in `TokenClassificationPipeline`

Open boyleconnor opened this issue 1 year ago • 1 comments

What does this PR do?

Previously, users were given no warning when they initialized a TokenClassificationPipeline with too high a value for stride (stride determines how many tokens overlap between chunks if the user chooses to split text into chunks).

Unfortunately, a stride can also be too high when the tokenizer introduces special tokens. For example, bert-base-cased has a maximum length of 512, but each window gets 2 special tokens, so the effective window is 510 tokens and the highest valid stride is 509. There's apparently no easy way to check this in advance (i.e. before the tokenizer is run as part of the pipeline). I think it might be worth fixing the error message ("pyo3_runtime.PanicException: assertion failed: stride < max_len") raised when a tokenizer is called with too high a value of stride, to clarify to users that added special tokens subtract from the effective window size.
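The arithmetic in the bert-base-cased example can be sketched as follows. This is only an illustration of the bound, not the check added by this PR: `max_valid_stride` and `check_stride` are hypothetical helper names, and the maximum length and special-token count are passed in by hand rather than read from a tokenizer.

```python
def max_valid_stride(model_max_length: int, num_special_tokens: int) -> int:
    """Largest stride that still leaves room for at least one new token per window.

    Hypothetical helper: both arguments are supplied by the caller.
    """
    # Special tokens are added to every window, so they shrink the usable size.
    effective_window = model_max_length - num_special_tokens
    # The stride (overlap) must be strictly smaller than the effective window.
    return effective_window - 1


def check_stride(stride: int, model_max_length: int, num_special_tokens: int) -> None:
    """Raise a readable error instead of letting the tokenizer panic later."""
    limit = max_valid_stride(model_max_length, num_special_tokens)
    if stride > limit:
        raise ValueError(
            f"stride={stride} is too high: the maximum length is {model_max_length}, "
            f"but {num_special_tokens} special tokens are added to each window, "
            f"so stride must be at most {limit}."
        )


# Values taken from the bert-base-cased example above.
print(max_valid_stride(512, 2))
```

With these values, `check_stride(509, 512, 2)` passes silently, while `check_stride(510, 512, 2)` raises a `ValueError` that mentions the special tokens.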

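I also thought it was worth clarifying slightly the function of the stride parameter. The way stride works in the context of Hugging Face tokenizers is almost the opposite of the way it works in many other contexts: here it is the number of tokens shared by consecutive chunks, whereas elsewhere it usually means the step between consecutive windows.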
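To make the overlap meaning of stride concrete, here is a simplified sketch of chunking with overlap. It is a toy model, not the tokenizer's actual implementation: `chunk_with_overlap` is a hypothetical function, it works on a plain list of token ids, and it ignores special tokens entirely.

```python
from typing import List


def chunk_with_overlap(tokens: List[int], window: int, stride: int) -> List[List[int]]:
    """Split tokens into windows where `stride` is the overlap between neighbours.

    Hypothetical helper for illustration only; special tokens are ignored.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    # Each new window starts `window - stride` tokens after the previous one,
    # so a larger stride means a smaller step (the opposite of the usual meaning).
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += step
    return chunks


for chunk in chunk_with_overlap(list(range(10)), window=4, stride=2):
    print(chunk)
```

In this toy model, a stride equal to the window size would make the step zero and the loop would never advance, which is why the overlap has to stay strictly below the window size.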

Mostly fixes #22789.

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who should review?

@Narsil

boyleconnor, Apr 22 '23 23:04

The documentation is not available anymore as the PR was closed or merged.