transformers
Raise error if `stride` is too high in `TokenClassificationPipeline`
What does this PR do?
Previously, users got no warning if they initialized a `TokenClassificationPipeline` with too high a value for `stride` (`stride` is the number of tokens that overlap between consecutive chunks when the user chooses to split the text into chunks).
Unfortunately, a `stride` can also be too high if the tokenizer introduces special tokens (e.g. `bert-base-cased` has a maximum length of `512`, but each window gets `2` special tokens, so the highest valid `stride` is `509`), and there is apparently no easy way to check this in advance (i.e. before the tokenizer is run as part of the pipeline). I think it might be worth fixing the error message (`pyo3_runtime.PanicException: assertion failed: stride < max_len`) raised when a tokenizer is called with too high a value of `stride`, to clarify to users that added special tokens subtract from the effective window size.
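To make the arithmetic in the `bert-base-cased` example concrete, here is a minimal sketch. The helper name `max_valid_stride` and its parameters are hypothetical and are not part of `transformers` or `tokenizers`. It assumes the only constraint is that `stride` must be strictly smaller than the window left over after special tokens are added.

```python
def max_valid_stride(max_length: int, num_special_tokens: int) -> int:
    """Return the largest stride that still fits in one window.

    Hypothetical helper for illustration only. It assumes the tokenizer
    requires `stride < max_length - num_special_tokens`.
    """
    # Special tokens take up room in every window, so they shrink the
    # space available for actual content tokens.
    effective_window = max_length - num_special_tokens
    if effective_window <= 0:
        raise ValueError("special tokens leave no room for content tokens")
    # The stride must be strictly less than the effective window.
    return effective_window - 1


# Values taken from the example above: 512 positions, 2 special tokens.
print(max_valid_stride(512, 2))
```

With the values from the example, the effective window is 510 tokens, so the helper returns 509.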
I also thought it was worth clarifying slightly the function of the `stride` parameter. The way `stride` works in the context of Hugging Face tokenizers is almost the opposite of the way it works in many other contexts: here it is the number of tokens shared between consecutive chunks, not the step size between them.
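The overlap meaning of `stride` can be illustrated with a small pure-Python sketch. The function `split_with_overlap` is hypothetical and only mimics the idea; it is not how the tokenizers library is implemented, and it ignores special tokens entirely.

```python
from typing import List, Sequence


def split_with_overlap(tokens: Sequence[int], window: int, stride: int) -> List[List[int]]:
    """Split `tokens` into chunks of at most `window` items.

    Consecutive chunks share `stride` items. Illustrative only: this is a
    simplified model of overlapping windows, not the tokenizer's own code.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    # The start of each chunk advances by the part of the window that is
    # NOT shared with the previous chunk.
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break
        start += step
    return chunks


for chunk in split_with_overlap(list(range(10)), window=4, stride=2):
    print(chunk)
```

Note that a larger `stride` here means a smaller step between chunk starts, and so more chunks. When `stride` means step size, as in many other contexts, a larger value has the opposite effect.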
Mostly fixes #22789.
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who should review?
@Narsil