Handle texts longer than 512 tokens in BERT token classification pipeline
What does this PR do?
Implementation of a BERT-based token classification pipeline that can handle texts longer than the max token length (512, or as otherwise specified by the user). Long texts are truncated to the max length, with the overflowing tokens shifted into subsequent batch items; these are reconstituted in a post-processing step so that the output has the same shape as the input. Entity objects can be reconstituted according to a number of different strategies, which can be selected at pipeline creation time.
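For context, here is a minimal sketch of the overflow mechanism described above, using the standard fast-tokenizer API (illustrative only, not code from this PR; the checkpoint name and dummy input are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
long_text = " ".join(["word"] * 1000)  # longer than 512 tokens

encoded = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    return_overflowing_tokens=True,  # overflow becomes extra batch items
)

# Each entry of input_ids is one <=512-token chunk of the original text;
# overflow_to_sample_mapping records which input each chunk came from,
# which is what lets post-processing stitch chunk outputs back together.
print(len(encoded["input_ids"]))              # number of chunks
print(encoded["overflow_to_sample_mapping"])  # e.g. [0, 0]
```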
Fixes #15177
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [X] Did you read the contributor guideline, Pull Request section?
- [X] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [X] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [X] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @Narsil
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
@Narsil I've updated the test cases as discussed on #15177. These and the docstrings should show how the new pipeline class is intended to work. Not sure why the CircleCI tasks are failing to check out the PR branch, since the tests pass on my machine. Let me know if this is something I need to change.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@Narsil could this PR be reopened?
Is this PR just blocked for review, or is something else required? I worked on a similar issue recently and wanted to contribute.
It was ready for review but the checks were failing. Looks like I need to update a few things now to get the tests to pass, but otherwise it's good for review
Hey, sorry for keeping you in the dark.
We've discussed this internally and I forgot to keep you up-to-date.
Several things came out:
- This is a very large proposal, unlikely to be merged in the current state (Adding a new pipeline is a big burden for maintainers, and this one in the current state feels particularly complex).
- Adding a parameter option to handle long text is something desirable, but not something to be added at all costs.
There are several ways forward we can recommend:
- Trimming the code down a lot to be more manageable within `TokenClassificationPipeline`. Emphasis on not too much code; readable/understandable code is important. We could also mark the parameter as experimental for a while, so that if maintenance becomes too big to bear we can move the code elsewhere at a later stage, and we would keep it if it doesn't. Since this route means moving from `Pipeline` to `ChunkPipeline`, I would advise having at least 2 PRs (one for the `Pipeline` -> `ChunkPipeline` change, which cannot change any existing test code, and then one adding the new parameter); see the sketch after this list. I don't want to discourage you, but this is likely to take a bit of time and energy (trying to manage expectations). Ultimately it would be available to everyone though!
- Using a remote pipeline: https://huggingface.co/docs/transformers/v4.25.1/en/add_new_pipeline#how-to-create-a-custom-pipeline That's an easy way to share your work as a starter and does not require any review process, so you can get started really fast.
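For reference, a rough sketch of the `ChunkPipeline` shape mentioned in the first option. `ChunkPipeline` is the real base class, but the class name and internals here are hypothetical assumptions, not this PR's code:

```python
from transformers.pipelines.base import ChunkPipeline

class MyChunkedNerPipeline(ChunkPipeline):
    def _sanitize_parameters(self, max_length=None, **kwargs):
        preprocess_params = {}
        if max_length is not None:
            preprocess_params["max_length"] = max_length
        return preprocess_params, {}, {}

    def preprocess(self, sentence, max_length=512):
        # A ChunkPipeline preprocess is a generator: each yielded dict is
        # one forward pass. "is_last" is a common convention to mark the
        # final chunk of a given input.
        tokens = self.tokenizer(
            sentence,
            truncation=True,
            max_length=max_length,
            return_overflowing_tokens=True,
            padding=True,
            return_tensors=self.framework,
        )
        num_chunks = len(tokens["input_ids"])
        for i in range(num_chunks):
            model_inputs = {
                k: tokens[k][i : i + 1] for k in ("input_ids", "attention_mask")
            }
            yield {"is_last": i == num_chunks - 1, **model_inputs}

    def _forward(self, model_inputs):
        is_last = model_inputs.pop("is_last")
        logits = self.model(**model_inputs).logits
        return {"is_last": is_last, "logits": logits}

    def postprocess(self, all_outputs):
        # Unlike a plain Pipeline, postprocess receives the accumulated
        # list of chunk outputs and can stitch them into a single result.
        return [output["logits"] for output in all_outputs]
```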
There are also probably a few changes that we should recommend either way:
Do not use `parameterized`. Put the test values inside the tests, not outside (same for expected values). In general, remove every possible indirection within tests. The only benefit of a test is to be as readable as possible, so humans can tell whether the test is valid just by reading it.
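A hypothetical illustration of that style, with inputs and expected values inline (the checkpoint is the stock CoNLL-03 NER model; the expected words are illustrative of the style rather than verified output):

```python
from transformers import pipeline

def test_ner_returns_named_entities():
    ner = pipeline(
        "token-classification",
        model="dbmdz/bert-large-cased-finetuned-conll03-english",
        aggregation_strategy="simple",
    )
    entities = ner("My name is Wolfgang and I live in Berlin")
    # Inputs and expected values live right here in the test body.
    assert [entity["word"] for entity in entities] == ["Wolfgang", "Berlin"]
```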
Do not use @overload.
Do not use `self.` for pipeline arguments. We can't do that since pipelines are stateless; we need to use `_sanitize_parameters(...)`. Despite being a bit clunky, it's the only way to accept every arg BOTH in `pipeline(..., myarg=1)` and in `pipe = pipeline(...); pipe(..., myarg=1)`, as sketched below.
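A minimal sketch of the `_sanitize_parameters` pattern (the class and `myarg` are hypothetical; the three returned dicts are routed to `preprocess`, `_forward`, and `postprocess` respectively):

```python
from transformers import Pipeline

class MyPipeline(Pipeline):
    def _sanitize_parameters(self, myarg=None, **kwargs):
        preprocess_kwargs = {}
        if myarg is not None:
            # Works whether myarg was given at pipeline(...) creation time
            # or at call time: both routes funnel through this method.
            preprocess_kwargs["myarg"] = myarg
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs, myarg=1):
        # myarg would be consumed here, e.g. to control chunking.
        return self.tokenizer(inputs, return_tensors=self.framework)

    def _forward(self, model_inputs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs):
        return model_outputs
```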
Thanks for working on this though; it's definitely a nice thing to have, and a difficult one to add, so thanks for tackling it.
Thanks @Narsil for taking a look. Including this feature as a parameter within `TokenClassificationPipeline` sounds sensible; I agree that makes more sense than a separate pipeline, and smaller PRs are a good way to work towards it. Not sure how much time I'll have to dedicate to the changes, but I'll try to get something together when I can.
As for the suggested changes, I'm happy to remove `parameterized` and the use of `self` in those stateless functions.
What's the issue with using `@overload`? It helps with type checking and IDE hints when a function can accept or return multiple types for an argument. Is there an alternative place this type hint should go, like a `.pyi` file?
What's the issue with using `@overload`?
It makes code reading much harder, since you no longer know where a function starts and where it finishes. Also, functions should only have one signature; that makes reasoning about them and calling them much easier. Making functions too dynamic is a real issue and should ideally be contained as much as possible. IDE support is a nice touch, but in my experience it's impossible to satisfy all IDEs/REPLs correctly on actually complicated (not complex) behavior. (And since complicated behavior is also hard to read, my personal take is that we should just avoid it.)
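For context, a minimal example of the `typing.overload` pattern under discussion (the `predict` function is made up): two declared signatures on top of one real implementation, which is exactly the "where does the function start and finish" cost described above.

```python
from typing import List, Union, overload

@overload
def predict(text: str) -> dict: ...
@overload
def predict(text: List[str]) -> List[dict]: ...

def predict(text: Union[str, List[str]]) -> Union[dict, List[dict]]:
    # Single implementation serving both declared signatures.
    if isinstance(text, str):
        return {"input": text}
    return [{"input": t} for t in text]
```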
`pipeline`, for instance, is a magic function, and making all sorts of crazy runtime inferences to figure out what to do is kind of its purpose, since it makes users' lives easy. However, the number of such functions should be as limited as possible.
Pipelines in the past have been way too tolerant, which makes modifying them sometimes painful (everything so far has been done with no regressions, but there's still room for cleaning up inputs and outputs to make the experience cleaner).
Is there an alternative place this type hint should go, like a `.pyi` file?
Type hints go in the signature directly, and if they are not enough on their own, documentation should provide the answer IMHO.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@jammie19 any updates on this PR? I am interested in contributing, and would be happy to help and discuss further.
@vaibhavk97 Hey, sorry, I haven't had a chance to work on it since the last discussion. I agree that the next step is to change the main token classification pipeline to support chunking, so I think the pieces are there in my PR, but it'll need restarting. Happy for you to work on it or help if you're able to.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@Narsil it looks like you and others have now implemented the main part of this in #21771. I can see that the stride is used to reconstruct full sentences, with a nicer implementation than mine 👍
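For anyone landing here later, the feature from #21771 is exposed as a `stride` parameter on the pipeline. A hedged usage sketch (the checkpoint name is the stock CoNLL-03 NER model, and my understanding is that a fast tokenizer and an aggregation strategy are required):

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
    stride=128,  # overlap between chunks, used to stitch entities back together
)

long_text = " ".join(["Angela Merkel lives in Berlin."] * 200)
entities = ner(long_text)  # works even though the text exceeds 512 tokens
```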
You probably inspired the others! :D