Handle texts longer than 512 tokens in BERT token classification pipeline

Open jamescw19 opened this pull request 3 years ago • 12 comments

What does this PR do?

Implementation of a BERT-based token classification pipeline which can truncate texts longer than the max token length (512 or otherwise as specified by the user). Long texts are truncated to the max length, with overflowing tokens shifted to subsequent batch items. These are reconstituted as a post-processing step to return an output of the same shape as the inputs. Entity objects can be reconstituted based on a number of different strategies, which can be selected at pipeline creation time.

Fixes #15177

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @Narsil

jamescw19 avatar Oct 18 '22 19:10 jamescw19

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@Narsil I've updated the test cases as discussed on #15177. These and the docstrings should show how the new pipeline class is intended to work. I'm not sure why the CircleCI tasks are failing to check out the PR branch, since the tests pass on my machine; let me know if this is something I need to change.

jamescw19 avatar Nov 08 '22 22:11 jamescw19

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Dec 04 '22 15:12 github-actions[bot]

@Narsil could this PR be reopened?

jamescw19 avatar Dec 13 '22 16:12 jamescw19

Is this PR just blocked on review, or is something else required? I worked on a similar issue recently and wanted to contribute.

vaibhavk97 avatar Dec 24 '22 18:12 vaibhavk97

It was ready for review but the checks were failing. It looks like I need to update a few things now to get the tests to pass, but otherwise it's good for review.

jamescw19 avatar Dec 24 '22 18:12 jamescw19

Hey, sorry for keeping you in the dark.

We've discussed this internally and I forgot to keep you up-to-date.

Several things came out:

  • This is a very large proposal, unlikely to be merged in its current state (adding a new pipeline is a big burden for maintainers, and this one feels particularly complex as it stands).
  • Adding a parameter option to handle long text is something desirable, but not something to be added at all costs.

There are several ways forward we can recommend:

  • Trimming the code down a lot so it fits manageably within TokenClassificationPipeline. Emphasis on keeping the code small and readable/understandable is important. We could also mark the parameter as experimental for a while, so that if the maintenance burden becomes too big to bear we can move the code elsewhere at a later stage, and keep it in place if it doesn't. Since this route means moving from Pipeline to ChunkPipeline, I would advise at least 2 PRs (one for the Pipeline -> ChunkPipeline change, which cannot change any existing test code, and then one adding the new parameter; see the sketch after this list). I don't want to discourage you, but this is likely to take a bit of time and energy (trying to manage expectations). Ultimately it would be available to everyone, though!

  • Using a remote pipeline: https://huggingface.co/docs/transformers/v4.25.1/en/add_new_pipeline#how-to-create-a-custom-pipeline. That's an easy way to share your work as a starter, and it does not require any review process, so you can get started really fast.
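A rough sketch, under my own naming, of the Pipeline -> ChunkPipeline shape described in the first bullet (not the code that eventually merged): preprocess becomes a generator yielding one model-sized chunk at a time, and postprocess receives the list of per-chunk outputs to stitch back together. The class name and the `is_last` flag follow conventions used by other ChunkPipelines in the library, but everything here is illustrative.

```python
from transformers.pipelines.base import ChunkPipeline


class ChunkedTokenClassificationPipeline(ChunkPipeline):  # hypothetical name
    def _sanitize_parameters(self, stride=None, **kwargs):
        preprocess_params = {"stride": stride} if stride is not None else {}
        return preprocess_params, {}, {}

    def preprocess(self, sentence, stride=0):
        tokenized = self.tokenizer(
            sentence,
            truncation=True,
            max_length=self.tokenizer.model_max_length,
            return_overflowing_tokens=True,  # long text becomes several chunks
            stride=stride,
            padding=True,
            return_tensors=self.framework,
        )
        num_chunks = len(tokenized["input_ids"])
        for i in range(num_chunks):
            model_inputs = {
                k: v[i : i + 1]
                for k, v in tokenized.items()
                if k != "overflow_to_sample_mapping"
            }
            model_inputs["is_last"] = i == num_chunks - 1  # marks the final chunk
            yield model_inputs

    def _forward(self, model_inputs):
        is_last = model_inputs.pop("is_last")
        outputs = self.model(**model_inputs)
        return {"logits": outputs.logits, "is_last": is_last, **model_inputs}

    def postprocess(self, all_outputs, **kwargs):
        # all_outputs is the list of per-chunk _forward results; entity
        # reconstruction across chunk boundaries would happen here.
        ...
```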

There are probably a few changes we should recommend either way:

  • Do not use parameterized. Put the test values within the tests themselves, not outside (same for expected values). In general, remove every possible indirection within tests: the only benefit of a test is to be as readable as possible, so humans can tell whether the test is valid just by reading it.
  • Do not use @overload.
  • Do not use self. for pipeline arguments. We can't do that since pipelines are stateless; we need to use _sanitize_parameters(...) instead. Despite being a bit clunky, it's the only way to accept every arg BOTH in pipeline(..., myarg=1) and in pipe = pipeline(...); pipe(..., myarg=1) (see the sketch below).
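For illustration, a minimal sketch of the _sanitize_parameters pattern referred to above; the `myarg` parameter and `MyPipeline` class are hypothetical. The three returned dicts are routed to preprocess, _forward, and postprocess respectively.

```python
from transformers import Pipeline


class MyPipeline(Pipeline):
    def _sanitize_parameters(self, myarg=None, **kwargs):
        preprocess_params = {}
        if myarg is not None:
            preprocess_params["myarg"] = myarg
        return preprocess_params, {}, {}

    def preprocess(self, inputs, myarg=1):  # myarg lands here either way
        return self.tokenizer(inputs, return_tensors=self.framework)

    def _forward(self, model_inputs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs):
        return model_outputs


# Both call styles end up in preprocess via _sanitize_parameters:
#   pipe = pipeline("my-task", myarg=1)   # at construction time
#   pipe("some text", myarg=1)            # at call time
```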

Thanks for working on this, though; it's definitely a nice thing to have, and a difficult one to add, so thanks for tackling it.

Narsil avatar Dec 26 '22 11:12 Narsil

Thanks @Narsil for taking a look. That sounds sensible regarding including this feature as a parameter within TokenClassificationPipeline; I agree that makes more sense than a separate pipeline, and smaller PRs are the right way to implement it. I'm not sure how much time I'll have to dedicate to the changes, but I'll try to get something together when I can.

As for the suggested changes, I'm happy to remove parameterized and the use of self in those stateless functions. What's the issue with using @overload? It helps with type checking and IDE hints when a function can accept and return multiple types for an argument. Is there an alternative place this type hint should go, like a .pyi file?

jamescw19 avatar Dec 28 '22 11:12 jamescw19

What's the issue with using @overload?

It makes code reading much harder, since you no longer know where a function starts and where it finishes. Also, functions should have only one signature; that makes reasoning about them and calling them much easier. Making functions too dynamic is a real issue and should be contained as much as possible. IDE support is a nice touch, but in my experience it's impossible to satisfy all IDEs/REPLs correctly on genuinely complicated (not complex) behavior. (And since complicated behavior is also hard to read, my personal take is that we should just avoid it.)
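For readers unfamiliar with the decorator under discussion, a generic illustration (not code from this PR) of the @overload pattern: several typing-only stub signatures precede the single runtime implementation.

```python
from typing import List, Union, overload


@overload
def classify(text: str) -> List[dict]: ...
@overload
def classify(text: List[str]) -> List[List[dict]]: ...


def classify(text: Union[str, List[str]]):  # the one actual definition
    ...
```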

pipeline, for instance, is a magic function: making all sorts of crazy runtime inferences about what to do is kind of its purpose, to make users' lives easy. However, the number of such functions should be as limited as possible. Pipelines in the past have been way too tolerant, which makes modifying them sometimes painful (everything has been done with no regressions so far, but there's still room for cleaning up inputs and outputs to make the experience cleaner).

Is there an alternative place this type hint should go, like a .pyi file?

Type hints go in the signature directly, and if they are not enough on their own, documentation should provide the answer IMHO.

Narsil avatar Dec 28 '22 13:12 Narsil

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jan 21 '23 15:01 github-actions[bot]

@jamescw19 any updates on this PR? I am interested in contributing and would be happy to help and discuss further.

vaibhavk97 avatar Jan 26 '23 20:01 vaibhavk97

@vaibhavk97 Hey, sorry, I haven't had a chance to work on it since the last discussion. I agree that the next step is to change the main token classification pipeline to support chunking; the pieces are there in my PR, but it'll need restarting. Happy for you to work on it or help if you're able to.

jamescw19 avatar Jan 26 '23 21:01 jamescw19

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 20 '23 15:02 github-actions[bot]

@Narsil it looks like you and others have now implemented the main part of this in #21771. I can see that the stride is used to reconstruct full sentences, with a nicer implementation than mine 👍
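A usage sketch of the merged behaviour, assuming the stride argument described in #21771 (the checkpoint name here is arbitrary, and stride is used together with an aggregation strategy):

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="first",
    stride=16,  # overlap between chunks, used to re-stitch entities
)
long_text = "Hugging Face is based in New York City. " * 200
print(ner(long_text))  # entities from the whole text, not just the first 512 tokens
```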

jamescw19 avatar May 30 '23 13:05 jamescw19

You probably inspired the others! :D

Narsil avatar Jun 05 '23 14:06 Narsil