
Chunkable token classification pipeline

luccailliau opened this pull request 2 years ago • 31 comments

This PR improves the TokenClassificationPipeline by extending its usage to tokenized texts longer than model_max_length, returning overflowing tokens as chunks rather than truncating texts. To enable this extended feature, you must use a fast tokenizer with an aggregation strategy different from "none" and provide a stride value.
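A minimal usage sketch of the feature, assuming the stride parameter is exposed on the pipeline as described above (model choice and parameter placement are illustrative, not confirmed by this PR's diff):

```python
from transformers import pipeline

# Sketch only: a fast tokenizer, an aggregation_strategy other than "none",
# and a stride value let the pipeline split long inputs into overlapping
# chunks instead of truncating them (parameter placement assumed).
ner = pipeline(
    "token-classification",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
    stride=128,  # number of overlapping tokens between consecutive chunks
)

long_text = " ".join(["Hugging Face is based in New York City."] * 500)
entities = ner(long_text)  # each chunk runs through the model, then results are merged
```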

What does this PR do?

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

luccailliau avatar Feb 23 '23 23:02 luccailliau

The documentation is not available anymore as the PR was closed or merged.

For whoever is reading this, I will quickly correct the alignment issues, which concern tokens removed due to the special tokens mask :)

luccailliau avatar Feb 24 '23 00:02 luccailliau

cc @Narsil I'll let you review and decide whether it would be more suitable as code on the Hub or needs to be in Transformers.

sgugger avatar Feb 24 '23 08:02 sgugger

Thank you for this PR. It looks promising.

is now able to process sequences longer than 512.

Do you have a specific model in mind? 512 seems oddly specific.

whether it would be more suitable as code on the Hub or needs to be in Transformers.

I think it really depends on the complexity of the resulting code, and the crossover with other parameters.

@LucCailliau the tests aren't passing yet. I don't want to do a full review before the tests are passing.

Some notes:

  • Existing tests cannot be modified and must be passing
  • Overall code should be relatively simple (the current state looks ok on the surface).
  • The most complex part (I think it should be the conflict resolution in overlapping parts) must be as clear as possible. It's ok to put it into another function for special handling.
  • Unless it causes an explosion in complexity, it should work with all aggregation_strategy values. The minimum are NONE and SIMPLE. If it causes an explosion in complexity, we need to forbid the unsupported combinations
  • We need good tests for this feature:
    • Make sure it actually solves what this PR sets out to do (handling inputs longer than model_max_length). (Both a slow and a fast test.)
    • Check with aggregation_strategy parameters.
    • Check the errors raised if some parameter combinations are forbidden.

Does this make sense?

For the sake of getting this moving forward faster, I actually suggest splitting this over several PRs.

The first PR should be the move to ChunkPipeline, and nothing else should be modified. Then a second PR adds the new parameter. We would need the second one to be close enough to a good state to merge the first (there's no point in changing the pipeline type if we're not able to support this process_all parameter correctly).

Don't hesitate to ask more questions should you need to.

Narsil avatar Feb 24 '23 09:02 Narsil

@Narsil, all tests passed except the code quality check. I used black but it doesn't pass. I also updated the schema above to explain the algorithm for updating/aggregating scores.

luccailliau avatar Feb 24 '23 15:02 luccailliau

@Narsil, all tests passed except the code quality check. I used black but it doesn't pass. I also updated the schema above to explain the algorithm for updating/aggregating scores.

Try pip install -e .[quality] && make fixup to use the correct black version.

Narsil avatar Feb 24 '23 16:02 Narsil

@Narsil, wait, just one more thing to add and we can go

luccailliau avatar Feb 27 '23 07:02 luccailliau

@Narsil, I checked everything on my side, we can go for it :)

luccailliau avatar Feb 27 '23 09:02 luccailliau

@Narsil,

  • I updated the documentation as mentioned
  • Updated sanitize_parameters; you just have to provide stride now
  • I left the forward method unchanged, as we pass only the tokenizer inputs to the model
  • Corrected spacing for readability
  • I also provided an example above comparing the current implementation with this one

What do you think about it?

luccailliau avatar Feb 27 '23 16:02 luccailliau

@Narsil, finally, it is better not to update the scores but to merge results after entity aggregation. Each chunk is passed through the model in its entirety, and then we aggregate the results.

With the sentence:

  • [...] Paris Hilton [...]

We have the corresponding chunks:

  • [...] Paris
  • [...] Paris Hilton [...]
  • Hilton [...]

With the following entities:

  • Paris -> LOC
  • Paris Hilton -> PER
  • Hilton -> ORG

The first step consists of merging results backward, and the entities become:

  • Paris -> LOC
  • Paris Hilton -> PER

The last step merges results forward to get the desired entity:

  • Paris Hilton -> PER

If we find different entities at the same start/end index, we take the longest one, and if lengths are equal, we take the one with the highest score. This approach is clearly better. All tests pass.
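A rough sketch of that selection rule in isolation (illustrative only, not the PR's actual code; entity dicts follow the pipeline's usual output format, and pick_best_entity is a hypothetical helper name):

```python
def pick_best_entity(candidates):
    """Among overlapping candidate entities, keep the longest span;
    break ties by the highest score. (Hypothetical helper, for illustration.)"""
    return max(candidates, key=lambda e: (e["end"] - e["start"], e["score"]))

# The "Paris Hilton" example above, as overlapping candidates from two chunks:
candidates = [
    {"entity_group": "LOC", "word": "Paris",        "score": 0.80, "start": 10, "end": 15},
    {"entity_group": "PER", "word": "Paris Hilton", "score": 0.95, "start": 10, "end": 22},
    {"entity_group": "ORG", "word": "Hilton",       "score": 0.70, "start": 16, "end": 22},
]
print(pick_best_entity(candidates))  # -> the "Paris Hilton" / PER entity
```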

I'll start creating a validation set to get results for what we've done.

luccailliau avatar Feb 27 '23 23:02 luccailliau

I'll start creating a validation set to get results for what we've done.

Sounds great! Again, don't hesitate to ask for resources for larger runs.

Narsil avatar Feb 28 '23 08:02 Narsil

I'll start creating a validation set to get results for what we've done.

Sounds great! Again, don't hesitate to ask for resources for larger runs.

Great, do you have a specific model and/or dataset to perform our tests on? I am also interested in resources.

The tests must be done with aggregation_strategy set to "simple", "first", "average", and "max". If "none" is set, we can get the following for the sentence "We went to Manhattan Burgers, it was awesome!":

  • "Manhattan" -> "B-LOC"
  • "Burgers" -> "I-ORG"

And here again we could update scores as we did previously, but this is not perfect. Corrections can only be applied if aggregation_strategy is different from "none".
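To illustrate the problem, here are hand-written, hypothetical raw outputs for that sentence (not produced by any real model run):

```python
# Hypothetical aggregation_strategy="none" outputs in the overlap region of
# two chunks: the per-token IOB tags disagree, and there are no whole
# entities to compare, so a longest-span/highest-score rule has nothing to apply to.
chunk_1_tokens = [{"word": "Manhattan", "entity": "B-LOC", "score": 0.91, "start": 11, "end": 20}]
chunk_2_tokens = [{"word": "Burgers",   "entity": "I-ORG", "score": 0.55, "start": 21, "end": 28}]
# Naively stitching these together yields an inconsistent span
# (B-LOC followed by I-ORG), hence corrections require aggregation.
```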

luccailliau avatar Feb 28 '23 09:02 luccailliau

@Narsil, I finished the tests for this pipeline and the results are convincing: it improves on the current implementation.

A summary of the PR:

This PR improves the TokenClassificationPipeline by extending its usage to tokenized texts longer than model_max_length, returning overflowing tokens as chunks rather than truncating texts. To enable this extended feature, you must use a fast tokenizer with an aggregation strategy different from "none" and provide a stride value.

Approaches:

Two different approaches were experimented with:

  1. Updating tokens scores in overlapping parts.
  2. Aggregating entities for all chunks, no matter the overlapping parts.

The first approach, which consists of updating scores with the highest one, is not the best. A more "confident" token prediction doesn't mean it is more likely to be correct; adversarial attacks show that plainly.

The second approach (the selected one) consists of processing each chunk and aggregating entities regardless of the overlapping parts. In the final aggregation step, we select the best entity in overlapping parts with a rule: we first take the longest entity, and if entities have the same length, we take the entity with the highest score. Note that selecting the best entity first by its length, then by the highest score (if lengths are equal), gives better results than just taking the highest score.

Example: given the following entities from their respective chunks:

  • "New York" -> "LOC" (from chunk no. 1)
  • "New York City" -> "LOC" (from chunk no. 2)

The remaining entity after aggregation will be "New York City" -> "LOC".

Results

In order to compare the current implementation with this one, we generated labeled text from the conll2003 dataset (available on the Hub), then compared, only in the first chunk, the number of exact matches and wrong matches for each implementation. You can download the notebook as HTML: token_classification_comparison.zip

We have, in 378 texts (with more than 1 chunk):

aggregation_strategy="simple"

  • 12862 exact matches and 2042 wrong matches for the proposed implementation
  • 12739 exact matches and 2181 wrong matches for the current implementation

aggregation_strategy="first"

  • 13083 exact matches and 1415 wrong matches for the proposed implementation
  • 12984 exact matches and 1478 wrong matches for the current implementation

aggregation_strategy="average"

  • 13009 exact matches and 1390 wrong matches for the proposed implementation
  • 12921 exact matches and 1436 wrong matches for the current implementation

aggregation_strategy="max"

  • 13037 exact matches and 1399 wrong matches for the proposed implementation
  • 12944 exact matches and 1453 wrong matches for the current implementation

Implementation

We only changed the implementation from Pipeline to ChunkPipeline. The underlying tests for this pipeline remain the same, as the functions are unchanged from the previous implementation. Each chunk is processed individually. Entities are aggregated in a new function called aggregate_entities.
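For intuition, the chunking itself leans on the fast tokenizer's overflow support; here is a sketch of the underlying mechanism (assumed, not the PR's exact preprocess code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # fast tokenizer

text = "Hugging Face is a French company based in New York City. " * 50
encoding = tokenizer(
    text,
    truncation=True,
    max_length=10,                # tiny, to force chunking
    stride=5,                     # overlap between consecutive chunks
    return_overflowing_tokens=True,
    return_offsets_mapping=True,  # needed to map tokens back to character spans
)
# With return_overflowing_tokens=True, input_ids holds one row per chunk;
# a ChunkPipeline yields one model input per chunk instead of truncating.
print(len(encoding["input_ids"]))
```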

luccailliau avatar Mar 03 '23 15:03 luccailliau

I haven't forgotten this PR; it seems to have some external interest.

I wanted to dedicate some solid time to a proper review and haven't had much. I'm looking at it tomorrow.

Thanks for your work!

Narsil avatar Mar 09 '23 17:03 Narsil

And here are more complete logs so you can inspect the edge cases a bit more if you want:

wikiann all languages x top 5 token classification models

Overall, it seems good enough to me to add.

Narsil avatar Mar 10 '23 18:03 Narsil

Do you agree with the proposed test for actually checking how it performs on stride conflicts?

It's a good approach too. It makes more sense, since concatenating sentences can lead to new, randomly generated sentences. However, with this new script, we need to be careful about the following:

  • most sentences in datasets fit in a sequence of length 50 (uninteresting is incremented when that's the case)
  • we take as references the outputs of the pipeline without chunking+striding, not the references from the dataset itself
  • a single undetected entity in a sentence with the chunking+striding pipeline will cause an entity offset and lead to wrong results

luccailliau avatar Mar 11 '23 14:03 luccailliau

  • There's optional printing to see the different tokens. I looked at it and didn't see anything shocking: some spans are a bit different, some groups are different, and without context it's hard to tell which is correct.

I noticed the same behavior with the previous tests. In our case, the aim is not to apply corrections to predictions but to aggregate entities. If the model is 100% correct, the aggregated entities will also be 100% correct.

luccailliau avatar Mar 11 '23 14:03 luccailliau

And here are more complete logs so you can inspect the edge cases a bit more if you want:

I think you forgot to link the results

wikiann all languages x top 5 token classification models

Overall, it seems good enough to me to add.

Yes, with the previous tests and with manual tests on different content, it's working as expected.

luccailliau avatar Mar 11 '23 14:03 luccailliau

I refactored the code as you mentioned. All is good on my side.

luccailliau avatar Mar 11 '23 14:03 luccailliau

Oops: https://gist.github.com/Narsil/e8609805e8e52c7e4114586eede8a481

Narsil avatar Mar 13 '23 10:03 Narsil

Oops: https://gist.github.com/Narsil/e8609805e8e52c7e4114586eede8a481

Good results!

luccailliau avatar Mar 13 '23 12:03 luccailliau

@Narsil, I updated the comments in the Files Changed section to give a better explanation of the aggregate_overlapping_entities() method. I don't know if you received a notification for that. Is it good for you?

luccailliau avatar Mar 17 '23 09:03 luccailliau

I didn't, but we're still missing some tests.

I offered to help write them if you want, but we really want tests to showcase this feature (and make sure it doesn't break later).

The most important will be small tests (with tiny models) so that they always run, showcasing what's happening on simple strings while setting model_max_length to something super tiny to force the chunking to happen.

Narsil avatar Mar 17 '23 11:03 Narsil

I thought it was for @sgugger, since you already created a script that shows it works. @Narsil, your help is welcome. I have in mind to create specific tests with manually checked references to ensure it doesn't break later.

luccailliau avatar Mar 17 '23 11:03 luccailliau

No, I pinged him because the code was clean enough to be looked at, but the tests (especially for such a complex feature) are mandatory.

If you can get started that would be immensely appreciated, but please share if you're having a hard time or don't have the bandwidth to do it. It shouldn't take me that much time but it's always better if you can write the tests yourself.

Narsil avatar Mar 17 '23 11:03 Narsil

Great, I'll do it

luccailliau avatar Mar 17 '23 11:03 luccailliau

@Narsil, I finished the tests. The selected model is elastic/distilbert-base-uncased-finetuned-conll03-english (pytorch_model.bin of size 266MB). It is not a tiny model as you recommended. I tried different tiny models but, unfortunately, they don't perform well on our example. In practice, this is not noticeable when running, since it takes between 2s and 3s to run all the new tests.

The tests are composed of two parts:

  1. test_chunking(): tests the pipeline with all aggregation strategies and matches the output against a hard-coded reference
  2. test_regular_chunk(): tests the pipeline with all aggregation strategies and matches the output (without scores) of the regular run (without chunking+striding) against the chunked run (with chunking+striding)

The second test, test_regular_chunk(), can be seen as optional, since the first, test_chunking(), was created with the same rule: the regular output matches the chunked output. But even though the regular output should not be interpreted as a ground-truth reference, in this case it is good to show that we didn't create a flawed test.

The selected parameters for these tests are (a sketch of the comparison test follows the list):

  • model_max_length=10: extremely tiny, to increase the difficulty
  • stride=5: quite large (relative to model_max_length), to generate multiple chunks (15 chunks in our example)
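A sketch of the comparison idea behind test_regular_chunk() (names and values are illustrative, not the PR's actual test code; in particular, passing stride at pipeline init is an assumption about how this PR exposes it):

```python
from transformers import pipeline

model_id = "elastic/distilbert-base-uncased-finetuned-conll03-english"
text = "Hugging Face, Inc. is a French company based in New York City."

# Regular run: the sentence fits in the model's context, no chunking needed.
regular_ner = pipeline("token-classification", model=model_id,
                       aggregation_strategy="first")

# Chunked run: same model, but chunking is forced with a tiny max length.
chunked_ner = pipeline("token-classification", model=model_id,
                       aggregation_strategy="first", stride=5)
chunked_ner.tokenizer.model_max_length = 10

regular = regular_ner(text)
chunked = chunked_ner(text)
# Per the PR's claim, the chunked output (ignoring scores) should match
# the regular output on a simple sentence like this one.
assert [e["word"] for e in chunked] == [e["word"] for e in regular]
```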

You can also find below the output without aggregate_overlapping_entities() (sorted by "start" for readability) to see how the tests cover the different overlapping cases with the text: "Hugging Face, Inc. is a French company that develops tools for building applications using machine learning. The company, based in New York City was founded in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and Thomas Wolf."

[image: table of entities before aggregate_overlapping_entities(), sorted by "start"]

And below, the same output with aggregate_overlapping_entities():

[image: table of entities after aggregate_overlapping_entities()]

Is it good for you?

luccailliau avatar Mar 19 '23 14:03 luccailliau

Hello @Narsil, I don't know if you didn't receive the update for the tests or if you just haven't had time to look at it. Do not hesitate to reach out if you need additional work to be done.

luccailliau avatar Mar 22 '23 12:03 luccailliau

@luccailliau I took the liberty of directly adding the small test I had in mind and pushing the other 2 tests to the slow tests (since they use real models). They are good tests btw!

Great! Yes, for sure, it is not good to use a real model. I also looked at the hf-internal-testing models but didn't find hf-internal-testing/tiny-bert-for-token-classification.

luccailliau avatar Mar 22 '23 16:03 luccailliau

@Narsil, thank you very much for your help and your time on this PR; it was a pleasure! @sgugger, I applied your changes, ready to go.

luccailliau avatar Mar 22 '23 17:03 luccailliau