
feat: Updated EntityExtractor to handle long texts and added better postprocessing

Open · sjrl opened this issue 2 years ago • 7 comments

Related Issues

  • Addresses part of #2969, namely the postprocessing of results and the ability to handle long text documents without truncation.

Proposed Changes:

This PR updates a number of aspects of the EntityExtractor node in Haystack.

  • Updated the default NER model to one of similar size but with better metrics: https://huggingface.co/elastic/distilbert-base-cased-finetuned-conll03-english
  • Added the aggregation_strategy option (default first), which mitigates some of the issues identified in issue #1706; more explanation is provided there.
  • Moved away from using the TokenClassificationPipeline provided by Hugging Face. This resulted in the largest changes because the HF pipeline silently truncated all document texts passed to it. Instead, following the approach of the Reader node, I added functionality to split long texts and feed each split into the model individually (or in batches). Afterward, I recombine the splits, grouped by the document they originally came from (see the sketch after this list).
  • Added the option flatten_entities_in_meta_data so the entities can be stored in the metadata in a form that can be used by the OpenSearch document store.
  • Added a test for using the EntityExtractor in an indexing pipeline.
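
For illustration, a minimal, hypothetical sketch of the general split-and-recombine idea (not the code from this PR; the model choice and function names are just for the example) using the transformers API: long documents are tokenized into overflowing chunks, each chunk is run through the model, and the chunk-level predictions are regrouped by source document.

```python
from collections import defaultdict
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "elastic/distilbert-base-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

def extract_per_document(texts, max_seq_len=256):
    # Tokenize with overflow so long documents become several chunks instead of
    # being silently truncated; overflow_to_sample_mapping remembers which
    # document each chunk came from.
    enc = tokenizer(
        texts,
        truncation=True,
        max_length=max_seq_len,
        return_overflowing_tokens=True,
        padding=True,
        return_tensors="pt",
    )
    sample_mapping = enc.pop("overflow_to_sample_mapping")
    with torch.no_grad():
        logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits
    label_ids = logits.argmax(dim=-1)

    # Regroup the chunk-level predictions by the document they originally came from.
    per_doc = defaultdict(list)
    for chunk_idx, doc_idx in enumerate(sample_mapping.tolist()):
        per_doc[doc_idx].append(label_ids[chunk_idx])
    return per_doc
```

The actual implementation in the PR also handles batching and the postprocessing of the grouped predictions; the sketch only shows the bookkeeping that maps chunks back to documents.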

How did you test it?

I made sure that the current tests for the EntityExtractor node still pass.

Notes for the reviewer

  • [x] Need to test EntityExtractor node in an indexing pipeline


sjrl · Sep 05 '22 10:09

@sjrl this looks great! Is it fully compatible with the run and run_batch results of the previous EntityExtractor?

vblagoje · Sep 12 '22 13:09

> @sjrl this looks great! Is it fully compatible with the run and run_batch results of the previous EntityExtractor?

Yes! I checked that the output of the two methods is consistent. And the original tests for both run and run_batch in test/nodes/test_extractor.py still pass with the new code.

However, the old tests only covered running the EntityExtractor in a query pipeline, so I would like to add tests for an indexing pipeline as well.

sjrl · Sep 13 '22 06:09

@vblagoje Should I label this PR with breaking change since I've changed the default behavior of the EntityExtractor? Specifically, I've changed the default model and I have changed the default aggregation_strategy (it used to be simple by default).

sjrl · Sep 13 '22 06:09

> @vblagoje Should I label this PR with breaking change since I've changed the default behavior of the EntityExtractor? Specifically, I've changed the default model and I have changed the default aggregation_strategy (it used to be simple by default).

To be honest, I am not 100% sure. Let's ask @masci what exactly we can change between minor releases.

vblagoje · Sep 13 '22 08:09

@sjrl it's breaking only if we change the API, not the implementation. Would I be able to run a script (or a pipeline) written for Haystack 1.8.0 without touching the code after this change?

masci · Sep 13 '22 08:09

> @sjrl it's breaking only if we change the API, not the implementation. Would I be able to run a script (or a pipeline) written for Haystack 1.8.0 without touching the code after this change?

Yes, you would be able to, since all newly added parameters to the API have default values, so code written for Haystack 1.8.0 should still work.

And the run and run_batch methods both work on the same data types as before and return outputs in the same format.

sjrl · Sep 13 '22 09:09

It seems that the predictions of the node are a bit off, as they are very long. So the conversion of the labels should be checked again.

ju-gu · Sep 20 '22 07:09

> It seems that the predictions of the node are a bit off, as they are very long. So the conversion of the labels should be checked again.

Converted back to a draft because we identified that Hugging Face does not handle NER models with SentencePiece tokenizers very well. I will be adding future commits to address this.

sjrl · Sep 30 '22 19:09

Hey, @vblagoje I would appreciate a review of the new code since your last review. I wrote a description of the new changes under the Update: header in the PR description (also reproduced below).

Update:

  • Added a new option pre_split_text so the user can optionally split all text by whitespace before passing it to the token classification model. This is common practice for NER pipelines, but it is not supported out of the box by Hugging Face. As a result, more functionality was added to handle the post-processing of the model predictions when the text is pre-split into words. Namely, we determine word boundaries using the self.tokenizer.word_ids method and update the character spans of the detected entities to correctly map back to the original (unsplit) text (see the sketch after this list).
  • Added new optional parameter ignore_labels to allow users to specify what labels they would like to ignore.
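
To make the word-boundary bookkeeping concrete, here is a small, hypothetical sketch (not the PR's code; the text and the stand-in predictions are invented for the example): the text is split on whitespace, encoded with is_split_into_words=True, and word_ids() plus the recorded word offsets are used to map token-level tags back to a character span in the original, unsplit text.

```python
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("elastic/distilbert-base-cased-finetuned-conll03-english")

text = "Angela Merkel visited Paris last week."

# Pre-split on whitespace and remember each word's start offset in the original text.
words, word_starts = [], []
for match in re.finditer(r"\S+", text):
    words.append(match.group())
    word_starts.append(match.start())

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
word_ids = enc.word_ids(batch_index=0)  # token index -> source word index (None for special tokens)

# Stand-in predictions: pretend the model tagged the tokens of the first two words as PER.
labels = ["PER" if wid in (0, 1) else "O" for wid in word_ids]

# Map the tagged tokens back to a character span in the original, unsplit text.
tagged_words = sorted({wid for wid, lab in zip(word_ids, labels) if lab == "PER"})
start = word_starts[tagged_words[0]]
end = word_starts[tagged_words[-1]] + len(words[tagged_words[-1]])
print(text[start:end])  # Angela Merkel
```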

sjrl · Oct 07 '22 11:10

When running the NER node (and the unit tests), this warning from HF appears:

```
Process finished with exit code 0
PASSED [100%]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Extracting entities: 100%|██████████| 3/3 [00:00<00:00, 35.96it/s]
```

This stems from HF being worried that we call the fast tokenizer before creating a PyTorch DataLoader. According to the second answer in this Stack Overflow post, this should be fine as long as the fast tokenizer is not called from within the DataLoader (excerpt from the answer):

> Alternatively, convert your data to tokens beforehand and store them in a dict. Then your dataset should not use the tokenizer at all but, during runtime, simply call dict(key) where key is the index. This way you avoid conflict. The warning still comes, but you simply don't use the tokenizer during training any more (note: for such scenarios, to save space, avoid padding during tokenization and add it later with collate_fn).

As recommended, we store the tokens in a dict and pass that to the DataLoader.
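
As an illustration of that workaround (a minimal sketch, not the PR's actual code; the model and example texts are placeholders), the texts are tokenized once up front and the DataLoader only indexes the stored tensors, so no worker process touches the fast tokenizer after the fork:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("elastic/distilbert-base-cased-finetuned-conll03-english")
texts = ["Berlin is the capital of Germany.", "Apple was founded by Steve Jobs."]

# Tokenize everything once, before the DataLoader (and any fork) exists.
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

class PreTokenizedDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return self.encodings["input_ids"].shape[0]

    def __getitem__(self, idx):
        # Only index the stored tensors here; the tokenizer is never called.
        return {key: tensor[idx] for key, tensor in self.encodings.items()}

loader = DataLoader(PreTokenizedDataset(encodings), batch_size=2, num_workers=2)
for batch in loader:
    print(batch["input_ids"].shape)
```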

sjrl · Oct 12 '22 10:10

@sjrl Why not add a unit test for a document whose length is larger than the tokenizer's max_seq_len, so the driving force behind this PR is captured in the unit tests as well? Not sure if we already have such a doc somewhere in the unit tests, but if not we can provide one in the unit test itself.

vblagoje · Oct 12 '22 10:10

> @sjrl Why not add a unit test for a document whose length is larger than the tokenizer's max_seq_len, so the driving force behind this PR is captured in the unit tests as well? Not sure if we already have such a doc somewhere in the unit tests, but if not we can provide one in the unit test itself.

Ah, I did do this by setting max_seq_len=6 for a few of the unit tests. So instead of a long doc, I just shortened the amount of text the model can ingest in each step.
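
Roughly along these lines (a hedged sketch, not the test as merged; the exact parameter names and the entities metadata key are assumptions based on this thread):

```python
from haystack.nodes import EntityExtractor
from haystack.schema import Document

def test_extract_entities_with_tiny_max_seq_len():
    # A tiny max_seq_len forces the node to split even a short document,
    # standing in for a document longer than the model's context window.
    extractor = EntityExtractor(
        model_name_or_path="elastic/distilbert-base-cased-finetuned-conll03-english",
        max_seq_len=6,
    )
    docs = [Document(content="My name is Carla and I live in Berlin.")]
    output, _ = extractor.run(documents=docs)
    entities = output["documents"][0].meta["entities"]
    assert any(entity["word"] == "Carla" for entity in entities)
```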

sjrl · Oct 12 '22 11:10

  • [x] Add a unit test using Jean-Baptiste/camembert-ner to check that the "first" aggregation method now works for it (it did not before); a sketch follows below.
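
A hedged sketch of what such a test might look like (parameter names and the entities metadata key are assumptions based on this thread, not the merged test):

```python
from haystack.nodes import EntityExtractor
from haystack.schema import Document

def test_first_aggregation_with_sentencepiece_tokenizer():
    # Jean-Baptiste/camembert-ner uses a SentencePiece tokenizer, which the
    # "first" aggregation strategy previously could not handle.
    extractor = EntityExtractor(
        model_name_or_path="Jean-Baptiste/camembert-ner",
        aggregation_strategy="first",
    )
    docs = [Document(content="Emmanuel Macron habite à Paris.")]
    output, _ = extractor.run(documents=docs)
    assert output["documents"][0].meta["entities"]  # at least one entity found
```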

sjrl · Oct 12 '22 11:10