
feat: Updated EntityExtractor to handle long texts and added better postprocessing

Open · sjrl opened this issue 2 years ago • 7 comments

Related Issues

  • Addresses part of #2969, namely the postprocessing of results and the ability to handle long text documents without truncation.

Proposed Changes:

This PR updates a number of aspects of the EntityExtractor node in Haystack.

  • Updated the default NER model to one of similar size but with better metrics: https://huggingface.co/elastic/distilbert-base-cased-finetuned-conll03-english
  • Added the aggregation_strategy option (default first), which mitigates some of the issues identified in issue #1706; more explanation is provided there.
  • Moved away from using the TokenClassificationPipeline provided by Hugging Face. This resulted in the largest changes because the HF pipeline silently truncated all document texts passed to it. Instead, following the approach of the Reader node, I added functionality to split long texts and feed each split into the model individually (or in batches). Afterward, I recombine the splits, grouped by the document they originally came from (see the sketch after this list).
  • Added the option flatten_entities_in_meta_data so the entities can be stored in the metadata in a form that can be used by the OpenSearch document store.
  • Added a test for using the EntityExtractor in an indexing pipeline.
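
For illustration, a minimal, hypothetical sketch of the general split-and-recombine idea (not the code from this PR; the model choice and function names are just for the example) using the transformers API: long documents are tokenized into overflowing chunks, each chunk is run through the model, and the chunk-level predictions are regrouped by source document.

```python
from collections import defaultdict
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "elastic/distilbert-base-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

def extract_per_document(texts, max_seq_len=256):
    # Tokenize with overflow so long documents become several chunks instead of
    # being silently truncated; overflow_to_sample_mapping remembers which
    # document each chunk came from.
    enc = tokenizer(
        texts,
        truncation=True,
        max_length=max_seq_len,
        return_overflowing_tokens=True,
        padding=True,
        return_tensors="pt",
    )
    sample_mapping = enc.pop("overflow_to_sample_mapping")
    with torch.no_grad():
        logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits
    label_ids = logits.argmax(dim=-1)

    # Regroup the chunk-level predictions by the document they originally came from.
    per_doc = defaultdict(list)
    for chunk_idx, doc_idx in enumerate(sample_mapping.tolist()):
        per_doc[doc_idx].append(label_ids[chunk_idx])
    return per_doc
```

The actual implementation in the PR also handles batching and the postprocessing of the grouped predictions; the sketch only shows the bookkeeping that maps chunks back to documents.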

How did you test it?

I made sure that the current tests for the EntityExtractor node still pass.

Notes for the reviewer

  • [x] Need to test EntityExtractor node in an indexing pipeline


sjrl · Sep 05 '22 10:09

@sjrl this looks great! Is it fully compatible with the run and run_batch results of the previous EntityExtractor?

vblagoje · Sep 12 '22 13:09

> @sjrl this looks great! Is it fully compatible with the run and run_batch results of the previous EntityExtractor?

Yes! I checked that the output of the two methods is consistent. And the original tests for both run and run_batch in test/nodes/test_extractor.py still pass with the new code.

However, the old tests only covered running the EntityExtractor in a query pipeline, so I would like to add tests for an indexing pipeline as well.

sjrl · Sep 13 '22 06:09

@vblagoje Should I label this PR with breaking change since I've changed the default behavior of the EntityExtractor? Specifically, I've changed the default model and I have changed the default aggregation_strategy (it used to be simple by default).

sjrl · Sep 13 '22 06:09

> @vblagoje Should I label this PR with breaking change since I've changed the default behavior of the EntityExtractor? Specifically, I've changed the default model and I have changed the default aggregation_strategy (it used to be simple by default).

To be honest, I am not 100% sure. Let's ask @masci what exactly we can change between minor releases.

vblagoje · Sep 13 '22 08:09

@sjrl it's breaking only if we change the API, not the implementation. Would I be able to run a script (or a pipeline) written for Haystack 1.8.0 without touching the code after this change?

masci · Sep 13 '22 08:09

> @sjrl it's breaking only if we change the API, not the implementation. Would I be able to run a script (or a pipeline) written for Haystack 1.8.0 without touching the code after this change?

Yes, you would be able to, since all newly added parameters to the API have default values, so code written for Haystack 1.8.0 should still work.

And the run and run_batch methods both work on the same data types as before and return outputs in the same format.

sjrl · Sep 13 '22 09:09

It seems that the predictions of the node are a bit off, as they are very long. So the conversion of the labels should be checked again.

ju-gu · Sep 20 '22 07:09

> It seems that the predictions of the node are a bit off, as they are very long. So the conversion of the labels should be checked again.

Converted back to a draft because we identified that Hugging Face does not handle NER models with SentencePiece tokenizers very well. I will be adding future commits to address this.

sjrl · Sep 30 '22 19:09

Hey, @vblagoje I would appreciate a review of the new code since your last review. I wrote a description of the new changes under the Update: header in the PR description (also reproduced below).

Update:

  • Added a new option pre_split_text so the user can optionally split all text by whitespace before passing it to the token classification model. This is common practice for NER pipelines, but it is not supported out of the box by Hugging Face. As a result, more functionality was added to handle the post-processing of the model predictions when the text is pre-split into words. Namely, we determine word boundaries using the self.tokenizer.word_ids method and update the character spans of the detected entities to correctly map back to the original (unsplit) text (see the sketch after this list).
  • Added new optional parameter ignore_labels to allow users to specify what labels they would like to ignore.
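
To make the word-boundary bookkeeping concrete, here is a small, hypothetical sketch (not the PR's code; the text and the stand-in predictions are invented for the example): the text is split on whitespace, encoded with is_split_into_words=True, and word_ids() plus the recorded word offsets are used to map token-level tags back to a character span in the original, unsplit text.

```python
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("elastic/distilbert-base-cased-finetuned-conll03-english")

text = "Angela Merkel visited Paris last week."

# Pre-split on whitespace and remember each word's start offset in the original text.
words, word_starts = [], []
for match in re.finditer(r"\S+", text):
    words.append(match.group())
    word_starts.append(match.start())

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
word_ids = enc.word_ids(batch_index=0)  # token index -> source word index (None for special tokens)

# Stand-in predictions: pretend the model tagged the tokens of the first two words as PER.
labels = ["PER" if wid in (0, 1) else "O" for wid in word_ids]

# Map the tagged tokens back to a character span in the original, unsplit text.
tagged_words = sorted({wid for wid, lab in zip(word_ids, labels) if lab == "PER"})
start = word_starts[tagged_words[0]]
end = word_starts[tagged_words[-1]] + len(words[tagged_words[-1]])
print(text[start:end])  # Angela Merkel
```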

sjrl · Oct 07 '22 11:10

When running the NER node (and the unit tests), this warning from HF appears:

```
Process finished with exit code 0
PASSED [100%]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Extracting entities: 100%|██████████| 3/3 [00:00<00:00, 35.96it/s]
```

This stems from HF being worried that we call the fast tokenizer before creating a PyTorch DataLoader. According to the second answer in this Stack Overflow post, this should be fine as long as the fast tokenizer is not called from within the DataLoader (excerpt from the answer):

> Alternatively, convert your data to tokens beforehand and store them in a dict. Then your dataset should not use the tokenizer at all but, during runtime, simply call dict(key) where key is the index. This way you avoid conflict. The warning still comes, but you simply don't use the tokenizer during training any more (note: for such scenarios, to save space, avoid padding during tokenization and add it later with collate_fn).

As recommended, we store the tokens in a dict and pass that to the DataLoader.
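
As an illustration of that workaround (a minimal sketch, not the PR's actual code; the model and example texts are placeholders), the texts are tokenized once up front and the DataLoader only indexes the stored tensors, so no worker process touches the fast tokenizer after the fork:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("elastic/distilbert-base-cased-finetuned-conll03-english")
texts = ["Berlin is the capital of Germany.", "Apple was founded by Steve Jobs."]

# Tokenize everything once, before the DataLoader (and any fork) exists.
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

class PreTokenizedDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return self.encodings["input_ids"].shape[0]

    def __getitem__(self, idx):
        # Only index the stored tensors here; the tokenizer is never called.
        return {key: tensor[idx] for key, tensor in self.encodings.items()}

loader = DataLoader(PreTokenizedDataset(encodings), batch_size=2, num_workers=2)
for batch in loader:
    print(batch["input_ids"].shape)
```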

sjrl · Oct 12 '22 10:10

@sjrl Why not add a unit test for a document whose length is larger than the tokenizer's max_seq_len, so the driving force behind this PR is captured in the unit tests as well? Not sure if we already have such a doc somewhere in the unit tests, but if not we can provide one in the unit test itself.

vblagoje · Oct 12 '22 10:10

> @sjrl Why not add a unit test for a document whose length is larger than the tokenizer's max_seq_len, so the driving force behind this PR is captured in the unit tests as well? Not sure if we already have such a doc somewhere in the unit tests, but if not we can provide one in the unit test itself.

Ah, I did do this by setting max_seq_len=6 for a few of the unit tests. So instead of a long doc, I just shortened the amount of text the model can ingest in each step.
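
Roughly along these lines (a hedged sketch, not the test as merged; the exact parameter names and the entities metadata key are assumptions based on this thread):

```python
from haystack.nodes import EntityExtractor
from haystack.schema import Document

def test_extract_entities_with_tiny_max_seq_len():
    # A tiny max_seq_len forces the node to split even a short document,
    # standing in for a document longer than the model's context window.
    extractor = EntityExtractor(
        model_name_or_path="elastic/distilbert-base-cased-finetuned-conll03-english",
        max_seq_len=6,
    )
    docs = [Document(content="My name is Carla and I live in Berlin.")]
    output, _ = extractor.run(documents=docs)
    entities = output["documents"][0].meta["entities"]
    assert any(entity["word"] == "Carla" for entity in entities)
```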

sjrl · Oct 12 '22 11:10

  • [x] Add a unit test using Jean-Baptiste/camembert-ner to check that the "first" aggregation method now works for it (it did not before); a sketch follows below.
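
A hedged sketch of what such a test might look like (parameter names and the entities metadata key are assumptions based on this thread, not the merged test):

```python
from haystack.nodes import EntityExtractor
from haystack.schema import Document

def test_first_aggregation_with_sentencepiece_tokenizer():
    # Jean-Baptiste/camembert-ner uses a SentencePiece tokenizer, which the
    # "first" aggregation strategy previously could not handle.
    extractor = EntityExtractor(
        model_name_or_path="Jean-Baptiste/camembert-ner",
        aggregation_strategy="first",
    )
    docs = [Document(content="Emmanuel Macron habite à Paris.")]
    output, _ = extractor.run(documents=docs)
    assert output["documents"][0].meta["entities"]  # at least one entity found
```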

sjrl · Oct 12 '22 11:10