
Extend support for token classification

Open mathislucka opened this issue 3 years ago • 1 comments

Is your feature request related to a problem? Please describe.

It would be great if support for token classification could be extended beyond what the Extractor currently offers. Specifically, we also need training and evaluation for token classification models. The node should also support splitting longer texts and aggregating the resulting predictions, to work around the 512-token limit present in most language models.
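
To make the splitting and aggregation request concrete, here is a minimal sketch in plain Python. It is not the Extractor's implementation: the function names, the `stride` overlap parameter, and the "keep the highest-scoring label per token" merge rule are all assumptions chosen for illustration, and `predict_window` is a hypothetical stand-in for whatever model call returns one `(label, score)` pair per token.

```python
from typing import Callable, List, Sequence, Tuple

Prediction = Tuple[str, float]  # (label, confidence score) for one token


def split_into_windows(n_tokens: int, max_len: int = 512, stride: int = 128) -> List[Tuple[int, int]]:
    """Return (start, end) spans that cover n_tokens, with consecutive spans overlapping by `stride`."""
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("need max_len > 0 and 0 <= stride < max_len")
    if n_tokens == 0:
        return []
    windows = []
    start = 0
    step = max_len - stride  # how far each new window advances
    while True:
        end = min(start + max_len, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return windows


def predict_long(
    tokens: Sequence[str],
    predict_window: Callable[[Sequence[str]], List[Prediction]],  # hypothetical model call
    max_len: int = 512,
    stride: int = 128,
) -> List[str]:
    """Run the model window by window and merge the overlapping predictions.

    Assumed merge rule: for a token covered by several windows,
    keep the label with the highest score.
    """
    merged: List[Prediction] = [("O", float("-inf"))] * len(tokens)
    for start, end in split_into_windows(len(tokens), max_len, stride):
        for offset, (label, score) in enumerate(predict_window(tokens[start:end])):
            idx = start + offset
            if score > merged[idx][1]:
                merged[idx] = (label, score)
    return [label for label, _ in merged]
```

Other merge rules are possible, for example preferring the window in which a token sits furthest from the edge. Which rule works best is exactly the kind of thing the requested evaluation support would help decide.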

Describe the solution you'd like

Extension / re-implementation of the Extractor node to support the additional features.

mathislucka avatar Aug 04 '22 15:08 mathislucka

Additionally, we want to consider different postprocessing strategies for combining the predicted labels. For example, the prediction ["B-DEFENDER", "I-DEFENDER"] will be combined into one entity, but what should be done with a prediction like ["O", "I-DEFENDER", "O"]?
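
As a rough illustration of what configurable postprocessing could look like, the sketch below groups BIO labels into entity spans and exposes the orphan `I-` case as an option. The function name `group_bio` and the `orphan_i` parameter with its two values are hypothetical, not part of the Extractor; other strategies are certainly possible.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (entity_type, start_index, end_index_exclusive)


def group_bio(labels: List[str], orphan_i: str = "start") -> List[Span]:
    """Group BIO labels into entity spans.

    orphan_i controls what happens to an "I-X" label that does not continue
    an open entity of type X (both options are assumptions for illustration):
      "start": treat it as the beginning of a new entity
      "drop":  ignore it, as if it were "O"
    """
    if orphan_i not in ("start", "drop"):
        raise ValueError("orphan_i must be 'start' or 'drop'")

    spans: List[Span] = []
    current: Optional[Tuple[str, int]] = None  # (entity_type, start_index) of the open entity

    for i, label in enumerate(labels):
        prefix, _, etype = label.partition("-")

        # An I- label of the same type simply extends the open entity.
        if prefix == "I" and current is not None and current[0] == etype:
            continue

        # Anything else closes the open entity, if there is one.
        if current is not None:
            spans.append((current[0], current[1], i))
            current = None

        # Open a new entity on B-, or on an orphan I- when configured to do so.
        if prefix == "B" or (prefix == "I" and orphan_i == "start"):
            current = (etype, i)

    if current is not None:
        spans.append((current[0], current[1], len(labels)))
    return spans
```

With this sketch, `group_bio(["B-DEFENDER", "I-DEFENDER"])` yields a single `DEFENDER` span covering both tokens, while `["O", "I-DEFENDER", "O"]` yields a one-token span under `orphan_i="start"` and no span under `orphan_i="drop"`.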

sjrl avatar Aug 05 '22 13:08 sjrl

@sjrl was this resolved by #3154?

masci avatar Nov 02 '22 07:11 masci

Hi @masci, PR #3154 partially resolves this issue. The PR did not add training and evaluation of token classification models. I can edit the text of the main issue to better reflect the remaining tasks.

sjrl avatar Nov 14 '22 15:11 sjrl