Extend support for token classification
Is your feature request related to a problem? Please describe.
It would be great if the support for token classification could be extended beyond what the Extractor currently offers. Specifically, we'd also need training and evaluation for token classification models. The node should also support splitting longer texts into chunks and aggregating the per-chunk predictions, to work around the 512-token limit present in most language models.
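As a rough illustration of the splitting side, something like the following could work; this is only a minimal sketch using the Hugging Face fast-tokenizer overflow/stride options, and the model name, window length, and stride are illustrative assumptions rather than anything the Extractor currently does:

```python
# Sketch: split a long text into overlapping token windows that fit the model,
# so each window can be classified separately and the predictions stitched back.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed model

def split_into_windows(text: str, max_length: int = 512, stride: int = 128):
    """Tokenize `text` into overlapping windows of at most `max_length` tokens."""
    return tokenizer(
        text,
        max_length=max_length,
        stride=stride,                   # overlap between consecutive windows
        truncation=True,
        return_overflowing_tokens=True,  # emit all windows, not just the first
        return_offsets_mapping=True,     # needed to map predictions back to text
    )
```

Each window would then be run through the token classification head independently, and the offset mappings used to map per-token predictions back onto character positions in the original text, resolving duplicates in the overlapping regions.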
Describe the solution you'd like
Extend or re-implement the Extractor node to support the additional features described above.
Additionally, we want to consider different postprocessing strategies for combining the predicted labels. For example, the prediction `["B-DEFENDER", "I-DEFENDER"]` is combined into a single entity, but what should be done with a prediction like `["O", "I-DEFENDER", "O"]`, where an `I-` tag appears without a preceding `B-` tag?
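To make the question concrete, here is a minimal sketch of two possible strategies: a "strict" one that drops a dangling `I-` tag and a "lenient" one that promotes it to the start of a new entity. The function and strategy names are hypothetical and not part of the Extractor:

```python
from typing import List, Tuple

def aggregate_bio(tags: List[str], strategy: str = "lenient") -> List[Tuple[int, int, str]]:
    """Group BIO tags into (start, end_exclusive, entity_type) spans.

    strategy="strict":  an "I-X" with no open entity of type X is discarded.
    strategy="lenient": such an "I-X" opens a new entity instead.
    """
    entities = []
    current = None  # (start_index, entity_type) of the currently open entity
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                entities.append((current[0], i, current[1]))
            current = (i, tag[2:])
        elif tag.startswith("I-"):
            if current and current[1] == tag[2:]:
                continue  # extends the currently open entity
            if current:
                entities.append((current[0], i, current[1]))
            # dangling I- tag: keep it or drop it depending on the strategy
            current = (i, tag[2:]) if strategy == "lenient" else None
        else:  # "O" closes any open entity
            if current:
                entities.append((current[0], i, current[1]))
            current = None
    if current:
        entities.append((current[0], len(tags), current[1]))
    return entities

print(aggregate_bio(["B-DEFENDER", "I-DEFENDER"]))         # [(0, 2, 'DEFENDER')]
print(aggregate_bio(["O", "I-DEFENDER", "O"], "strict"))   # []
print(aggregate_bio(["O", "I-DEFENDER", "O"], "lenient"))  # [(1, 2, 'DEFENDER')]
```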
@sjrl was this resolved by #3154?
Hi @masci, PR #3154 partially resolves this issue. The PR did not add the training and evaluation of token classification models. I can edit the text of the main issue to better reflect the remaining tasks.