
[Bug]: JsonlDataset cannot pass tokenizer

Open

Describe the bug

I'm trying to port an AllenNLP model to a framework that's still maintained, so I'm considering flair. My original model is a character-based LSTM tagger. It's character-based because the input consists of abbreviated sensor point names that are hard to tokenize reliably (e.g. "AHU-01-L2.ZnTSP" or "BLD2.LV3,VAV 01-02 DMPOS").

In particular, while I've created custom Hugging Face tokenizers using various regexes etc., no tokenizer other than a character-based one can guarantee a split before and after every entity I want to tag.

Whitespace is also important (or at least whitespace and punctuation are used interchangeably, so it doesn't make sense to drop whitespace while keeping punctuation). For this reason I can't use the standard CoNLL format, since as far as I know it doesn't allow whitespace tokens.

So I would generally use a spaCy-compatible JSONL format consisting of the text plus labelled spans, which JsonlDataset supports, but there is no way to specify any tokenization other than the default, i.e.

https://github.com/flairNLP/flair/blob/ca1b90bff70fe322087618994500d8c5f8e91d17/flair/datasets/sequence_labeling.py#L215
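
For reference, a record in that file looks roughly like this (the [start, end, label] span encoding and the "POINT" tag are my assumptions for illustration; the field names match the reproduce snippet below). Characters 10 to 15 cover "ZnTSP":

    {"text": "AHU-01-L2.ZnTSP", "spans": [[10, 15, "POINT"]]}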

Also, the code that matches character indices to token indices returns the index of the preceding token, not the token that was actually matched, i.e.

https://github.com/flairNLP/flair/blob/ca1b90bff70fe322087618994500d8c5f8e91d17/flair/datasets/sequence_labeling.py#L253

I think the conditions should be

            if token.start_position <= start < token.end_position and start_idx == -1:
            ...

            if token.start_position < end <= token.end_position and end_idx == -1:
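
As a self-contained illustration of the corrected matching (a sketch only, not the actual flair code; tokens here is any sequence of objects exposing start_position / end_position, as flair Tokens do):

    def find_token_span(tokens, start, end):
        """Map a character span [start, end) onto (start_idx, end_idx) token indices."""
        start_idx, end_idx = -1, -1
        for idx, token in enumerate(tokens):
            # the token that actually contains the span's first character
            if token.start_position <= start < token.end_position and start_idx == -1:
                start_idx = idx
            # the token that actually contains the span's last character
            if token.start_position < end <= token.end_position and end_idx == -1:
                end_idx = idx
        return start_idx, end_idx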

To Reproduce

from flair.datasets.sequence_labeling import JsonlDataset

train = JsonlDataset("data/processed/train.jsonl", text_column_name="text", label_column_name="spans")

Expected behavior

Sentences created by JsonlDataset should accept the use_tokenizer parameter.
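
Something along these lines would cover it (a sketch of the proposed API; use_tokenizer on JsonlDataset does not exist today, whereas Sentence already accepts it):

    from flair.data import Sentence
    from flair.datasets.sequence_labeling import JsonlDataset
    from flair.tokenization import SegtokTokenizer

    # Sentence already supports choosing a tokenizer:
    sentence = Sentence("AHU-01-L2.ZnTSP", use_tokenizer=SegtokTokenizer())

    # Proposed: forward the same parameter when JsonlDataset builds its Sentences
    train = JsonlDataset(
        "data/processed/train.jsonl",
        text_column_name="text",
        label_column_name="spans",
        use_tokenizer=SegtokTokenizer(),  # proposed parameter, not in flair yet
    )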

Logs and Stack traces

No response

Screenshots

No response

Additional Context

No response

Environment

n/a

david-waterworth · Jun 24 '24, 01:06