
[Bug]: TypeError: 'Token' object is not subscriptable

Open kdk2612 opened this issue 1 year ago • 7 comments

### Describe the bug

Getting an error when trying to train an NER model on a custom dataset. This was working back in Dec 2023: I trained a model using the same data and Flair version 0.13.1, but I am not sure what has changed since then. The sentences in the data are padded, e.g. "[PAD] X" rows appear in the dataset when a sentence is longer than 512 tokens.

I printed every span that the model is reading, and I see that for some spans I am getting back a Token object instead of a Span. I am not sure what is going on.
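
For illustration, indexing works on a Span but not on a Token, which matches the error in the title (a minimal standalone sketch, not my actual data):

```python
from flair.data import Sentence

sentence = Sentence("Berlin is nice")
span = sentence[0:1]   # slicing a Sentence gives a Span
token = sentence[0]    # indexing a Sentence gives a Token

print(span[0])   # works: a Span is subscriptable and yields its Tokens
print(token[0])  # raises TypeError: 'Token' object is not subscriptable
```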

### To Reproduce


```python
import flair
print(flair.__version__)

import torch
print(torch.__version__)
print(torch.cuda.is_available())

import logging
import sys
import time
from pathlib import Path
from typing import List

import pandas as pd
from tqdm import tqdm

from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import (
    FastTextEmbeddings,
    FlairEmbeddings,
    PooledFlairEmbeddings,
    StackedEmbeddings,
    TokenEmbeddings,
    WordEmbeddings,
)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# define columns
columns = {0: 'text', 1: 'ner'}

data_folder = "Data/new_sample"

# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train/flair_train_re_iobes.txt',
                              test_file='flair_test_re_iobes.txt',
                              dev_file='flair_val_re_iobes.txt',
                             in_memory=False)


# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)
print(tag_dictionary.idx2item)
print(tag_dictionary.get_items())

# 4. initialize embeddings
# !unset LD_LIBRARY_PATH
embedding_fldr = "EmbeddingData/FreeTextLarge"
embedding_types: List[TokenEmbeddings] = [
    FlairEmbeddings(f'{embedding_fldr}/Fullset_Forward/best-lm.pt'),
    FlairEmbeddings(f'{embedding_fldr}/Fullset_Backward/best-lm.pt'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

model_fldr = "NER/FreeText"

# 5. initialize sequence tagger

tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,
    use_rnn=True,
    rnn_layers=1,
    word_dropout=0.05,
    locked_dropout=0.5,
    train_initial_hidden_state=False,
    rnn_type='LSTM',
)

# 6. initialize trainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train(
    base_path=f'{model_fldr}/NER_EW_all_2',
    train_with_dev=True,
    max_epochs=2,
    learning_rate=0.1,
    mini_batch_size=64,
    monitor_test=True,
    embeddings_storage_mode="cpu",
)
```


### Expected behavior

The model should train properly using the CoNLL-style column data format.

### Logs and Stack traces

```stacktrace
This is the SPAN "Span| 35:38]: "U. s" - geography (1.0)"
Span|287:290]: "U. S"- geography' (1.0)
Span[287:290]: "U. S" → geography (1.0)
3
This is the SPAN "Span[287:290]: "U . S"
geography (1.0)"
Span| 302:305]: "U • S" - geography (1.0)
Span[302:305]: "U • S" - geography (1.0)
3
This is the SPAN "Span|302:305]: "U. S"
- geography (1.0)"
Span|508:512]: "[PAD] [PAD] IPADI [PAD]" →
Span[508:512]: "[PAD] [PAD] [PAD] [PAD]" → (1.0)
4
This is the SPAN "Span[508:512]: "[PAD] [PAD] [PAD]
IPADI" → (1.0)"
Token [186]: "D" → B-geography (1.0)
Token [186]: "D" - B-geography (1.0)
1
This is the SPAN "Token( 186]: "D" - B-geography (1.0)"Leng is 1: {span}
File "/home/ec2-user/SageMaker/flairtrainnerFreeTextERFE.py", line 160, in ‹module› trainer.train(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/trainers/trainer.py", line 200, in tr ain return self. train_custom(**local variables,
File "/home/ec2-user/anaconda3/envs/pytorch_p310/1ib/python3.10/site-packages/flair/trainers/trainer.py", line 601, in tr ain_custom
loss, datapoint_count = self.model. forward_
loss (batch
File "/home/ec2-user/anaconda3/envs/pytorch_p310/1ib/python3.10/site-packages/flair/models/sequence_tagger_model.py", lin e 276, in forward_loss
gold_labels = self._ prepare.
_label_tensor (sentences)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/models/sequence_tagger_model.py", lin e 427, in prepare_label _tensor
gold labels = self._ get_gold_labels(sentences)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/1ib/python3.10/site-packages/flair/models/sequence_tagger_model-py", lin e 406, in _get_gold_ _labels
sentence
_labels[span[0].idx - 1] = "S-" +

### Screenshots

No response

### Additional Context

No response

### Environment

Python 3.10, Flair 0.13.1

kdk2612 avatar Jun 21 '24 03:06 kdk2612

@alanakbik Please provide guidance; this is urgent.

kdk2612 avatar Jun 21 '24 03:06 kdk2612

Hello @kdk2612, the training script looks good, but I don't see where the printouts in the stack trace are coming from ("This is the SPAN ..."). Did you modify other parts of the code?

Are you calling get_labels() somewhere near this printout? If you want only the span annotations for NER, you should call get_labels('ner') instead, as otherwise it will also iterate over the token-level annotations.
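
For illustration, a minimal sketch of the difference (the sentence text and label values here are made up):

```python
from flair.data import Sentence

sentence = Sentence("George Washington went to Washington")
sentence[0:2].add_label("ner", "PER")   # span-level annotation
sentence[0].add_label("pos", "NNP")     # token-level annotation

print(sentence.get_labels())        # all labels, token-level ones included
print(sentence.get_labels("ner"))   # only the labels of the "ner" layer
```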

For us to be able to help, we'd need a runnable script (including dataset and embeddings) that throws the error. But from the printouts, I assume the error is thrown in custom code outside the library.

alanakbik avatar Jun 21 '24 13:06 alanakbik

Yes, I added the print statements in _get_gold_labels(); other than that I am not making any changes. This happens during both training and evaluation, and I am not able to finish training because of the above error.

The error is happening because I have some tokens in the format "[PAD] X", i.e. token + label. My assumption is that the error occurs because the "X" label is expected to have a prefix such as "S-" or "B-".

kdk2612 avatar Jun 21 '24 16:06 kdk2612

So the label is only "X"? Have you tried replacing the label with "B-X"?

Could you paste a part of the column corpus in plain text?

alanakbik avatar Jun 21 '24 16:06 alanakbik

Unfortunately I can't share the data, but here is what it looks like (token and tag columns):

```
Token O Token O Token O Token O Token O Token O , O Token O [PAD] X

Token O Token O Token O Token O Token O Token O ' O Token O Token O Token O Token O Token O Token O Token O Token O . O [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X

```

kdk2612 avatar Jun 21 '24 18:06 kdk2612

Ok, can you try replacing "X" with "B-X"? It should work then.
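
For example, something along these lines should do it (a rough sketch; the file names are placeholders, and "S-X" would also be a valid IOBES tag for single [PAD] tokens):

```python
# Rough sketch: prefix every bare "X" tag so it forms a valid span label.
def add_prefix_to_x(in_path: str, out_path: str, prefix: str = "B-") -> None:
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            parts = line.split()
            # Two-column format: "<token> <tag>"; blank lines separate sentences.
            if len(parts) == 2 and parts[1] == "X":
                dst.write(f"{parts[0]} {prefix}X\n")
            else:
                dst.write(line)

add_prefix_to_x("flair_train_re_iobes.txt", "flair_train_re_iobes_fixed.txt")
```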

alanakbik avatar Jun 21 '24 18:06 alanakbik

Yes, I used "S-" instead of "B-" Will let you know how this goes. Thanks for taking the time

kdk2612 avatar Jun 21 '24 18:06 kdk2612