[Bug]: TypeError: 'Token' object is not subscriptable
### Describe the bug
I get an error when training an NER model on a custom dataset. This worked in December 2023: I trained a model on the same data with Flair 0.13.1, and I am not sure what has changed since then. The data contains padded sentences, e.g. "[PAD] X" appears in the dataset when a sentence is longer than 512 tokens.
I printed every span that the model reads and see that for some spans I get back a Token object instead of a Span. I am not sure what is going on.
### To Reproduce
```python
import flair
import torch
from typing import List

from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, TokenEmbeddings
from flair.models import SequenceTagger

print(flair.__version__)
print(torch.__version__)
print(torch.cuda.is_available())
# define columns
columns = {0: 'text', 1: 'ner'}
data_folder = "Data/new_sample"
# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file='train/flair_train_re_iobes.txt',
    test_file='flair_test_re_iobes.txt',
    dev_file='flair_val_re_iobes.txt',
    in_memory=False,
)
# 2. what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)
print(tag_dictionary.idx2item)
print(tag_dictionary.get_items())
# 4. initialize embeddings
# !unset LD_LIBRARY_PATH
embedding_fldr = "EmbeddingData/FreeTextLarge"
embedding_types: List[TokenEmbeddings] = [
    FlairEmbeddings(f'{embedding_fldr}/Fullset_Forward/best-lm.pt'),
    FlairEmbeddings(f'{embedding_fldr}/Fullset_Backward/best-lm.pt'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
from flair.trainers import ModelTrainer
model_fldr = "NER/FreeText"
# 5. initialize sequence tagger
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,
    use_rnn=True,
    rnn_layers=1,
    word_dropout=0.05,
    locked_dropout=0.5,
    train_initial_hidden_state=False,
    rnn_type='LSTM',
)
# 6. initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
    base_path=f'{model_fldr}/NER_EW_all_2',
    train_with_dev=True,
    max_epochs=2,
    learning_rate=0.1,
    mini_batch_size=64,
    monitor_test=True,
    embeddings_storage_mode="cpu",
)
```
### Expected behavior
The model should train properly on data in the CoNLL column format.
### Logs and Stack traces
```stacktrace
This is the SPAN "Span[35:38]: "U . S" → geography (1.0)"
Span[287:290]: "U . S" → geography (1.0)
Span[287:290]: "U . S" → geography (1.0)
3
This is the SPAN "Span[287:290]: "U . S" → geography (1.0)"
Span[302:305]: "U . S" → geography (1.0)
Span[302:305]: "U . S" → geography (1.0)
3
This is the SPAN "Span[302:305]: "U . S" → geography (1.0)"
Span[508:512]: "[PAD] [PAD] [PAD] [PAD]" → (1.0)
Span[508:512]: "[PAD] [PAD] [PAD] [PAD]" → (1.0)
4
This is the SPAN "Span[508:512]: "[PAD] [PAD] [PAD] [PAD]" → (1.0)"
Token[186]: "D" → B-geography (1.0)
Token[186]: "D" → B-geography (1.0)
1
This is the SPAN "Token[186]: "D" → B-geography (1.0)"
Leng is 1: {span}
  File "/home/ec2-user/SageMaker/flairtrainnerFreeTextERFE.py", line 160, in <module>
    trainer.train(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/trainers/trainer.py", line 200, in train
    return self.train_custom(**local_variables,
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/trainers/trainer.py", line 601, in train_custom
    loss, datapoint_count = self.model.forward_loss(batch
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/models/sequence_tagger_model.py", line 276, in forward_loss
    gold_labels = self._prepare_label_tensor(sentences)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/models/sequence_tagger_model.py", line 427, in _prepare_label_tensor
    gold_labels = self._get_gold_labels(sentences)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/models/sequence_tagger_model.py", line 406, in _get_gold_labels
    sentence_labels[span[0].idx - 1] = "S-" +
```
### Screenshots
No response
### Additional Context
No response
### Environment
Python 3.10, Flair 0.13.1
@alanakbik Please provide guidance, this is urgent
Hello @kdk2612 the training script looks good, but I don't see where the printouts in the stacktrace are coming from ("This is the SPAN ..."). Did you modify other parts of the code?
Are you calling get_labels() somewhere near this printout? If you want only the span annotations for NER, you should call get_labels('ner') instead, as otherwise it will also iterate over the token-level annotations.
For us to be able to help, we'd need a runnable script (including dataset and embeddings) that throws the error. But from the printouts, I assume the error is thrown in custom code outside the library.
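To make the error message itself concrete: the traceback ends at an expression of the form `span[0].idx`, which only works if the object can be indexed. The sketch below reproduces the same `TypeError` with two hypothetical stand-in classes. They are not Flair's own `Token` and `Span`, and the indices are arbitrary; the point is only that indexing works on a span-like object and fails on a token-like one.

```python
class Token:
    """Stand-in for a single token: it has an index but cannot be indexed into."""

    def __init__(self, idx, text):
        self.idx = idx
        self.text = text


class Span:
    """Stand-in for a multi-token span: indexing returns one of its tokens."""

    def __init__(self, tokens):
        self.tokens = tokens

    def __getitem__(self, i):
        return self.tokens[i]

    def __len__(self):
        return len(self.tokens)


def first_token_idx(data_point):
    """Mimic the `span[0].idx` access seen in the traceback."""
    return data_point[0].idx


span = Span([Token(1, "U"), Token(2, "."), Token(3, "S")])
print(first_token_idx(span))  # a span-like object can be indexed

error_message = None
try:
    first_token_idx(Token(4, "D"))  # a token-like object cannot
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

If a token-level annotation reaches code that expects spans, this is the kind of failure to expect, which would be consistent with the `Token[186]` printout appearing right before the crash.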
Yes, I added the print statements in `_get_gold_labels()`; other than that, I have not made any changes. The error occurs during model training and evaluation, and I cannot finish training because of it.
The error happens because some lines in my data have the form "[PAD] X", i.e. a token followed by its label. My assumption is that it fails because the label "X" is expected to carry a prefix such as "S-" or "B-".
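One way to check that assumption before training is to scan the column file for labels that are neither "O" nor prefixed. The following is a minimal sketch, not part of Flair: `find_unprefixed_labels` and `label_column` are hypothetical names, and it assumes whitespace-separated columns with the label in the second column, as in the `columns` mapping of the training script.

```python
BIOES_PREFIXES = ("B-", "I-", "E-", "S-")


def find_unprefixed_labels(lines, label_column=1):
    """Return {label: count} for labels that are neither "O" nor BIOES-prefixed."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) <= label_column:
            continue  # blank or incomplete lines, e.g. sentence separators
        label = fields[label_column]
        if label != "O" and not label.startswith(BIOES_PREFIXES):
            counts[label] = counts.get(label, 0) + 1
    return counts


# Illustrative lines in the same shape as the data described in this thread.
sample = ["Token O", "D B-geography", "[PAD] X", "[PAD] X", ""]
unprefixed = find_unprefixed_labels(sample)
print(unprefixed)
```

On a real file, the same function could be applied to the lines read from the training, dev, and test files in turn.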
So the label is only "X"? Have you tried replacing the label with "B-X"?
Could you paste a part of the column corpus in plain text?
Unfortunately I can't share the data, but here is what it looks like:
Token O Token O Token O Token O Token O Token O , O Token O [PAD] X
Token O Token O Token O Token O Token O Token O ' O Token O Token O Token O Token O Token O Token O Token O Token O . O [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X [PAD] X
Ok, can you try replacing "X" with "B-X"? It should work then.
Yes, I used "S-" instead of "B-". I will let you know how this goes. Thanks for taking the time.
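For anyone hitting the same problem, the relabeling discussed above can be scripted rather than done by hand. This is a rough sketch under stated assumptions: `prefix_bare_label` is a hypothetical helper, the label sits in the second whitespace-separated column, and lines are passed without trailing newlines (for example from `str.splitlines()`). Whether "S-" or "B-" is the better prefix for padding tokens is left open here.

```python
def prefix_bare_label(lines, bare_label="X", prefix="S-", label_column=1):
    """Return a copy of `lines` where the bare label gets the given prefix."""
    fixed = []
    for line in lines:
        fields = line.split()
        if len(fields) > label_column and fields[label_column] == bare_label:
            fields[label_column] = prefix + bare_label
            fixed.append(" ".join(fields))
        else:
            fixed.append(line)  # other lines, including blank separators, are kept
    return fixed


# Illustrative lines in the same shape as the data described in this thread.
sample = ["Token O", ". O", "[PAD] X", "[PAD] X", ""]
fixed = prefix_bare_label(sample)
print(fixed)
```

Passing `prefix="B-"` gives the variant suggested earlier in the thread.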