
[Question]: Issue with NER Label Recognition using external annotated dataset

Open SPVillacorta opened this issue 11 months ago • 9 comments

Question

Hi Flair Community, I've got an annotated dataset in BIO format that I'm attempting to use with Flair for annotating 7 PDFs. Unfortunately, all my metric results are consistently zero. I'm seeking guidance and expertise to understand why the labels might not be recognized in this context. Any insights or advice on improving label recognition in Flair would be greatly appreciated. Thank you!

SPVillacorta avatar Sep 08 '23 02:09 SPVillacorta

Hi @Sanpau2022

please note that your results depend heavily on your choice of model and on the dataset you created. As you haven't shared information such as the training script or training logs, we can only guess what might help you.

My guess would be that 7 documents is a very small dataset; usually one wants to start with at least 100 training examples, and preferably 1000.

helpmefindaname avatar Sep 11 '23 07:09 helpmefindaname

Thanks, I resolved the formatting issue by converting my files from TXT to CSV. However, I've encountered a new challenge. The code is not functioning as expected. Below is the code I am trying to use, including the error message I received during execution. I would greatly appreciate your insights or assistance!

import flair
import glob
import nltk
import os
import pandas as pd
import pdfplumber
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from flair.data import Sentence, Corpus

nltk.download("punkt")

MODEL_DIR = "./model"
DATA_DIR = "./data"

def read_csv_to_sentences(csv_file_path: str):
    df = pd.read_csv(csv_file_path)
    sentences = []
    current_sentence = []
    for _, row in df.iterrows():
        token, label = row['text'], row['label']
        if token == '':
            sentences.append(Sentence(current_sentence))
            current_sentence = []
        else:
            current_sentence.append(f"{token} <{label}> ")
    if current_sentence:
        sentences.append(Sentence(current_sentence))
    return sentences

def train(data_dir: str, model_dir: str):
    train_data = read_csv_to_sentences(f"{data_dir}/train.csv")
    dev_data = read_csv_to_sentences(f"{data_dir}/dev.csv")
    test_data = read_csv_to_sentences(f"{data_dir}/test.csv")

    corpus = Corpus(train=train_data, dev=dev_data, test=test_data)

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings([
        FlairEmbeddings("mix-forward"),
        FlairEmbeddings("mix-backward"),
    ])

    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

    trainer.train(
        model_dir,
        learning_rate=0.2,
        mini_batch_size=30,
        max_epochs=100,
    )

train(DATA_DIR, MODEL_DIR)


[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
2023-09-12 03:55:47,092 Computing label dictionary. Progress: 1it [00:00, 3231.36it/s]
2023-09-12 03:55:47,097 ERROR: You specified label_type='ner' which is not in this dataset!
2023-09-12 03:55:47,098 ERROR: The corpus contains the following label types:


Exception                                 Traceback (most recent call last)
/tmp/ipykernel_290/178478599.py in <cell line: 65>()
     63 )
     64
---> 65 train(DATA_DIR, MODEL_DIR)

/tmp/ipykernel_290/178478599.py in train(data_dir, model_dir)
     39
     40     label_type = 'ner'
---> 41     tag_dictionary = corpus.make_label_dictionary(label_type=label_type)
     42
     43     embeddings: StackedEmbeddings = StackedEmbeddings([

~/venvs/kgenv03/lib/python3.10/site-packages/flair/data.py in make_label_dictionary(self, label_type, min_count, add_unk)
   1465             )
   1466             log.error(f"ERROR: The corpus contains the following label types: {contained_labels}")
-> 1467             raise Exception
   1468
   1469         log.info(

Exception:

SPVillacorta avatar Sep 12 '23 01:09 SPVillacorta

Hi again,

as the warning shows:

2023-09-12 03:55:47,097 ERROR: You specified label_type='ner' which is not in this dataset!
2023-09-12 03:55:47,098 ERROR: The corpus contains the following label types:
<Empty line>

You have not added any labels in your conversion method. Instead, you have added the labels as part of the token text.

You can use the following function to create a sentence out of the list of tokens and token-labels (assuming BIO format) to add the labels to the sentence:

from typing import List
from flair.data import get_spans_from_bio, Sentence


def create_labeled_sentence(tokens: List[str], tag_labels: List[str]) -> Sentence:
    sentence = Sentence(tokens)
    predicted_spans = get_spans_from_bio(tag_labels)
    for idx, _, label in predicted_spans:
        if label == "O":
            continue
        span = sentence[idx[0]: idx[-1] + 1]
        span.add_label("ner", value=label)
    return sentence

This converts the BIO tags into the labeled spans that flair uses.
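For illustration, here is a quick usage example (the tokens and tags below are made up):

tokens = ["Barrick", "Gold", "operates", "in", "Peru"]
tags = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]

sentence = create_labeled_sentence(tokens, tags)
print(sentence.get_labels("ner"))
# expected: a span "Barrick Gold" labeled ORG and a span "Peru" labeled LOC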

helpmefindaname avatar Sep 12 '23 18:09 helpmefindaname

Thanks, I modified the code as you suggested:

import flair
import glob
import nltk
import os
import pandas as pd
import pdfplumber
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from flair.data import Sentence, Corpus, get_spans_from_bio
from typing import List

nltk.download("punkt")

MODEL_DIR = "./model"
DATA_DIR = "./data"


# Function to create a labeled sentence
def create_labeled_sentence(tokens: List[str], tag_labels: List[str]) -> Sentence:
    sentence = Sentence(tokens)
    predicted_spans = get_spans_from_bio(tag_labels)
    for idx, _, label in predicted_spans:
        if label == "O":
            continue
        span = sentence[idx[0]: idx[-1] + 1]
        span.add_label("ner", value=label)
    return sentence

# Update read_csv_to_sentences to use create_labeled_sentence
def read_csv_to_sentences(csv_file_path: str):
    df = pd.read_csv(csv_file_path)
    sentences = []
    current_tokens = []
    current_labels = []
    for _, row in df.iterrows():
        token, label = row['text'], row['label']
        if token == '':
            if current_tokens and current_labels:
                sentences.append(create_labeled_sentence(current_tokens, current_labels))
            current_tokens = []
            current_labels = []
        else:
            current_tokens.append(token)
            current_labels.append(label)
    if current_tokens and current_labels:
        sentences.append(create_labeled_sentence(current_tokens, current_labels))
    return sentences


def train(data_dir: str, model_dir: str):
    train_data = read_csv_to_sentences(f"{data_dir}/train.csv")
    dev_data = read_csv_to_sentences(f"{data_dir}/dev.csv")
    test_data = read_csv_to_sentences(f"{data_dir}/test.csv")

    corpus = Corpus(train=train_data, dev=dev_data, test=test_data)

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings([
        FlairEmbeddings("mix-forward"),
        FlairEmbeddings("mix-backward"),
    ])

    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

    trainer.train(
        model_dir,
        learning_rate=0.2,
        mini_batch_size=30,
        max_epochs=100,
    )
    
train(DATA_DIR, MODEL_DIR)

However, it is still not working. After running that, I received:

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_96/2207024690.py in <cell line: 85>()
     83 )
     84
---> 85 train(DATA_DIR, MODEL_DIR)

/tmp/ipykernel_96/2207024690.py in train(data_dir, model_dir)
     52
     53 def train(data_dir: str, model_dir: str):
---> 54     train_data = read_csv_to_sentences(f"{data_dir}/train.csv")
     55     dev_data = read_csv_to_sentences(f"{data_dir}/dev.csv")
     56     test_data = read_csv_to_sentences(f"{data_dir}/test.csv")

/tmp/ipykernel_96/2207024690.py in read_csv_to_sentences(csv_file_path)
     47             current_labels.append(label)
     48     if current_tokens and current_labels:
---> 49         sentences.append(create_labeled_sentence(current_tokens, current_labels))
     50     return sentences
     51

/tmp/ipykernel_96/2207024690.py in create_labeled_sentence(tokens, tag_labels)
     21 # Function to create a labeled sentence
     22 def create_labeled_sentence(tokens: List[str], tag_labels: List[str]) -> Sentence:
---> 23     sentence = Sentence(tokens)
     24     predicted_spans = get_spans_from_bio(tag_labels)
     25     for idx, _, label in predicted_spans:

~/venvs/kgenv03/lib/python3.10/site-packages/flair/data.py in __init__(self, text, use_tokenizer, language_code, start_position)
    715         else:
    716             words = cast(List[str], text)
--> 717             text = " ".join(words)
    718
    719         # determine token positions and whitespace_after flag

TypeError: sequence item 16: expected str instance, float found

SPVillacorta avatar Sep 13 '23 03:09 SPVillacorta

This looks like your CSV file contains some empty texts, which pandas replaces with the NaN value. You can check for those values with pd.isna(...) and exclude them.
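For example, a minimal sketch of the row loop (assuming the same "text"/"label" columns as in your script). Note that pandas reads empty CSV cells as NaN rather than '', so the sentence-boundary check should be adjusted as well:

import pandas as pd

df = pd.read_csv("./data/train.csv")
current_tokens, current_labels = [], []
for _, row in df.iterrows():
    token, label = row['text'], row['label']
    if pd.isna(token):
        # empty cell read as NaN -> treat as a sentence boundary
        # (finish the current sentence here, as in your code)
        current_tokens, current_labels = [], []
        continue
    if pd.isna(label):
        # token without a label -> skip the row
        continue
    current_tokens.append(str(token))
    current_labels.append(str(label))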

Btw when sharing code, please use syntax-highlighting for the code, e.g. by writing:

```python
# code goes here
if True:
   print("Hello World")
```

then we'll see properly highlighted code:

# code goes here
if True:
   print("Hello World")

which makes it much easier to read and understand your comments.

helpmefindaname avatar Sep 18 '23 08:09 helpmefindaname

I checked the files and they look OK, so you might want to have a look at the code I originally used when working with TXT annotations:

import os
import flair
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
import pdfplumber

MODEL_DIR = "./model" # Folder to save the model
DATA_DIR = "./data" # Folder containing train, dev, test
PDF_DIR = "./pdfs"  # Folder containing PDF files

def pdf_to_conll(pdf_dir: str, data_dir: str):
    # Implement the function to convert PDF files to the required format using pdfplumber
    pass

def train(data_dir: str, model_dir: str):
    pdf_to_conll(PDF_DIR, DATA_DIR)

    assert os.path.isdir(
        data_dir
    ), "Directory for data does not exist - please create and add data then try again."
    assert os.path.isdir(
        model_dir
    ), "Directory for model does not exist - please create and try again."

    columns = {0: 'text', 1: 'ner'}
    corpus: Corpus = ColumnCorpus(data_dir, columns,
                                  train_file="train.txt",
                                  dev_file="dev.txt",
                                  test_file="test.txt")

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings(
        [
            FlairEmbeddings("mix-forward"),
            FlairEmbeddings("mix-backward"),
        ]
    )

    # 6. initialize sequence tagger
    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    # 7. initialize trainer 
    trainer: ModelTrainer = ModelTrainer(tagger, corpus)
    
    # 8. start training
    trainer.train(
        model_dir,
        learning_rate=0.1,
        mini_batch_size=2,
        max_epochs=50,
        embeddings_storage_mode=None,
    )

if __name__ == "__main__":
    train(DATA_DIR, MODEL_DIR)

However, my BIO-formatted labels still do not seem to be recognised: when I run that code, the metrics are all zero. What can I do?

SPVillacorta avatar Sep 20 '23 05:09 SPVillacorta

Actually, I noticed differences between the TXT file that works (right side) and the one that doesn't (left side):

[screenshot: Differences]

SPVillacorta avatar Sep 20 '23 06:09 SPVillacorta

Can you share the logs of your training run? There are various reasons why an ML model might fail to learn; it wouldn't help you if I just guessed at what the problem could be.
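In the meantime, a quick sanity check you can run (a minimal sketch, assuming the same ColumnCorpus setup as in your script): print a few parsed training sentences and the label dictionary. If no "ner" spans show up, the label column is not being read correctly.

from flair.datasets import ColumnCorpus

# same column mapping and files as in the training script
corpus = ColumnCorpus("./data", {0: 'text', 1: 'ner'},
                      train_file="train.txt",
                      dev_file="dev.txt",
                      test_file="test.txt")

# the printed sentences should show NER spans,
# and get_labels("ner") should not be empty
for i in range(3):
    sentence = corpus.train[i]
    print(sentence)
    print(sentence.get_labels("ner"))

# the label dictionary should contain your entity types
print(corpus.make_label_dictionary(label_type="ner"))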

helpmefindaname avatar Sep 25 '23 09:09 helpmefindaname

I solved this issue by taking my previously successful TXT files (train, test, dev) and integrating the content of the non-functional TXT into them as new entries. After running with those files, it works.

SPVillacorta avatar Oct 26 '23 06:10 SPVillacorta