Issue with custom dataset after updating to flair 0.11

Open stefanobranco opened this issue 2 years ago • 5 comments

Describe the bug

We're using flair to perform named entity recognition to identify specific parts of a document as parts of a citation to a different document. Our dataset consists of space-separated tokens and labels, like this:

Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC

I'm reading in the dataset with a ColumnCorpus like this:

from flair.data import Corpus
from flair.datasets import ColumnCorpus

# define columns
columns = {0: 'text', 1: 'ner'}

# this is the folder in which train, test and dev files reside
data_folder = '../Data/Flair/Regex_Tagging_Full'

# init a corpus using column format and the data folder
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              in_memory=True,
                              document_separator_token="-DOCSTART-")

From my understanding, the expected corpus would be something like this:

image
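To make the expected reading concrete, here is a minimal sketch in plain Python of how the two-column sample above maps to (token, label) pairs. It is not flair's actual reader: `read_column_lines` is a hypothetical helper, and it assumes that blank lines separate sentences and that a line without a label counts as `O`.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def read_column_lines(lines: List[str]) -> List[Sentence]:
    """Group whitespace-separated 'token label' lines into sentences.

    Hypothetical helper for illustration only; a blank line ends a sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # blank line: close the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        token = parts[0]
        # assumption: a missing label column is treated as "O"
        label = parts[1] if len(parts) > 1 else "O"
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


sample = """Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC
""".splitlines()

parsed = read_column_lines(sample)
# keep only the tokens that carry a real label
entities = [(token, label) for token, label in parsed[0] if label != "O"]
print(entities)
```

Under these assumptions the sample yields one sentence with four labeled tokens (RAB, 1/2010, Rz, 8.), each keeping its full label.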

This is also how it worked in 0.10. However, ever since upgrading to 0.11, our dataset is being ripped apart, and our labels are cut down in weird ways (it looks like the first two characters are replaced with a forward slash?):

image

I understand the labeling logic has been refactored, but I don't assume the change in behaviour is intended, or is there a setting about labels that I'm missing?

This also doesn't just seem to be a display issue in the dataset, since this causes an entirely incorrect label dictionary to be created full of broken labels.
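One way to check whether the label dictionary really differs from the data is to count the labels in the raw file and compare that inventory with whatever the corpus reports. The following is a rough sketch in plain Python, independent of flair; `count_labels` is a hypothetical helper, and it assumes the label is always the second whitespace-separated column.

```python
from collections import Counter
from typing import Iterable


def count_labels(lines: Iterable[str], separator: str = "-DOCSTART-") -> Counter:
    """Count the labels in the second column of a column-formatted file.

    Hypothetical helper: blank lines, separator lines and lines without a
    label column are skipped.
    """
    counts: Counter = Counter()
    for raw in lines:
        parts = raw.split()
        if len(parts) < 2 or parts[0] == separator:
            continue
        counts[parts[1]] += 1
    return counts


sample_lines = [
    "-DOCSTART-",
    "",
    "Vgl. O",
    "Rundschreiben O",
    "RAB PARTA",
    "1/2010 YEAR",
    "Rz MISC",
    "8. MISC",
]

label_counts = count_labels(sample_lines)
print(sorted(label_counts))
```

Any label in the generated dictionary that does not appear in this inventory would point to a parsing problem rather than a display problem.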

stefanobranco, Apr 11 '22 09:04

Hello @stefanobranco, I am not seeing this behavior with your snippet. I get the following printout:

Sentence: "Vgl. Rundschreiben RAB 1/2010 Rz 8." → ["RAB"/PARTA, "1/2010"/YEAR, "Rz"/MISC, "8."/MISC]

and if I do:

for entity in corpus.train[0].get_labels('ner'):
    print(entity)

I get:

Token[2]: "RAB" → PARTA (1.0)
Token[3]: "1/2010" → YEAR (1.0)
Token[4]: "Rz" → MISC (1.0)
Token[5]: "8." → MISC (1.0)

So it seems that everything is working as it should.

alanakbik, Apr 12 '22 04:04

Hi @alanakbik! Thanks for the feedback. I completely uninstalled the flair package and then reinstalled it, and now I can no longer reproduce the problem either. It seems something must have gone wrong during the update on my end. Sorry for the confusion!

stefanobranco, Apr 12 '22 05:04

Hey @alanakbik! Sorry to dig this out again, but it turns out the issue is not quite resolved after all, and I think I figured out the root cause. We are using document separator tokens to mark the boundaries of documents. The problem appears only if the training file starts with such a document separator token:

-DOCSTART-

Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC

This happens regardless of whether the document_separator_token value is set or not. Is it incorrect to have a document separator token right at the start in the first place? It seemed sensible to me, since it's called "-DOCSTART-" in all examples, but I guess functionally it might also just start appearing after the first document. I'm not even sure this is a bug, but since this behaved differently in 0.10 I figured it's probably still worth looking into.
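A possible workaround, which is an assumption and not something confirmed in this thread, is to drop a separator that sits at the very start of the file before the corpus is loaded, leaving later separators in place. The sketch below is plain Python; `drop_leading_separator` is a hypothetical helper that works on the lines of the training file.

```python
from typing import List


def drop_leading_separator(lines: List[str], separator: str = "-DOCSTART-") -> List[str]:
    """Remove a separator line at the very start of the file.

    Hypothetical helper: blank lines directly around the leading separator
    are removed too; the input is returned unchanged if the first non-blank
    line is not the separator.
    """
    i = 0
    # skip blank lines before the first real line
    while i < len(lines) and not lines[i].strip():
        i += 1
    # check whether the first non-blank line starts with the separator token
    if i < len(lines) and lines[i].split()[:1] == [separator]:
        i += 1
        # also skip the blank lines that follow the separator
        while i < len(lines) and not lines[i].strip():
            i += 1
        return lines[i:]
    return lines


with_separator = [
    "-DOCSTART-",
    "",
    "Vgl. O",
    "Rundschreiben O",
    "RAB PARTA",
    "1/2010 YEAR",
    "Rz MISC",
    "8. MISC",
]

cleaned = drop_leading_separator(with_separator)
print(cleaned[0])
```

Writing the cleaned lines back to a copy of the training file keeps the original data untouched. Whether the first document then needs an explicit boundary marker depends on how the separator is used downstream.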

stefanobranco, Apr 12 '22 18:04

@stefanobranco I just merged a PR that should make span detection more robust and hopefully cover your case (DOCSTART as first sentence).

alanakbik, May 09 '22 03:05

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot], Sep 09 '22 02:09