flair
Issue with custom dataset after updating to flair 0.11
Describe the bug
We're using flair for named entity recognition to identify the specific parts of a document that make up a citation to a different document. Our dataset consists of space-separated tokens and labels, like this:
Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC
I'm reading in the dataset with a ColumnCorpus like this:
from flair.data import Corpus
from flair.datasets import ColumnCorpus

# define columns
columns = {0: 'text', 1: 'ner'}
# this is the folder in which train, test and dev files reside
data_folder = '../Data/Flair/Regex_Tagging_Full'
# init a corpus using column format and the data folder
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              in_memory=True,
                              document_separator_token="-DOCSTART-")
From my understanding, the expected corpus would be something like this:
This is also how it worked in 0.10. However, ever since upgrading to 0.11, our dataset is being ripped apart, and our labels are cut down in odd ways (it looks like the first two characters are replaced with a forward slash?):
I understand the labeling logic has been refactored, but I assume this change in behaviour is not intended. Or is there a setting about labels that I'm missing?
This also doesn't just seem to be a display issue in the dataset, since this causes an entirely incorrect label dictionary to be created full of broken labels.
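One way to confirm which labels should end up in the dictionary is to count them straight from the column file, without going through flair at all. The following is only a minimal sketch: the helper name count_column_labels is hypothetical, and it assumes whitespace-separated columns with the label in the second column, as in the sample above.

```python
from collections import Counter


def count_column_labels(lines, label_column=1, separator_token="-DOCSTART-"):
    """Count the labels found in whitespace-separated column data.

    Hypothetical helper for a sanity check; it does not use flair.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines (sentence boundaries) and document separator lines.
        if not fields or fields[0] == separator_token:
            continue
        # Only count lines that actually have a label column.
        if len(fields) > label_column:
            counts[fields[label_column]] += 1
    return counts


# Sample data taken from the description above.
sample = """Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC
""".splitlines()

label_counts = count_column_labels(sample)
print(sorted(label_counts.items()))
```

Any label in the generated dictionary that does not appear in these counts would point to a parsing problem rather than a data problem.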
Hello @stefanobranco, I am not seeing this behavior with your snippet. I get the following printout:
Sentence: "Vgl. Rundschreiben RAB 1/2010 Rz 8." → ["RAB"/PARTA, "1/2010"/YEAR, "Rz"/MISC, "8."/MISC]
and if I do:
for entity in corpus.train[0].get_labels('ner'):
    print(entity)
I get:
Token[2]: "RAB" → PARTA (1.0)
Token[3]: "1/2010" → YEAR (1.0)
Token[4]: "Rz" → MISC (1.0)
Token[5]: "8." → MISC (1.0)
So it seems that everything is working as it should.
Hi @alanakbik! Thanks for the feedback. I completely uninstalled the flair package and then reinstalled it, and now I can no longer reproduce the problem either. It seems something must have gone wrong during the update on my end. Sorry for the confusion!
Hey @alanakbik! Sorry to dig this out again, but it turns out the issue is not quite resolved after all, and I think I figured out the root cause. We are using document separator tokens to mark the boundaries of documents. The problem appears only if the training file starts with such a document separator token:
-DOCSTART-
Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC
This happens regardless of whether the document_separator_token value is set or not. Is it incorrect to have a document separator token right at the start in the first place? It seemed sensible to me, since it's called "-DOCSTART-" in all examples, but I guess functionally it might also just start appearing after the first document. I'm not even sure this is a bug, but since this behaved differently in 0.10 I figured it's probably still worth looking into.
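As a possible stopgap on an affected version, the leading separator could be stripped from the file before it is handed to ColumnCorpus, keeping the separators between later documents. This is only a sketch of a workaround that was not proposed in this thread: drop_leading_separator is a hypothetical helper, it works on plain lines and does not call flair, and it assumes the separator sits in the first column.

```python
def drop_leading_separator(lines, separator_token="-DOCSTART-"):
    """Return the lines without a document separator at the very start.

    Hypothetical preprocessing helper; separators further down are kept.
    """
    lines = list(lines)
    index = 0
    # Skip any blank lines before the first real line.
    while index < len(lines) and not lines[index].strip():
        index += 1
    # If the first real line is the separator, drop it.
    if index < len(lines) and lines[index].split()[0] == separator_token:
        index += 1
        # Also drop the blank lines that directly follow the separator.
        while index < len(lines) and not lines[index].strip():
            index += 1
        return lines[index:]
    # No leading separator: return the input unchanged.
    return lines


# Sample data taken from the description above, with a second document added.
raw = [
    "-DOCSTART-",
    "",
    "Vgl. O",
    "Rundschreiben O",
    "RAB PARTA",
    "1/2010 YEAR",
    "Rz MISC",
    "8. MISC",
    "",
    "-DOCSTART-",
    "",
    "RAB PARTA",
]

cleaned = drop_leading_separator(raw)
print(cleaned[0])
```

The cleaned lines would then be written to a new training file. Whether this is still needed after the fix mentioned below has not been verified here.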
@stefanobranco I just merged a PR that should make span detection more robust and hopefully cover your case (DOCSTART as first sentence).