NeuroNER
Adding a new entity type
I have created a dataset for adding a new entity type, let's say an "XYZ" entity type, and I have combined the new train, valid and test data with the original CoNLL data in BRAT format. If I run the main.py script on this new data, I am not getting proper results: precision and recall are 0%, and I don't even see the new "XYZ" entity type in those metrics. As per the documentation, I have train, valid and test folders in BRAT format, since it is easy for me to create a dataset in BRAT format. Is there anything I am missing?
Are you able to upload the first few lines of one of your annotation files?
Otherwise, are you able to open your dataset with the Brat server, and see annotated entries as you expect to see them?
@JohnGiorgi Attaching a snippet from the train folder: train_text_03477.txt
Compared to the Motorola H500 , this is a wonderful piece of gear although it is much larger .
I only buy HTC phones , and if this was " made by HTC " they had it contracted out .
Sure , it worked fast but I 'm bothered by what seems planned obsolescence by the manufacturers at Nokia .
But the graphics are awesome .. thak you NVIDIA .. for at least putting in good graphics ....
“ China and the U.S. are good partners now , with Boeing and Microsoft , ” Hu said .
train_text_03477.ann
T1 TECH 16 24 Motorola
T2 TECH 106 109 HTC
T3 TECH 145 148 HTC
T4 TECH 279 284 Nokia
T5 TECH 328 334 NVIDIA
T6 TECH 442 451 Microsoft
Okay... there is nothing obviously wrong with this. I would check a few things:
1. In the parameter file you provide when running NeuroNER, make sure the line dataset_text_folder = ../data/example points to a folder (in this case ../data/example) with subdirectories train, valid and (optionally) test; e.g. ../data/example/train must exist. A quick sanity check for this layout is sketched right after this list.
2. Download BRAT (instructions for macOS here, but there are instructions for Windows and Linux in that repo as well) and load up your dataset. Make sure the BRAT web server is not throwing any errors or complaining about your dataset, and make sure the annotations appear as expected.
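For point 1, a minimal layout check could look like this (just a sketch; ../data/example is the example value from parameters.ini, so substitute your own dataset_text_folder):

import glob
import os

# must match the dataset_text_folder value in your parameters.ini (example value shown)
dataset_text_folder = "../data/example"
for subset in ("train", "valid", "test"):
    folder = os.path.join(dataset_text_folder, subset)
    txt_files = glob.glob(os.path.join(folder, "*.txt"))
    ann_files = glob.glob(os.path.join(folder, "*.ann"))
    # every .txt file should have a matching .ann file with the same basename
    missing = [f for f in txt_files if not os.path.isfile(f[:-4] + ".ann")]
    print(subset, "| exists:", os.path.isdir(folder),
          "| .txt:", len(txt_files), "| .ann:", len(ann_files),
          "| .txt without .ann:", len(missing))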
Other than that, I am not sure what is going wrong. Are you able to run NeuroNER successfully with one of the given datasets?
@JohnGiorgi I can see the annotations, and they look correct! Also, I can run the scripts without any errors; it's just that I am not getting any results back!
Thanks for the reply anyway!
Very weird...
Someone who knows much more about NeuroNER's inner workings will have to help you!
I was actually going through all the scripts and trying to debug the issue; it just seems a little complex given the structure of the repo. I will post the solution if I'm able to fix this issue!
@JohnGiorgi In case you want to see the output I am getting from running main.py with everything in its proper place:
Formatting train set from BRAT to CONLL... Done.
Converting CONLL from BIO to BIOES format... Done.
Formatting valid set from BRAT to CONLL... Done.
Converting CONLL from BIO to BIOES format... Done.
Formatting test set from BRAT to CONLL... Done.
Converting CONLL from BIO to BIOES format... Done.
Load dataset... done (18.63 seconds)
Load token embeddings... done (0.05 seconds)
number_of_token_original_case_found: 243
number_of_token_lowercase_found: 94
number_of_token_digits_replaced_with_zeros_found: 0
number_of_token_lowercase_and_digits_replaced_with_zeros_found: 0
number_of_loaded_word_vectors: 337
dataset.vocabulary_size: 339
Starting epoch 0
Training completed in 0.00 seconds
Evaluate model on the train set
processed 196 tokens with 22 phrases; found: 120 phrases; correct: 4.
accuracy: 5.61%; precision: 3.33%; recall: 18.18%; FB1: 5.63
LOC: precision: 0.00%; recall: 0.00%; FB1: 0.00 4
MISC: precision: 3.00%; recall: 50.00%; FB1: 5.66 100
ORG: precision: 0.00%; recall: 0.00%; FB1: 0.00 12
PER: precision: 25.00%; recall: 20.00%; FB1: 22.22 4
Evaluate model on the valid set
processed 196 tokens with 25 phrases; found: 73 phrases; correct: 0.
accuracy: 7.65%; precision: 0.00%; recall: 0.00%; FB1: 0.00
LOC: precision: 0.00%; recall: 0.00%; FB1: 0.00 8
MISC: precision: 0.00%; recall: 0.00%; FB1: 0.00 46
ORG: precision: 0.00%; recall: 0.00%; FB1: 0.00 17
PER: precision: 0.00%; recall: 0.00%; FB1: 0.00 2
Evaluate model on the test set
processed 195 tokens with 20 phrases; found: 91 phrases; correct: 2.
accuracy: 5.13%; precision: 2.20%; recall: 10.00%; FB1: 3.60
LOC: precision: 0.00%; recall: 0.00%; FB1: 0.00 7
MISC: precision: 2.99%; recall: 33.33%; FB1: 5.48 67
ORG: precision: 0.00%; recall: 0.00%; FB1: 0.00 14
PER: precision: 0.00%; recall: 0.00%; FB1: 0.00 3
Generating plots for the train set
Generating plots for the valid set
Generating plots for the test set
Starting epoch 1
Training completed in 0.28 seconds
Evaluate model on the train set
processed 196 tokens with 22 phrases; found: 0 phrases; correct: 0.
accuracy: 83.16%; precision: 0.00%; recall: 0.00%; FB1: 0.00
LOC: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
MISC: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
ORG: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
PER: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
Evaluate model on the valid set
processed 196 tokens with 25 phrases; found: 0 phrases; correct: 0.
accuracy: 84.18%; precision: 0.00%; recall: 0.00%; FB1: 0.00
LOC: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
MISC: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
ORG: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
PER: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
Evaluate model on the test set
processed 195 tokens with 20 phrases; found: 0 phrases; correct: 0.
accuracy: 84.62%; precision: 0.00%; recall: 0.00%; FB1: 0.00
LOC: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
MISC: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
PER: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
Oh yeah, something is going wrong after the 0-th epoch.
How many examples are in each partition of your dataset?
Train: 3477 files, Test: 412 files, Valid: 614 files.
Each file has an average of 20 sentences. Everything is in .txt and .ann format.
Can you tell me how to add more entities? I have to train my data so it can detect courses (CSR), designations (DSG), etc. I am trying to do it but I am getting an error:
"Please ensure that only the following labels exist in the dataset: {0}".format(', '.join(self.unique_labels))) AssertionError: The label B-CSR does not exist in the pretraining dataset. Please ensure that only the following labels exist in the dataset: B-LOC, B-MISC, B-ORG, B-PER, E-LOC, E-MISC, E-ORG, E-PER, I-LOC, I-MISC, I-ORG, I-PER, O, S-LOC, S-MISC, S-ORG, S-PER
I would need more information to home in on the error. But in general, adding more entities simply involves labeling your training set and valid set (and possibly test set) for those entities.
Did you make sure to label CSR and DSG entities in the train set?
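If it helps, a quick way to double-check which labels actually appear in your converted data is something like the following (a sketch; the filename and the assumption that the tag is the last whitespace-separated column are illustrative, so adjust them to your data):

# collect every NER label that appears in a CoNLL-formatted file
labels = set()
with open("train.conll", encoding="utf-8") as f:  # hypothetical filename
    for line in f:
        parts = line.split()
        if parts:
            labels.add(parts[-1])
print(sorted(labels))
# compare this set against the label list in the AssertionError above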
@spate141 I have similar results on my own dataset. I guess this is happening due to the overwhelming number of 'O' labels in the dataset, at least in my case. However, results become more reasonable after the 6th or 7th epoch, so be patient!
Is there a way to use only the "U-" tag, e.g. "Software development" tagged as U-DSG U-DSG, and then train NeuroNER?
I would like to follow up on @Mrinal18's question. If I first train on a dataset that has the labels "a", "b" and "c", I would then like to use that model as a pretrained model for my next dataset, which has the labels "a", "b", "c" and "d". Is this possible, and if so, how can I do it? Currently I get the same error as @Mrinal18, but it does make sense, since "d" isn't in the pretrained model.
I have encountered the same problem recently. Have you solved it? Can you tell me your solution? Thank you!
I have encountered the same problem recently: with 20 epochs, it stops at epoch 9 with an early stop ("Finishing the experiment"). Have you solved it? Can you tell me your solution? Thank you!
Hi, I want to train the model again on my own dataset, but I can't find the source you are discussing for training with the command "python main.py"; I didn't find the main file there. Can anyone explain how I can train NeuroNER on my own CoNLL dataset? I have split the dataset and uploaded it into the respective folder (a new folder I created in data\conll\mydata) and also changed the dataset_text_folder path in the parameters.ini file.
Any help would be appreciated.
This is the main file. This repo is out of date and you will have better luck finding another alternative to train your custom NER model. Try spaCy or something from that list which suits your requirements.
Thank you so much for replying. Yes, you are right, it is out of date. I did the same thing: renamed the file to main.py and ran the command "python main.py". As I mentioned, I had already uploaded my dataset (train, test, valid, deploy) to the conll folder and made the changes in the parameters.ini file, but while running the main file it gives an AttributeError about "sess".
Yes, I did try spaCy, but my end goal is quite different. Let me explain in detail.
The aim is to find PHI in a medical dataset: patient name, telephone, doctor name, hospital name, etc. I labelled the data using Label Studio and exported it in CoNLL format; the data looks like "Robin -X- _ B-PATIENT_NAME". While exploring, I found this repo and thought of using it, but as I said, I am getting the AttributeError about "sess".
The TF version is 1.13.0.
I have gone through other issues; they advised importing distutils, but it is not working.
If you could please assist.
@Aj-232425 Which TF version do you have? As far as I remember, NeuroNER is written for TF v1 and the latest version is TF v2.
Try importing this in the main.py file and then run it:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
Also, as @spate141 recommended, try using spaCy for adding custom NER labels.
Here is example code for your use case (this is just an example, please modify it accordingly):
import random
import spacy
from spacy.tokens import Doc
from spacy.training import iob_to_biluo
from spacy.training.example import Example
# load a blank English pipeline
nlp = spacy.blank("en")
# create a new entity recognizer for our custom labels (spaCy v3 API: add_pipe takes the component name)
ner = nlp.add_pipe("ner", last=True)
# define our custom labels (e.g. "PATIENT_NAME", "DR_NAME", "HOSPITAL_NAME")
custom_labels = ["PATIENT_NAME", "DR_NAME", "HOSPITAL_NAME", "TELEPHONE"]
# add the labels to the NER component
for label in custom_labels:
    ner.add_label(label)
# load the labelled data in CoNLL format (sentences separated by blank lines,
# token in the first column, IOB tag in the last column)
with open("data.conll", "r", encoding="utf-8") as f:
    data = f.read().strip().split("\n\n")
# convert the labelled data into spaCy training examples
examples = []
for sample in data:
    words, tags = [], []
    for line in sample.split("\n"):
        parts = line.split()
        if len(parts) >= 2:
            words.append(parts[0])
            tags.append(parts[-1])
    # build the Doc directly from the CoNLL tokens so the tags stay aligned,
    # and convert IOB tags (B-/I-/O) to the BILUO scheme spaCy expects
    doc = Doc(nlp.vocab, words=words)
    examples.append(Example.from_dict(doc, {"entities": iob_to_biluo(tags)}))
# train the NER model
optimizer = nlp.initialize(lambda: examples)
for epoch in range(10):
    random.shuffle(examples)
    losses = {}
    for example in examples:
        nlp.update([example], sgd=optimizer, losses=losses)
    print(f"Epoch {epoch}: {losses}")
# test the NER model on a sample text
doc = nlp("Robin called Dr. Smith at 555-1234 to make an appointment at St. Mary's Hospital.")
print([(ent.text, ent.label_) for ent in doc.ents])
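If that trains cleanly, the pipeline can be saved and reloaded later; a minimal sketch (the directory name is just an example):

# save the trained pipeline to disk and load it back (the path is illustrative)
nlp.to_disk("phi_ner_model")
nlp_loaded = spacy.load("phi_ner_model")
doc = nlp_loaded("Dr. Smith admitted Robin to St. Mary's Hospital.")
print([(ent.text, ent.label_) for ent in doc.ents])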