
Adding a new entity type

spate141 opened this issue 7 years ago · 16 comments

I have created a dataset that adds a new entity type, let's say "XYZ", and I have combined the new train, valid, and test data with the original CoNLL data in BRAT format. If I run the main.py script on this new data, I am not getting proper results: precision and recall are 0%, and I don't even see the new "XYZ" entity type in the evaluation metrics. As per the documentation, I have train, valid, and test folders in BRAT format, since it's easy for me to create a dataset in BRAT format. Is there anything I am missing?

spate141 avatar Oct 23 '17 21:10 spate141

Are you able to upload the first few lines of one of your annotation files?

Otherwise, are you able to open your dataset with the Brat server, and see annotated entries as you expect to see them?

JohnGiorgi avatar Oct 24 '17 21:10 JohnGiorgi

@JohnGiorgi Attaching a snippet from the train folder: train_text_03477.txt

Compared to the Motorola H500 , this is a wonderful piece of gear although it is much larger .
I only buy HTC phones , and if this was " made by HTC " they had it contracted out .
Sure , it worked fast but I 'm bothered by what seems planned obsolescence by the manufacturers at Nokia .
But the graphics are awesome .. thak you NVIDIA .. for at least putting in good graphics ....
“ China and the U.S. are good partners now , with Boeing and Microsoft , ” Hu said .

train_text_03477.ann

T1	TECH 16 24	Motorola
T2	TECH 106 109	HTC
T3	TECH 145 148	HTC
T4	TECH 279 284	Nokia
T5	TECH 328 334	NVIDIA
T6	TECH 442 451	Microsoft
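As a quick cross-check, each .ann line's character offsets should slice out exactly the annotated surface text from the .txt file (e.g. characters 16–24 of the first sentence above are "Motorola"). A minimal sketch of such a checker (a hypothetical helper, not part of BRAT or NeuroNER; it assumes simple, non-discontinuous spans):

```python
def check_ann_offsets(txt, ann_lines):
    """Return (id, expected, found) tuples for BRAT annotations whose
    character offsets do not slice out the annotated surface text.

    Each annotation line is tab-separated: "T1\tTECH 16 24\tMotorola".
    """
    mismatches = []
    for line in ann_lines:
        tid, span, surface = line.split("\t")
        label, start, end = span.split(" ")
        found = txt[int(start):int(end)]
        if found != surface:
            mismatches.append((tid, surface, found))
    return mismatches
```

An annotation whose offsets drifted (e.g. after editing the .txt file) would show up here before it silently corrupts the BRAT-to-CoNLL conversion.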

spate141 avatar Oct 24 '17 21:10 spate141

Okay... there is nothing obviously wrong with this. I would check a few things:

in the parameter file you provide when running NeuroNER, make sure this line:

dataset_text_folder = ../data/example

points to some folder (in this case ../data/example) with subdirectories train, valid and (optionally) test, e.g. ../data/example/train must exist.
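A quick way to sanity-check that layout before launching a training run (check_dataset_layout is a hypothetical helper, not part of NeuroNER):

```python
import os

def check_dataset_layout(dataset_text_folder, require_test=False):
    """Return the required subfolders missing under dataset_text_folder.

    NeuroNER's dataset_text_folder must contain train/ and valid/
    subdirectories; test/ is optional unless require_test is set.
    """
    required = ["train", "valid"] + (["test"] if require_test else [])
    return [split for split in required
            if not os.path.isdir(os.path.join(dataset_text_folder, split))]
```

An empty return value means the expected folders are in place; anything else lists what is missing.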

Download BRAT (instructions for MacOS here, but there are instructions for Windows and Linux on this repo as well). And load up your dataset. Make sure the BRAT web-server is not throwing any errors or complaining about your dataset, and make sure the annotations appear as expected.

Other than that, I am not sure what is going wrong. Are you able to run NeuroNER successfully with one of the given datasets?

JohnGiorgi avatar Oct 24 '17 21:10 JohnGiorgi


@JohnGiorgi I can see the annotations, and they look correct! Also, I can run the scripts without any errors; it's just that I am not getting any results back!

Thanks for the reply anyway!

spate141 avatar Oct 24 '17 21:10 spate141

Very weird...

Someone who knows much more about NeuroNER's inner workings will have to help you!

JohnGiorgi avatar Oct 24 '17 21:10 JohnGiorgi

I was actually going through all the scripts trying to debug the issue; it just seems a little complex given the structure of the repo. I will post the solution if I'm able to fix this issue!

spate141 avatar Oct 24 '17 21:10 spate141

@JohnGiorgi In case you want to see it, this is the output I am getting from running main.py with everything in its proper place:

Formatting train set from BRAT to CONLL... Done.
Converting CONLL from BIO to BIOES format... Done.
Formatting valid set from BRAT to CONLL... Done.
Converting CONLL from BIO to BIOES format... Done.
Formatting test set from BRAT to CONLL... Done.
Converting CONLL from BIO to BIOES format... Done.
Load dataset... done (18.63 seconds)
Load token embeddings... done (0.05 seconds)
number_of_token_original_case_found: 243
number_of_token_lowercase_found: 94
number_of_token_digits_replaced_with_zeros_found: 0
number_of_token_lowercase_and_digits_replaced_with_zeros_found: 0
number_of_loaded_word_vectors: 337
dataset.vocabulary_size: 339

Starting epoch 0
Training completed in 0.00 seconds
Evaluate model on the train set
processed 196 tokens with 22 phrases; found: 120 phrases; correct: 4.
accuracy:   5.61%; precision:   3.33%; recall:  18.18%; FB1:   5.63
              LOC: precision:   0.00%; recall:   0.00%; FB1:   0.00  4
             MISC: precision:   3.00%; recall:  50.00%; FB1:   5.66  100
              ORG: precision:   0.00%; recall:   0.00%; FB1:   0.00  12
              PER: precision:  25.00%; recall:  20.00%; FB1:  22.22  4

Evaluate model on the valid set
processed 196 tokens with 25 phrases; found: 73 phrases; correct: 0.
accuracy:   7.65%; precision:   0.00%; recall:   0.00%; FB1:   0.00
              LOC: precision:   0.00%; recall:   0.00%; FB1:   0.00  8
             MISC: precision:   0.00%; recall:   0.00%; FB1:   0.00  46
              ORG: precision:   0.00%; recall:   0.00%; FB1:   0.00  17
              PER: precision:   0.00%; recall:   0.00%; FB1:   0.00  2

Evaluate model on the test set
processed 195 tokens with 20 phrases; found: 91 phrases; correct: 2.
accuracy:   5.13%; precision:   2.20%; recall:  10.00%; FB1:   3.60
              LOC: precision:   0.00%; recall:   0.00%; FB1:   0.00  7
             MISC: precision:   2.99%; recall:  33.33%; FB1:   5.48  67
              ORG: precision:   0.00%; recall:   0.00%; FB1:   0.00  14
              PER: precision:   0.00%; recall:   0.00%; FB1:   0.00  3

Generating plots for the train set
Generating plots for the valid set
Generating plots for the test set

Starting epoch 1
Training completed in 0.28 seconds
Evaluate model on the train set
processed 196 tokens with 22 phrases; found: 0 phrases; correct: 0.
accuracy:  83.16%; precision:   0.00%; recall:   0.00%; FB1:   0.00
              LOC: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             MISC: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
              ORG: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
              PER: precision:   0.00%; recall:   0.00%; FB1:   0.00  0

Evaluate model on the valid set
processed 196 tokens with 25 phrases; found: 0 phrases; correct: 0.
accuracy:  84.18%; precision:   0.00%; recall:   0.00%; FB1:   0.00
              LOC: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             MISC: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
              ORG: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
              PER: precision:   0.00%; recall:   0.00%; FB1:   0.00  0

Evaluate model on the test set
processed 195 tokens with 20 phrases; found: 0 phrases; correct: 0.
accuracy:  84.62%; precision:   0.00%; recall:   0.00%; FB1:   0.00
              LOC: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             MISC: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
              PER: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
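As background to the "Converting CONLL from BIO to BIOES format" lines in the log above: in BIOES, single-token entities get an S- tag and entity-final tokens get an E- tag, which is why error messages elsewhere in this thread list tags like S-LOC and E-LOC alongside the B-/I- ones. A minimal sketch of the conversion (hypothetical, not NeuroNER's actual implementation):

```python
def bio_to_bioes(tags):
    """Convert one sentence's BIO tag sequence to BIOES: single-token
    entities become S-<label>, entity-final tokens become E-<label>."""
    bioes = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "O":
            bioes.append("O")
        elif tag.startswith("B-"):
            # the entity continues only if the next tag is I- of the same label
            bioes.append(tag if nxt == "I-" + tag[2:] else "S-" + tag[2:])
        else:  # I- tag: becomes E- when the entity ends here
            bioes.append(tag if nxt == tag else "E-" + tag[2:])
    return bioes
```

For example, `["B-ORG", "I-ORG", "O", "B-PER"]` becomes `["B-ORG", "E-ORG", "O", "S-PER"]`.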

spate141 avatar Oct 24 '17 22:10 spate141

Oh yeah, something is going wrong after the 0-th epoch.

How many examples are in each partition of your dataset?

JohnGiorgi avatar Oct 24 '17 22:10 JohnGiorgi

Train: 3477 files Test: 412 files Valid: 614 files

Each file has an average of 20 sentences. Everything is in .txt and .ann format.

spate141 avatar Oct 24 '17 22:10 spate141

Can you tell me how to add more entities? I have to train on my data so it can detect courses (CSR), designations (DSG), etc. I am trying to do it but am getting an error:

"Please ensure that only the following labels exist in the dataset: {0}".format(', '.join(self.unique_labels))) AssertionError: The label B-CSR does not exist in the pretraining dataset. Please ensure that only the following labels exist in the dataset: B-LOC, B-MISC, B-ORG, B-PER, E-LOC, E-MISC, E-ORG, E-PER, I-LOC, I-MISC, I-ORG, I-PER, O, S-LOC, S-MISC, S-ORG, S-PER

mrinal18 avatar Jan 24 '18 12:01 mrinal18

I would need more information to home in on the error. But in general, adding more entities simply involves labeling your training set and valid set (and possibly your test set) for those entities.

Did you make sure to label CSR and DSG entities in the train set?
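For what it's worth, the AssertionError quoted above amounts to a set-difference check between the labels found in the new dataset and the labels the pre-trained model was trained on. A hypothetical sketch (not NeuroNER's actual code) for finding the offending labels before training:

```python
def find_unknown_labels(dataset_labels, pretrained_labels):
    """Return the dataset labels absent from the pre-trained model's
    label set; any non-empty result corresponds to the AssertionError
    quoted above (e.g. B-CSR not in the CoNLL label inventory)."""
    return sorted(set(dataset_labels) - set(pretrained_labels))
```

Running this over the BIOES-tagged CoNLL files would immediately show which labels (like B-CSR) the pre-trained model has never seen.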

JohnGiorgi avatar Jan 24 '18 13:01 JohnGiorgi

@spate141 I have similar results on my own dataset. I guess this is happening due to the overwhelming number of 'O' labels in the dataset, at least in my case. However, the results become more reasonable after the 6th or 7th epoch, so be patient!

geledek avatar Jan 30 '18 02:01 geledek

Is there a way to use only the "U-" tag, e.g. "Software development" → U-DSG U-DSG, and then train NeuroNER?

mrinal18 avatar Feb 08 '18 12:02 mrinal18

I would like to follow up on @Mrinal18's question. If I first train on a dataset that has the labels "a", "b", and "c", I would then like to use this model as a pre-trained model for my next dataset, which has the labels "a", "b", "c", and "d". Is this possible, and if so, how can I do it? Currently I get the same error as @Mrinal18, but it does make sense, since "d" isn't in the pre-trained model.

JustusJL avatar Mar 16 '18 12:03 JustusJL


I have encountered the same problem recently. Have you solved it? Can you tell me your solution? Thank you!

gorithms avatar Apr 20 '19 13:04 gorithms

I have encountered the same problem recently: with 20 epochs, training stopped at epoch 9 ("Early stop! Finishing the experiment"). Have you solved it? Can you tell me your solution? Thank you!

trinh-hoang-hiep avatar May 29 '20 17:05 trinh-hoang-hiep

Hi, I want to train the model again on my own dataset, but I can't find any source for what you are discussing about training with the command python main.py; I didn't find the main file there. Can anyone explain how I can train NeuroNER on my own CoNLL dataset? I have split the dataset and uploaded it to the respective folders (created a new folder at data\conll\mydata), and I have also changed the dataset_text_folder path in the parameters.ini file.

Any help would be appreciated.

Aj-232425 avatar Mar 29 '23 10:03 Aj-232425

This is the main file. This repo is out of date and you will have better luck finding another alternative to train your custom NER model. Try spaCy or something from that list which suits your requirements.

spate141 avatar Mar 29 '23 11:03 spate141

Thank you so much for replying. Yes, you are right, it is out of date. I did the same thing: renamed the file to main.py and ran the command "python main.py". As I mentioned, I had already uploaded my dataset (train, test, valid, deploy) to the conll folder, and I also made changes in the parameters.ini file, but while running the main file it gives an attribute error on "sess".

Yes, I did look at spaCy, but my end goal is quite different. Let me give you the details.

The aim is to find PHI information in a medical dataset: patient name, telephone, doctor name, hospital name, etc. I labelled the data using Label Studio and exported it in CoNLL format. The data looks like "Robin -X- _ B-PATIENT_NAME". While exploring I found this repo and thought of using it, but as I said, I am getting an attribute error on "sess".

My TF version is 1.13.0.

I have gone through the other issues; they advised importing distutils, but it is not working.

If you could please assist.

Aj-232425 avatar Mar 29 '23 12:03 Aj-232425

@Aj-232425 Which TF version do you have? As far as I remember, NeuroNER is written in TF v1, and the latest version is TF v2.

Try importing this at the top of the main.py file and then run it:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

Also, as @spate141 recommended, try using spaCy for adding custom NER labels.

Here is an example for your use case (this is example code; please modify it accordingly):

import random

import spacy
from spacy.training.example import Example

# load a blank English pipeline
nlp = spacy.blank("en")

# add a fresh entity recognizer to the pipeline (spaCy v3 API)
ner = nlp.add_pipe("ner", last=True)

# register our custom labels (e.g. "PATIENT_NAME", "DR_NAME", "HOSPITAL_NAME")
for label in ["PATIENT_NAME", "DR_NAME", "HOSPITAL_NAME", "TELEPHONE"]:
    ner.add_label(label)

# load the labelled data in CoNLL format (one token per line,
# tag in the last column, sentences separated by blank lines)
with open("data.conll", "r") as f:
    data = f.read().strip().split("\n\n")

# convert the labelled data into spaCy training examples:
# BIO tags become character-offset entity spans
examples = []
for sample in data:
    words, tags = [], []
    for line in sample.split("\n"):
        parts = line.split()
        if parts:
            words.append(parts[0])
            tags.append(parts[-1])
    text = " ".join(words)
    entities, offset, start, label = [], 0, None, None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if start is not None:
                entities.append((start, offset - 1, label))
            start, label = offset, tag[2:]
        elif not tag.startswith("I-") and start is not None:
            # an "O" tag closes any open entity
            entities.append((start, offset - 1, label))
            start, label = None, None
        offset += len(word) + 1
    if start is not None:
        entities.append((start, offset - 1, label))
    examples.append(Example.from_dict(nlp.make_doc(text), {"entities": entities}))

# train the NER model
nlp.initialize(get_examples=lambda: examples)
for i in range(10):
    random.shuffle(examples)
    losses = {}
    for example in examples:
        nlp.update([example], losses=losses)

# test the NER model on a sample text
doc = nlp("Robin called Dr. Smith at 555-1234 to make an appointment at St. Mary's Hospital.")
print([(ent.text, ent.label_) for ent in doc.ents])

mrinal18 avatar Mar 29 '23 16:03 mrinal18