biobert
NER detokenize index error
Hi Authors, I am trying to fine-tune BioBERT for the NER task, using datasets from the BioNLP challenges. I am running into two issues:
ISSUE-1
I see thousands of warnings like the following from ner_detokenize.py:
## The predicted sentence of BioBERT model looks like trimmed. (The Length of the tokenized input sequence is longer than max_seq_length); Filling O label instead.
-> Showing 10 words near skipped part : x C57BL / 6 ) F1 mice . [SEP] [CLS] We
I checked the sentences, and their tokenized length is not longer than the max_seq_length value (max_seq_length=256), contrary to what the warning says. For example, here is the tokenized sentence referred to in the warning above:
[CLS]
The
trans
##gene
was
pu
##rified
and
injected
into
C
##5
##7
##BL
/
6
##J
x
CB
##A
F1
z
##y
##got
##es
.
[SEP]
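For reference, here is roughly the kind of length check I did (just a sketch; it assumes the tokenization.py module shipped with the BERT/BioBERT code is importable, and the vocab path is a placeholder for $BIOBERT_DIR/vocab.txt):

```python
# Sketch of the length check: count WordPiece tokens for one pre-processed
# sentence and compare against max_seq_length. tokenization.py is the module
# shipped with the BERT/BioBERT code; the vocab path is a placeholder.
import tokenization

MAX_SEQ_LENGTH = 256
tokenizer = tokenization.FullTokenizer(
    vocab_file="vocab.txt", do_lower_case=False)  # BioBERT uses a cased vocab

sentence = ("The transgene was purified and injected into "
            "C57BL / 6J x CBA F1 zygotes .")
pieces = tokenizer.tokenize(sentence)
# +2 accounts for the [CLS] and [SEP] tokens added by run_ner.py
print(len(pieces) + 2, "tokens vs max_seq_length =", MAX_SEQ_LENGTH)
```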
Could you please tell me what might be causing these warnings?
ISSUE-2
The ner_detokenize.py script is throwing this error:
idx: 179999 offset: 116302
idx: 180000 offset: 116302
idx: 180001 offset: 116302
Traceback (most recent call last):
File "biocodes/ner_detokenize.py", line 159, in <module>
transform2CoNLLForm(golden_path=args.answer_path, output_dir=args.output_dir, bert_pred=bert_pred, debug=args.debug)
File "biocodes/ner_detokenize.py", line 137, in transform2CoNLLForm
if ans['labels'][idx+offset] != '[SEP]':
IndexError: list index out of range
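For what it's worth, here is the kind of sanity check I had in mind for this IndexError: comparing how many gold labels test.tsv provides against how many word-level labels the predictions contain. This is only a sketch; the file-layout assumptions (one token/label pair per non-blank line in test.tsv, and the special X / [CLS] / [SEP] labels in label_test.txt) are mine and may not match the script's exact parsing.

```python
# Sketch of a sanity check for the IndexError above. Assumptions: test.tsv has
# one token/label pair per non-blank line, and label_test.txt contains the
# labels written by run_ner.py, including the special X, [CLS], [SEP] labels.
def count_gold_labels(path="test.tsv"):
    with open(path) as f:
        return sum(1 for line in f if line.strip())

def count_predicted_word_labels(path="label_test.txt"):
    with open(path) as f:
        return sum(1 for item in f.read().split()
                   if item not in ("X", "[CLS]", "[SEP]"))

print("gold labels      :", count_gold_labels())
print("predicted labels :", count_predicted_word_labels())
```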
Can you please tell me what's causing this error?
I would really appreciate your help.
AK
Here are my commands:
# predict labels for test set
mkdir $OUTPUT_DIR
python3 run_ner.py \
--do_train=false \
--do_predict=true \
--vocab_file=$BIOBERT_DIR/vocab.txt \
--bert_config_file=$BIOBERT_DIR/bert_config.json \
--init_checkpoint=$TRAINED_CLASSIFIER \
--data_dir=$NER_DIR/ \
--max_seq_length=256 \
--output_dir=$OUTPUT_DIR
## compute entity level performance
python3 biocodes/ner_detokenize.py \
--token_test_path=$OUTPUT_DIR/token_test.txt \
--label_test_path=$OUTPUT_DIR/label_test.txt \
--answer_path=$NER_DIR/test.tsv \
--output_dir=$OUTPUT_DIR
Hi AK, The warning (ISSUE-1) is raised when ner_detokenize.py cannot reconstruct the original sentence (i.e. the reconstructed sentence != the original pre-processed sentence from answer_path).
It seems that the tokenization of the original dataset is not compatible with the BERT BPE tokenizer.
I am not sure, since I wasn't able to check the original dataset or your pre-processed dataset, but the / near C57BL / 6 looks like the weak point.
You need to add extra spacing before and after the / so that it gets its own line in the original sentence (check $NER_DIR/test.tsv).
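Roughly, what I mean is something like this (just a sketch; the set of characters to pad is illustrative, not the exact pre-processing we used):

```python
# Rough sketch of "extra spacing": pad characters such as "/" with spaces so
# that each one becomes its own token (its own line) in test.tsv.
# The character set below is only an example.
import re

def space_special_chars(text):
    text = re.sub(r"([/()\[\],;:])", r" \1 ", text)
    return re.sub(r"\s+", " ", text).strip()

print(space_special_chars("(C57BL/6)F1 mice"))
# -> "( C57BL / 6 ) F1 mice"
```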
Also, #107 may help you.
ISSUE-2 will be resolved when ISSUE-1 gets solved.
PS) I am thinking of releasing the tokenization code in the near future. Unfortunately, I (as are the other authors) am extremely busy due to this difficult situation (COVID-19) and do not have enough time to respond.
We will get back to this topic soon! Thanks and take care, Wonjin
Hi Wonjin, Thanks so much for the prompt help. I used spaCy to tokenize (i.e. preprocess) my dataset. I think there might be two separate issues causing the problem. Please see below.
Issue with my preprocessing
## The predicted sentence of BioBERT model looks like trimmed. (The Length of the tokenized input sequence is longer than max_seq_length); Filling O label instead.
-> Showing 10 words near skipped part : diseases comprise Parkinson 's disease , Huntington 's disease and Alzheimer
I tokenized this sentence using BioBERT's BasicTokenizer and FullTokenizer, then compared the results with the output from my preprocessing as well as with the token_test.txt generated during prediction (inference) by BioBERT.
My tokenizer | BioBERT BasicTokenizer | BioBERT FullTokenizer | token_test.txt |
---|---|---|---|
In | ['In', | ['In', | In |
addition | 'addition', | 'addition', | addition |
, | ',', | ',', | , |
the | 'the', | 'the', | the |
compound | 'compound', | 'compound', | compound |
of | 'of', | 'of', | of |
the | 'the', | 'the', | the |
invention | 'invention', | 'invention', | invention |
can | 'can', | 'can', | can |
be | 'be', | 'be', | be |
used | 'used', | 'used', | used |
for | 'for', | 'for', | for |
preparing | 'preparing', | 'preparing', | preparing |
medicines | 'medicines', | 'medicines', | medicines |
for | 'for', | 'for', | for |
preventing | 'preventing', | 'preventing', | preventing |
or | 'or', | 'or', | or |
treating | 'treating', | 'treating', | treating |
neurodegenerative | 'neurodegenerative', | 'ne', | ne |
diseases | 'diseases', | '##uro', | ##uro |
caused | 'caused', | '##de', | ##de |
by | 'by', | '##gene', | ##gene |
free | 'free', | '##rative', | ##rative |
radicals | 'radicals', | 'diseases', | diseases |
oxidative | 'oxidative', | 'caused', | caused |
damage | 'damage', | 'by', | by |
, | ',', | 'free', | free |
wherein | 'wherein', | 'radical', | radical |
the | 'the', | '##s', | ##s |
neurodegenerative | 'neurodegenerative', | 'o', | o |
diseases | 'diseases', | '##xi', | ##xi |
comprise | 'comprise', | '##da', | ##da |
Parkinson | 'Parkinson', | '##tive', | ##tive |
's | "'", | 'damage', | damage |
disease | 's', | ',', | , |
, | 'disease', | 'wherein', | wherein |
Huntington | ',', | 'the', | the |
's | 'Huntington', | 'ne', | ne |
disease | "'", | '##uro', | ##uro |
and | 's', | '##de', | ##de |
Alzheimer | 'disease', | '##gene', | ##gene |
's | 'and', | '##rative', | ##rative |
disease | 'Alzheimer', | 'diseases', | diseases |
. | "'", | 'comprise', | comprise |
's', | 'Parkinson', | Parkinson | |
'disease', | "'", | ' | |
'.'] | 's', | s | |
'disease', | disease | ||
',', | , | ||
'Huntington', | Huntington | ||
"'", | ' | ||
's', | s | ||
'disease', | disease | ||
'and', | and | ||
'Alzheimer', | Alzheimer | ||
"'", | ' | ||
's', | s | ||
'disease', | disease | ||
'.'] | . |
Tokens from my preprocessing do not match those from BioBERT's BasicTokenizer. This is the problem that you mentioned in your last comment.
Question-1: Do you think replacing spaCy with BasicTokenizer in my workflow will solve this issue?
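For clarity, this is the kind of replacement I have in mind (a sketch, not my actual pipeline; BasicTokenizer comes from tokenization.py in this repository):

```python
# Sketch: use BERT's BasicTokenizer instead of spaCy for the word-level split
# that produces test.tsv, so the tokens line up with what run_ner.py later
# splits into WordPieces.
from tokenization import BasicTokenizer  # from tokenization.py in this repo

basic = BasicTokenizer(do_lower_case=False)  # BioBERT is cased

sentence = "The compound can be used for treating Parkinson's disease."
for token in basic.tokenize(sentence):
    print(token)  # one token per line, as test.tsv expects
# Note how "Parkinson's" comes out as: Parkinson ' s
```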
Issue with ner_detokenize.py
## The predicted sentence of BioBERT model looks like trimmed. (The Length of the tokenized input sequence is longer than max_seq_length); Filling O label instead.
-> Showing 10 words near skipped part : Compared with similar compounds , the phenylamine acid compound has good
Here are the tokens from the different tokenizers and from token_test.txt for this sentence:
My tokenizer | BioBERT BasicTokenizer | BioBERT FullTokenizer | token_test.txt |
---|---|---|---|
Compared | ['Compared', | ['Compared', | Compared |
with | 'with', | 'with', | with |
similar | 'similar', | 'similar', | similar |
compounds | 'compounds', | 'compounds', | compounds |
, | ',', | ',', | , |
the | 'the', | 'the', | the |
phenylamine | 'phenylamine', | 'p', | p |
acid | 'acid', | '##hen', | ##hen |
compound | 'compound', | '##yla', | ##yla |
has | 'has', | '##mine', | ##mine |
good | 'good', | 'acid', | acid |
effect | 'effect', | 'compound', | compound |
of | 'of', | 'has', | has |
inducing | 'inducing', | 'good', | good |
the | 'the', | 'effect', | effect |
activation | 'activation', | 'of', | of |
of | 'of', | 'in', | in |
HIV | 'HIV', | '##ducing', | ##ducing |
latent | 'latent', | 'the', | the |
cells | 'cells', | 'activation', | activation |
, | ',', | 'of', | of |
and | 'and', | 'HIV', | HIV |
mainly | 'mainly', | 'late', | late |
has | 'has', | '##nt', | ##nt |
low | 'low', | 'cells', | cells |
toxicity | 'toxicity', | ',', | , |
to | 'to', | 'and', | and |
cells | 'cells', | 'mainly', | mainly |
. | '.'] | 'has', | has |
'low', | low | ||
'toxicity', | toxicity | ||
'to', | to | ||
'cells', | cells | ||
'.'] | . |
In this case, the tokens from my preprocessing workflow match those from BioBERT's BasicTokenizer, and they would match the token_test.txt from the predictions if detokenized properly.
Question-2: Do you think there is a bug in ner_detokenize.py, i.e. it is making a mistake in reconstructing the original tokens and there are no issues with my preprocessing?
Question-3: Is only the ner_detokenize step affected by, or dependent on, the specific pre-processing workflow that your group uses? In other words, is run_ner.py for fine-tuning and inference independent of your pre-processing, so that I could write my own detokenizer script to parse the predictions (token_test.txt and label_test.txt) and compute entity-level performance?
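To illustrate Question-3, here is a rough sketch of the kind of detokenizer I mean (not ner_detokenize.py itself; the assumptions are mine: token_test.txt and label_test.txt are read as whitespace-separated items, sub-words start with ##, and a word takes the label of its first WordPiece):

```python
# Sketch of a minimal detokenizer for the run_ner.py prediction outputs.
# Assumptions: files are whitespace-separated, sub-words start with "##",
# and a word takes the label of its first WordPiece.
def read_items(path):
    with open(path) as f:
        return f.read().split()

def detokenize(token_path="token_test.txt", label_path="label_test.txt"):
    words, labels = [], []
    for tok, lab in zip(read_items(token_path), read_items(label_path)):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # glue the sub-word onto the previous word
        else:
            words.append(tok)
            labels.append(lab)     # keep the label of the first WordPiece
    return words, labels

words, labels = detokenize()
for w, l in zip(words, labels):
    print(w, l)
```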
Hi @wonjininfo, I updated my preprocessing workflow; it now uses BioBERT's tokenizers. I guess that fixes the "Issue with my preprocessing" mentioned in my last comment.
I still get the warnings and the error posted in my first comment, so I guess the "Issue with ner_detokenize.py" from my last comment still exists. Any thoughts?
AK
@atulkakrana I am having similar problems when I fine-tune on multiple entities. Did you find a workaround? Can you share, please?
Hi all, would you check this comment: https://github.com/dmis-lab/biobert/issues/107#issuecomment-615558492
Hi, the pre-processing of the datasets was mostly done by other co-authors. I tried NLTK for my other project, but it seems that NLTK is not compatible with the BERT tokenizer (especially near special characters). So I took the tokenizer code from this repository (written by co-authors) and modified it for my own use (see the end of this comment for the modified code).
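As a quick illustration of the incompatibility I mean (this is not the modified code, just a sketch comparing the two tokenizers near special characters):

```python
# Quick illustration: NLTK and BERT's BasicTokenizer disagree around special
# characters such as "/" and apostrophes, which is exactly where
# ner_detokenize.py tends to fail.
from nltk.tokenize import word_tokenize      # requires nltk + the punkt model
from tokenization import BasicTokenizer      # from this repository

text = "(C57BL/6)F1 mice develop Parkinson's disease."
print(word_tokenize(text))
# e.g. ['(', 'C57BL/6', ')', 'F1', 'mice', 'develop', 'Parkinson', "'s", 'disease', '.']
print(BasicTokenizer(do_lower_case=False).tokenize(text))
# e.g. ['(', 'C57BL', '/', '6', ')', 'F1', 'mice', 'develop', 'Parkinson', "'", 's', 'disease', '.']
```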