portuguese-bert
Error when training on instances that contain only numbers
I ran into an output-length error in the get_example_output function in postprocessing.py. An AssertionError is raised when the code checks that the length of the output (complete_output) matches the number of document tokens, and it happens for an example whose text contains only numbers.
The assertion fails only for instances like this one:
{
  "doc_id": "TEST-205",
  "doc_text": "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960",
  "entities": [
    {
      "entity_id": 0,
      "text": "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960",
      "label": "NUMEROS_OUTROS",
      "start_offset": 0,
      "end_offset": 54
    }
  ]
}
Maybe you can give me some insight. Thank you.
The error:
File "D:\Anonimização\NER\postprocessing.py", line 157, in get_example_output
assert len(complete_output) == len(self.examples[example_ix].doc_tokens), \
AssertionError: Length mismatch for example 169: [ 0 0 0 3 4 4 4 4 9 10 10 10 10 10 9 10 10 10 10 10] !=
11 in example 169:
doc_id: TEST-205
orig_text:3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960
doc_tokens: [Token(text='3123', offset=0, index=0, tail=' ', tag=None), Token(text='0346', offset=5, index=1, tail=' ', tag=None), Token(text='2154', offset=10, index=2, tail=' ', tag=None), Token(text='8600', offset=15, index=3, tail=' ', tag=None), Token(text='0186', offset=20, index=4, tail=' ', tag=None), Token(text='5500', offset=25, index=5, tail=' ', tag=None), Token(text='1000', offset=30, index=6, tail=' ', tag=None), Token(text='0001', offset=35, index=7, tail=' ', tag=None), Token(text='6015', offset=40, index=8, tail=' ', tag=None), Token(text='3585', offset=45, index=9, tail=' ', tag=None), Token(text='0960', offset=50, index=10, tail='', tag=None)]
labels: ['B-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS']
tags: [NETag(doc_id='HAREM-205', entity_id=0, text='3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960', type='NUMEROS_OUTROS', start_position=0, end_position=10)]
[array([ 0, 0, 0, 3, 4, 4, 4, 4, 9, 10, 10, 10, 10, 10, 9, 10, 10, 10, 10, 10])]
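The mismatch can be seen outside the training loop using only the values from the traceback. This is a minimal sketch, not code from the repository: `check_lengths` is a hypothetical helper that mirrors just the length comparison in the assertion, and splitting `doc_text` on whitespace is assumed to give the same 11 tokens as the `doc_tokens` shown in the log.

```python
# Values copied from the traceback above.
doc_text = "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960"
complete_output = [0, 0, 0, 3, 4, 4, 4, 4, 9, 10,
                   10, 10, 10, 10, 9, 10, 10, 10, 10, 10]

# Assumption: whitespace splitting reproduces the 11 doc_tokens from the log.
doc_tokens = doc_text.split()


def check_lengths(output, tokens):
    """Hypothetical helper: compare lengths the way the assertion does.

    Returns (lengths_match, number_of_outputs, number_of_tokens).
    """
    return len(output) == len(tokens), len(output), len(tokens)


ok, n_output, n_tokens = check_lengths(complete_output, doc_tokens)
print(f"output={n_output} tokens={n_tokens} match={ok}")
# prints: output=20 tokens=11 match=False
```

So the output has 20 values for 11 document tokens. It looks as if the output is not being reduced to one value per document token for this example, but I have not confirmed why.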
@fabiocapsouza @rodrigonogueira4