Error when training on instances that contain only numbers

Open romualdoalan opened this issue 9 months ago • 4 comments

I found an error related to an output-length check in the get_example_output function in postprocessing.py. An AssertionError is raised when the code verifies that the length of the output (complete_output) matches the number of document tokens. It happens for an example whose text consists only of numbers.

The assertion fails only for instances like this one:

{
  "doc_id": "TEST-205",
  "doc_text": "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960",
  "entities": [
    {
      "entity_id": 0,
      "text": "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960",
      "label": "NUMEROS_OUTROS",
      "start_offset": 0,
      "end_offset": 54
    }
  ]
}

Maybe you can give me some insight. Thank you.

The error:

File "D:\Anonimização\NER\postprocessing.py", line 157, in get_example_output
    assert len(complete_output) == len(self.examples[example_ix].doc_tokens), \
AssertionError: Length mismatch for example 169: [ 0  0  0  3  4  4  4  4  9 10 10 10 10 10  9 10 10 10 10 10] !=
             11 in example 169:

doc_id: TEST-205
orig_text:3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960
doc_tokens: [Token(text='3123', offset=0, index=0, tail=' ', tag=None), Token(text='0346', offset=5, index=1, tail=' ', tag=None), Token(text='2154', offset=10, index=2, tail=' ', tag=None), Token(text='8600', offset=15, index=3, tail=' ', tag=None), Token(text='0186', offset=20, index=4, tail=' ', tag=None), Token(text='5500', offset=25, index=5, tail=' ', tag=None), Token(text='1000', offset=30, index=6, tail=' ', tag=None), Token(text='0001', offset=35, index=7, tail=' ', tag=None), Token(text='6015', offset=40, index=8, tail=' ', tag=None), Token(text='3585', offset=45, index=9, tail=' ', tag=None), Token(text='0960', offset=50, index=10, tail='', tag=None)]

labels: ['B-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS']

tags: [NETag(doc_id='HAREM-205', entity_id=0, text='3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960', type='NUMEROS_OUTROS', start_position=0, end_position=10)]

[array([ 0,  0,  0,  3,  4,  4,  4,  4,  9, 10, 10, 10, 10, 10,  9, 10, 10, 10, 10, 10])]
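
To make the mismatch concrete, here is a minimal sketch that recomputes the two lengths the assertion compares. It is not the repository's tokenizer: `whitespace_tokens` is a hypothetical helper that only splits on whitespace, and `complete_output` is copied by hand from the traceback above. It shows the size of the mismatch, not its cause.

```python
import re

doc_text = "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960"


def whitespace_tokens(text):
    """Hypothetical helper: (token, start_offset) pairs split on whitespace."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


tokens = whitespace_tokens(doc_text)

# Values copied from the array printed in the AssertionError message.
complete_output = [0, 0, 0, 3, 4, 4, 4, 4, 9, 10,
                   10, 10, 10, 10, 9, 10, 10, 10, 10, 10]

print(len(doc_text))         # matches end_offset in the JSON instance
print(len(tokens))           # number of document tokens
print(len(complete_output))  # number of output entries being compared
```

Under these assumptions the text is 54 characters long and splits into 11 tokens with offsets 0, 5, ..., 50, which agrees with the doc_tokens dump, while the output array has 20 entries.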

@fabiocapsouza @rodrigonogueira4

romualdoalan, Sep 18 '23 15:09