
A basic NLP question regarding NER task

Open liususan091219 opened this issue 3 years ago • 2 comments

Hi @patrickvonplaten, I have one basic conceptual NLP question regarding the evaluation for NER.

According to run_ner.py, the ground-truth labels are truncated to max_seq_length during prediction, which means the ground truth itself is changed.

My question is: is the prediction still valid? For example, if the ground truth has 150 tokens and max_seq_length = 128, both the prediction and the labels are truncated to 128. Shouldn't the prediction contain 150 tokens for the evaluation to be consistent?

Thank you very much in advance for your help, and apologies if this is the wrong place to post.

liususan091219 · Jul 06 '22 23:07

Hello @liususan091219. I believe the logic you are looking for is in the tokenize_and_align_labels function. It looks like the logic is as follows:

  • if len(labels) > len(tokens) (due to truncation), the extra labels are dropped so that both have the same length
  • if len(labels) < len(tokens) (due to padding), extra labels with the value -100 are added so that both have the same length
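
To make the two cases above concrete, here is a minimal sketch of this kind of alignment. It is not the actual code from run_ner.py: the helper name `align_labels` is hypothetical, and the `word_ids` list is written by hand to stand in for what a fast tokenizer's `word_ids()` would return (`None` for special and padding tokens, otherwise the index of the source word). It also assumes only the first sub-token of each word keeps the label.

```python
from typing import List, Optional

# -100 is the default ignore_index of PyTorch's cross-entropy loss,
# so positions carrying it do not contribute to the loss.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Hypothetical helper: project word-level labels onto token positions."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens and padding get the ignore label.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-token of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-tokens of the same word are ignored (an assumption here).
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Four labelled words, but truncation kept only words 0..2; the last two
# positions are padding. The label of word 3 is never emitted.
word_labels = [1, 0, 2, 3]
word_ids = [None, 0, 1, 1, 2, None, None]
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, 0, -100, 2, -100, -100]
```

In this sketch the label of the truncated word simply disappears, so any metric computed on `aligned` covers only the words that survived truncation, which is the situation the original question asks about.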

sijunhe · Jul 07 '22 08:07

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Aug 06 '22 15:08