
NER pipeline adding unnecessary spaces to extracted entities

Open SergeyShk opened this issue 1 year ago • 6 comments

System Info

  • transformers version: 4.27.2
  • Platform: macOS-13.1-x86_64-i386-64bit
  • Python version: 3.10.9
  • Huggingface_hub version: 0.13.3
  • PyTorch version (GPU?): 2.0.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@Narsil @ArthurZucker

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I have been using the NER pipeline of transformers to extract named entities from text. However, I have noticed that in some cases, the pipeline adds unnecessary spaces to the extracted entities, which can cause issues downstream.

For example, when I input the message "Pay 04-00-04", the pipeline extracts the following entity:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForTokenClassification.from_pretrained(MODEL_DIR)
pipe = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    accelerator="bettertransformer",
    aggregation_strategy="first",
)
pipe("Pay 04-00-04")

{
  "entity":"CODE",
  "word":"04 - 00 - 04",
  "start":4,
  "end":12
}

As you can see, the entity includes spaces between the hyphens, which is not correct. This can cause problems when I want to use the extracted entity in further processing, such as database lookups or machine learning models.

I have tested the pipeline on different messages and found that it consistently adds spaces to some entities. The issue appears to be related to the tokenizer used by the pipeline, which splits the text into tokens before feeding it to the NER model.
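As a minimal illustration (not the actual pipeline code), a tokenizer that splits the input into separate tokens and then decodes them by joining with spaces would produce exactly this artifact:

```python
# Illustrative only: "04-00-04" may be split into several tokens by the
# tokenizer, and naive token-by-token decoding rejoins them with spaces.
tokens = ["04", "-", "00", "-", "04"]
decoded = " ".join(tokens)
print(decoded)  # -> 04 - 00 - 04
```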

Thank you for your attention to this matter.

Expected behavior

I would expect to see entities without unnecessary spaces:

{
  "entity":"CODE",
  "word":"04-00-04",
  "start":4,
  "end":12
}

SergeyShk avatar Mar 23 '23 12:03 SergeyShk

Hi @SergeyShk .

This is linked to how tokenizers work, and there's nothing to be done about it (the tokenizer sees no difference whether there was a space or not, so during decoding it may arbitrarily insert one).

However, you do have start and end, which let you recover the exact original string from your text. Would that be enough?
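A short sketch of that approach, using the entity dict from the output shown above:

```python
# Recover the exact surface form by slicing the original text with the
# start/end offsets, instead of relying on the detokenized "word" field.
text = "Pay 04-00-04"
entity = {"entity": "CODE", "word": "04 - 00 - 04", "start": 4, "end": 12}

exact = text[entity["start"]:entity["end"]]
print(exact)  # -> 04-00-04
```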

Narsil avatar Mar 23 '23 14:03 Narsil

I definitely could use start and end manually, but why aren't they used in the pipeline to produce word?

SergeyShk avatar Mar 23 '23 15:03 SergeyShk

Legacy.

This was created before we used the tokenizers library, so offsets were not even an option and indexing back into the original string was not possible. Since we're keen to never break compatibility (until 5.0), it's staying there.

Someone suggested adding yet another key, like better_word, which would contain it, but we decided against it, since that's even more confusing.

Narsil avatar Mar 23 '23 17:03 Narsil

word is also always in lower case. But OK, I get you, I'll use start and end then. Thanks.

SergeyShk avatar Mar 23 '23 18:03 SergeyShk

word is also always in lower case

This depends on the tokenizer.

Narsil avatar Mar 23 '23 22:03 Narsil

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 22 '23 15:04 github-actions[bot]