NER pipeline adding unnecessary spaces to extracted entities
System Info
- `transformers` version: 4.27.2
- Platform: macOS-13.1-x86_64-i386-64bit
- Python version: 3.10.9
- Huggingface_hub version: 0.13.3
- PyTorch version (GPU?): 2.0.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
@Narsil @ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I have been using the NER pipeline of transformers to extract named entities from text. However, I have noticed that in some cases, the pipeline adds unnecessary spaces to the extracted entities, which can cause issues downstream.
For example, when I input the message "Pay 04-00-04", the pipeline extracts the following entity:
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForTokenClassification.from_pretrained(MODEL_DIR)
pipe = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    accelerator="bettertransformer",
    aggregation_strategy="first",
)
pipe("Pay 04-00-04")
```
```json
{
  "entity": "CODE",
  "word": "04 - 00 - 04",
  "start": 4,
  "end": 12
}
```
As you can see, the entity includes spaces between the hyphens, which is not correct. This can cause problems when I want to use the extracted entity in further processing, such as database lookups or machine learning models.
I have tested the pipeline on different messages and have found that it consistently adds spaces to some entities. This issue seems to be related to the tokenizer used by the pipeline, which splits the text into tokens before feeding it to the NER model.
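To illustrate the suspected cause (a sketch, not the actual pipeline internals): when the aggregation step rebuilds the grouped entity's `word` from the individual tokens, the original spacing information is gone, so the tokens end up joined with single spaces:

```python
# Minimal illustration of how a grouped "word" can gain spaces.
# "04-00-04" is split into several tokens by the tokenizer; the
# aggregated word is then rebuilt by joining those tokens, and the
# join inserts spaces that were not in the original text.
tokens = ["04", "-", "00", "-", "04"]  # assumed token split of "04-00-04"
word = " ".join(tokens)
print(word)  # "04 - 00 - 04" -- spacing no longer matches the input
```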
Thank you for your attention to this matter.
Expected behavior
I would expect to see entities without unnecessary spaces:
```json
{
  "entity": "CODE",
  "word": "04-00-04",
  "start": 4,
  "end": 12
}
```
Hi @SergeyShk .
This is linked to how tokenizers work, and there's nothing to be done about it: the tokenizer sees no difference whether there was a space or not, so during decoding it can arbitrarily choose to insert one.
However, you do have `start` and `stop`, which can help you recover the exact original string within your text.
Would that be enough?
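A minimal sketch of that workaround, using the `start`/`end` character offsets from the entity dict shown in the report to slice the original input instead of trusting the decoded `word`:

```python
text = "Pay 04-00-04"
# Entity as returned in the report above (offsets are character positions
# into the original input string).
entity = {"entity": "CODE", "word": "04 - 00 - 04", "start": 4, "end": 12}

# Recover the exact original span from the input text.
recovered = text[entity["start"]:entity["end"]]
print(recovered)  # "04-00-04"
```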
I definitely could use `start` and `stop` manually, but why aren't they used in the pipeline to get `word`?
Legacy.
This was created before we used the `tokenizers` library, and therefore offsets were not even an option, so indexing back was not possible. Since we're keen to never break compatibility (until 5.0), it's staying there.
Someone suggested adding yet another key like `better_word` which would contain it, but we decided against it, since it's even more confusing.
`word` is also always in lower case. But ok, I get you, will use `start` and `stop` then. Thanks.
> word is also always in lower case

This depends on the tokenizer.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.