Why is this tokenization weird?
On my Mac Pro running macOS Sequoia, using spaCy 3.7.5 with the en_core_web_trf model, the following sentence:
How should writes to T012K and T012T be handled?
is tokenized as:
T012K => T012 and K
T012T => T012 and T
The other tokens look fine. This is a perfectly ordinary sentence, so why does the tokenizer split these identifiers?
Hi @lingvisa,
I replicated your issue in a virtual environment, as shown below:
import spacy
import spacy_transformers
nlp = spacy.load("en_core_web_trf")
text = "SAP Table T012K and Table T012T"
doc = nlp(text)
print([w.text for w in doc]) # ['SAP', 'Table', 'T012', 'K', 'and', 'Table', 'T012', 'T']
You can use the nlp.tokenizer.explain(text) method to see which tokenization pattern or rule is responsible for each split. For this text, it returns:
[('TOKEN', 'SAP'),
('TOKEN', 'Table'),
('TOKEN', 'T012'),
('SUFFIX', 'K'),
('TOKEN', 'and'),
('TOKEN', 'Table'),
('TOKEN', 'T012'),
('SUFFIX', 'T')]
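For reference, here's how that debug output can be produced, continuing from the replication snippet above (nlp and text are the same objects):

# Show which tokenizer rule produced each token
tok_exp = nlp.tokenizer.explain(text)
for pattern_name, substring in tok_exp:
    print(pattern_name, substring)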
While you certainly can customize the patterns used by the tokenizer per the docs, you would need to modify the existing rule set (namely, the suffix rules) carefully; a rough sketch of that approach follows.
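To illustrate, here's a minimal sketch, assuming the split comes from one of the default digit-anchored suffix patterns (you would want to inspect nlp.Defaults.suffixes for your spaCy version to confirm which pattern is actually responsible before dropping it):

import spacy
import spacy_transformers
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_trf")

# Rebuild the suffix rules without the digit-anchored patterns.
# This filter is deliberately crude and for illustration only; dropping
# these patterns also affects unit-style splits such as "10km".
suffixes = [s for s in nlp.Defaults.suffixes if "0-9" not in s]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

print([w.text for w in nlp("SAP Table T012K and Table T012T")])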
In your domain, however, these tokens should be treated as special cases; the preferable solution is the tokenizer.add_special_case() method:
import spacy
import spacy_transformers
from spacy.attrs import ORTH

nlp = spacy.load("en_core_web_trf")

# Add special case rules so these table names stay single tokens
special_cases = ["T012K", "T012T"]
for case in special_cases:
    nlp.tokenizer.add_special_case(case, [{ORTH: case}])

# Check the new tokenization
doc = nlp("SAP Table T012K and Table T012T")
print([w.text for w in doc]) # ['SAP', 'Table', 'T012K', 'and', 'Table', 'T012T']
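One thing to keep in mind: as far as I know, special case rules are exact string matches, so every identifier (and casing variant) you care about needs to be added explicitly. If you have a long list of SAP table names, you can simply loop over all of them as above, e.g. loaded from a file.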
See these docs for more information. Future spaCy versions seem to handle this case as well.
Notably, if you run the nlp.tokenizer.explain(text) method again after adding the special cases, the output becomes:
[('TOKEN', 'SAP'),
('TOKEN', 'Table'),
('SPECIAL-1', 'T012K'),
('TOKEN', 'and'),
('TOKEN', 'Table'),
('SPECIAL-1', 'T012T')]
Hope it helps.