
Infixes Update Not Applying Properly to Tokenizer

Open Rayan-Allali opened this issue 9 months ago • 0 comments


Description

I tried updating the infix patterns in spaCy, but the changes do not apply to the tokenizer. Specifically, I'm trying to change how apostrophes and similar symbols (e.g. ') are handled. However, even after compiling and assigning a new infix regex, the tokenizer does not reflect the change.
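For context, the default behaviour can be reproduced with a blank English pipeline (a minimal sketch; no trained model is needed, and the variable names are mine):

```python
import spacy

# A blank English pipeline is enough to observe the behaviour.
nlp = spacy.blank("en")

# Out of the box, the contraction is split by the English tokenizer:
tokens = [t.text for t in nlp("can't")]
print(tokens)  # ["ca", "n't"]

# The default infix patterns live on the language defaults, and some
# of them contain an apostrophe character:
apostrophe_infixes = [p for p in nlp.Defaults.infixes if "'" in p]
```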

Steps to Reproduce

Here are the two approaches I tried:

1️⃣ Removing apostrophe-related rules from infixes and recompiling:

from spacy.util import compile_infix_regex

# Keep only the infix patterns that do not contain an apostrophe
default_infixes = [pattern for pattern in nlp.Defaults.infixes if "'" not in pattern]
infix_re = compile_infix_regex(default_infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

Issue: Even after removing the apostrophe patterns from the infixes, contractions like "can't" are still split.
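My working theory (an assumption on my part, not confirmed anywhere): contractions like "can't" are split by the tokenizer's special-case rules (exceptions), which are applied before the prefix/suffix/infix patterns, so changing the infix regex cannot affect them. A minimal sketch of what I mean:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Remove apostrophe-bearing patterns from the infixes, as above.
default_infixes = [p for p in nlp.Defaults.infixes if "'" not in p]
nlp.tokenizer.infix_finditer = compile_infix_regex(default_infixes).finditer

# "can't" is still split, because the split comes from a special-case
# rule stored in nlp.tokenizer.rules, not from the infix regex.
tokens = [t.text for t in nlp("can't")]
has_rule = "can't" in nlp.tokenizer.rules
print(tokens, has_rule)
```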

2️⃣ Manually adding a new infix rule for the apostrophe:

import spacy

infixes = nlp.Defaults.infixes + [r"'"]
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
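For what it's worth, this second approach does seem to take effect for strings that are not covered by a special-case rule. A quick check (the token text `qwer'ty` is just an arbitrary made-up word):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# By default there is no bare-apostrophe infix, so this stays one token:
before = [t.text for t in nlp("qwer'ty")]

# Add the apostrophe as an extra infix pattern:
infixes = list(nlp.Defaults.infixes) + [r"'"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# Now the apostrophe is treated as an infix boundary:
after = [t.text for t in nlp("qwer'ty")]
print(before, after)
```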

Expected Behavior

  • The tokenizer should correctly apply the new infix rules.

Actual Behavior

  • Changes to nlp.tokenizer.infix_finditer do not seem to take effect.

Question

Am I missing something in how infix rules should be updated? Is there a correct way to override infix splitting?
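One workaround that does seem to change the behaviour for me is rebuilding the tokenizer from scratch with empty special-case rules, so only the pattern-based splitting applies. A sketch, assuming spaCy v3 (I'm not sure whether this is the recommended approach, which is exactly my question):

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
infix_re = compile_infix_regex(list(nlp.Defaults.infixes) + [r"'"])

# Rebuild the tokenizer: with rules={} no contraction exceptions
# survive, so the infix pattern alone decides how "can't" is split.
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules={},  # drop all special-case rules, including the one for "can't"
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=infix_re.finditer,
    token_match=nlp.tokenizer.token_match,
)

tokens = [t.text for t in nlp("can't")]
print(tokens)
```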

Thanks for your help!

Rayan-Allali · Apr 02 '25 14:04