numerizer
Issue with numerize extension in spaCy version 3.6.1
Problem:
We have been using spaCy with the numerize extension to extract money amounts from strings and convert them into integers. After upgrading from spaCy 3.5.0 to 3.6.1, we can no longer reproduce the previous result for one specific pattern.
import spacy
import numerizer  # importing the package registers the ._.numerize() extension

nlp = spacy.load("en_core_web_trf")
amount = nlp("55  thousand")._.numerize()  # two spaces between 55 and thousand
print(amount)
Expected result (spaCy 3.5.0): {55 thousand: '55000'}
Actual result (spaCy 3.6.1): {55 : '55 '}
With the old spaCy version, the code above correctly extracted the money amount as an integer (i.e., 55000). After upgrading both spaCy and en_core_web_trf to 3.6.1, it no longer picks up the full amount when there are two spaces between the number and the word "thousand".
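As a possible stopgap, assuming the extra whitespace is what triggers the difference, the input could be normalized before it reaches the pipeline. The sketch below is minimal, and the helper name `collapse_whitespace` is hypothetical (it is not part of spaCy or numerizer). We have not confirmed that this restores the 3.5.0 output in every case.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space and trim the ends.

    Hypothetical helper for illustration; not part of spaCy or numerizer.
    """
    return re.sub(r"\s+", " ", text).strip()


raw = "55  thousand"  # two spaces, as in the failing input
cleaned = collapse_whitespace(raw)
print(repr(cleaned))  # '55 thousand'

# The cleaned string would then be passed to the pipeline as before,
# e.g. nlp(cleaned)._.numerize()
```

This only sidesteps the problem on the caller's side. It does not explain why the two versions handle the double space differently.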
Environment:
spaCy version: 3.6.1
Docker image: python:3.11.4-slim-bullseye
Please let me know if this is the right place to raise this issue. I can move it to the spaCy GitHub repo if that's more appropriate.