Issue with numerize extension in spaCy version 3.6.1

Open hudrizzle opened this issue 9 months ago • 0 comments

Problem:

We have been using spaCy with the numerize extension to extract money amounts from strings and convert them to integers. After upgrading from spaCy 3.5.0 to 3.6.1, we can no longer reproduce the previous result for one specific pattern.

import spacy
import numerizer  # importing numerizer registers the ._.numerize() extension

nlp = spacy.load("en_core_web_trf")
amount = nlp("55  thousand")._.numerize()  # note the two spaces between 55 and thousand
print(amount)

Expected result in spaCy 3.5.0: {55 thousand: '55000'}
Different result in spaCy 3.6.1: {55 : '55 '}

With the old spaCy version, the code above correctly extracted the money amount as an integer (i.e., 55000). After upgrading both spaCy and en_core_web_trf to 3.6.1, it fails to pick up the full amount when there are two spaces between the number and the word "thousand".
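Since the difference only shows up with the doubled space, one possible stopgap is to collapse repeated whitespace before the text reaches the pipeline. The sketch below is an assumption on my side, not a confirmed fix: `normalize_whitespace` is a hypothetical helper name, it uses only the standard library, and I have not verified that it restores the 3.5.0 result on 3.6.1.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends.

    Hypothetical helper for illustration; it is not part of spaCy or numerizer.
    """
    return re.sub(r"\s+", " ", text).strip()


# The input from the report, with two spaces between "55" and "thousand".
cleaned = normalize_whitespace("55  thousand")
print(repr(cleaned))
```

The cleaned string would then be passed to `nlp(...)` in place of the raw input. Note that this changes character offsets relative to the original text, which matters if span positions are used downstream.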

Environment:

spaCy version: 3.6.1
Docker image: python:3.11.4-slim-bullseye

Please let me know if this is the right place to raise this issue. I can move it to the spaCy GitHub repo if that is more appropriate.

hudrizzle, Sep 25 '23 20:09