number-parser
number-parser copied to clipboard
Ordinal number support
Enhacing library I have made changes in the base structure of the files to support ordinals.
Hi @noviluni sorry for the delayed response , I had composed a reply last night but forgot to hit send it seems. 😬
- I did consider using the approach
ordinal_numbers -> cardinal_numbers -> number
, however I thought of certain cases that might give some issue. If we have something like
twentieth seventh fiftieth third
this approach would give us27 53
as output. As opposed to20 7 50 3
. A presence of an ordinal number indicates the end of the number so this would need to be incorporated. This along with the need to create a complete mapping for other languages like Spanish made me feel that this expanded structure (with same categories for ordinal numbers) would be more robust in the long term.
However if the primary goal is to have a basic integration with date-parser , definitely we can have some sort of mapping in place as suggested so that most of the dates and simple cases in English work. I will create a PR with the code that you have given.
- Thanks for the inputs with respect to long scale implementation , I will create a separate PR with the suggested changes. I did change the variable because it was boolean , however wanted to change it to
IS_LONG_SCALE
didn't realize I had missed thescale
. Going forwardUSE_LONG_SCALE
seems to be the most verbose and accurate one , so will use it.
If we have something like
twentieth seventh fiftieth third
this approach would give us27 53
as output. As opposed to20 7 50 3
Hmm... that's only true if you implement the parse()
method, but I was talking about the parse_ordinal
method and this is not happening.
Let's see (using the function I wrote):
>>> parse_ordinal('twentieth')
20
>>> parse_ordinal('twentieth seventh')
None
>>> parse_ordinal('twenty seventh')
27
So, this is what I want you to do:
Open a new draft PR and add these new functions:
def is_cardinal_token(token, language):
... # Using the LanguageData for the language detects if the word is part of a number
def is_ordinal_token(token, language):
token = _apply_cardinal_converstion(input_string, language)
retutrn is_ordinal_token(token, language)
def is_number_token(token, language):
return is_cardinal_token(token, language) or is_ordinal_token(token, language)
def parse_ordinal(input_string, language='en'):
_apply_cardinal_converstion(input_string, language)
return parse_number(input_string)
def _apply_cardinal_converstion(input_string, language):
CARDINAL_DIRECT_NUMBERS = {'first': 'one', 'second': 'two', 'third': 'three', 'fifth': 'five', 'eighth': 'eight', 'ninth': 'nine', 'twelfth': 'twelve'} # thiw will be coming from the LanguageData
input_string = input_string.lower()
for word, number in CARDINAL_DIRECT_NUMBERS.items():
input_string = input_string.replace(word, number)
input_string = re.sub(r'ieth$', 'y', input_string)
input_string = re.sub(r'th$', '', input_string)
return input_string
We will rewrite completely the _apply_cardinal_converstion()
but for now it's ok.
After building this, we will change the parse()
method to use directly the parse_number()
and parse_ordinal()
instead of the _build_number()
by checking the tokens with is_number_token()
(and/or is_cardinal_token()
and is_ordinal_token()
) and we will refactor the _build_number()
too, but let's discuss this after submitting the draft PR.
I think this approach will be better for testing and will ease a lot the integration with dateparser
. :smile: