dateparser
dateparser copied to clipboard
Added support for fractional units
Fixes #753
Added support for fractional units by making the regex pattern support digits before a decimal point and converting num to a float instead of an int on line 150.
What this changes
Before this fix, parse("in 0.5 hours") would return the time 5 hours in the future instead of 0.5 hours in the future.
With this fix, parsing fractional units are no longer parsed incorrectly.
Tests
>>> parse("now")
datetime.datetime(2021, 2, 7, 18, 50, 10, 107501)
>>> parse("in 0.5 hours")
datetime.datetime(2021, 2, 7, 19, 20, 15, 773800)
>>> parse("in 3.75 hours")
datetime.datetime(2021, 2, 7, 22, 35, 20, 463238)
>>> parse("2.5 hours ago")
datetime.datetime(2021, 2, 7, 16, 20, 29, 614937)
What was changed
To make it easier for future reviewers, the change is much smaller than it appears to be. The reason there are 125 files changed is because the regex in write_complete_data.py was updated and ran.
Only 3 lines of code were changed in this PR:
HI @DenverCoder1, thank you for this PR.
Could you add those tests you mention in the description in the project tests? Thanks!
@Gallaecio Added some tests for fractional units. Also added some support for numbers with commas instead of periods. Adding comma support was not as simple because of the way the Dictionary splits the strings.
I see you’ve had to modify the per-language Python files. Those are not meant to be modified manually, and are instead generated from other files in the repository. See https://dateparser.readthedocs.io/en/latest/contributing.html#guidelines-for-editing-translation-data . On the bright side, it may mean that you just need to edit the script that generates them.
@Gallaecio
I see. So I should update write_complete_data.py and the yaml files then rerun write_complete_data.py?
I am not yet familiar with that process, but I would assume you would have to edit either write_complete_data.py (hopefully) or the yaml files (if they include the \d pattern), and then rerun write_complete_data.py.
This is also assuming that you can get the desired result editing one of those two files. If the patterns came directly from CLDR (unlikely but possible), it may get messier.
Please have a look at en.yml and write_complete_data.py, see if you can figure out where the change should go. Otherwise, @noviluni is the person most familiar with the process, so he probably knows.
It seems to me that some of the data in the .py files is coming from the date_translation_data JSON files and some parts (such as decades) are coming from the supplementary YAML files. I made changes in both places to get the .py files to fully update.
The JSON files are not meant to be modified (they come from http://cldr.unicode.org/), so this may be not as trivial as I had hoped.
The JSON files are not meant to be modified (they come from http://cldr.unicode.org), so this may be not as trivial as I had hoped.
I didn't need to modify the JSON, only had to change the way it is converted in write_complete_data.py.
Looking at https://github.com/scrapinghub/dateparser/pull/859 maybe it’s just a matter or reopening.
Codecov Report
Base: 98.23% // Head: 98.23% // Increases project coverage by +0.00% :tada:
Coverage data is based on head (
5c5b3f1) compared to base (d052d3e). Patch coverage: 100.00% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## master #876 +/- ##
=======================================
Coverage 98.23% 98.23%
=======================================
Files 232 232
Lines 2602 2604 +2
=======================================
+ Hits 2556 2558 +2
Misses 46 46
| Impacted Files | Coverage Δ | |
|---|---|---|
| dateparser/data/date_translation_data/af.py | 100.00% <ø> (ø) |
|
| dateparser/data/date_translation_data/am.py | 100.00% <ø> (ø) |
|
| dateparser/data/date_translation_data/ar.py | 100.00% <ø> (ø) |
|
| dateparser/data/date_translation_data/ast.py | 100.00% <ø> (ø) |
|
| dateparser/data/date_translation_data/az-Latn.py | 100.00% <ø> (ø) |
|
| dateparser/data/date_translation_data/az.py | 100.00% <ø> (ø) |
|
| dateparser/data/date_translation_data/be.py | 100.00% <ø> (ø) |
|
| dateparser/data/date_translation_data/bg.py | 100.00% <ø> (ø) |
|
| dateparser/data/date_translation_data/bn.py | 100.00% <ø> (ø) |
|
| dateparser/data/date_translation_data/br.py | 100.00% <ø> (ø) |
|
| ... and 105 more |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
Hi @DenverCoder1 @Gallaecio Thank you for the PR.
I like the approach, however, I'm don't want to merge this as-is, because not all the locale/regions accept both (period and comma) as decimal separator. This could lead to some errors.
For example, if we have 1.500 s we don't know if it's 1 point 5 seconds or 1 thousand 500 seconds. In UK and US, they normally use comma (,) as a thousand separator, while in Spanish we usually use the period (.) as a thousand separator.
I tried with this code, and this is what I got:
>>> datetime.now() - dateparser.parse('1,5000 s', languages=['en'])
datetime.timedelta(0, 4999, 995131)
>>> datetime.now() - dateparser.parse('1.5000 s', languages=['en'])
datetime.timedelta(0, 1, 496419)
>>> datetime.now() - dateparser.parse('1,5000 s', languages=['es'])
datetime.timedelta(0, 4999, 996924)
>>> datetime.now() - dateparser.parse('1.5000 s', languages=['es'])
datetime.timedelta(0, 1, 496970)
I personally think that the decimal/comma separator should be configured by default by locale and we should probably be able to configure it manually using the settings, so this approach wouldn't probably work.
Let me know what you think / your ideas :)
I think you have a point, but I also imagine that those examples are interpreted as 5000 with master. If so, I imagine this change would not break anything that is parsed fine at the moment, but only extend support for fractional numbers without thousand separators.
If so, I would not mind merging as is and creating 2 separate tickets (1) for restricting supported decimal separators based on locale (we can hopefully use data from CLDR to choose the right ones) and (2) for supporting thousand separators.
But yes, we could also aim to solve all of that in this patch. I’m just worried about the extra time that could require.
@Gallaecio @noviluni What's the status on this? I don't think I have the time now to try making it locale-based as suggested. I agree that it could be made a separate PR.
As of now, it will make things such as "2.5 hours", etc. parse correctly when right now they don't ("2.5" gets treated as "5").
To make it work with periods only required a couple simple changes. Making it work with commas as well was a bit more complicated due to the way the parser ignores commas, but it should also be working properly with these changes.
If there's a worry that parsing commas as decimal points could break things, that part could be removed for now until the locale settings are in place, or if not, it could be merged as is.
Let me know if there's anything else needed from me.
Could I ask the status of this PR? It's approved almost a year ago but still not get merged?
We usually follow a 2-maintainer-approval policy before merging. The next step is for another maintainer to have a look. However, most (all?) dateparser maintainers are also maintainers of many other open source libraries, and our time is limited.
Would love to get some more visibility on this! ♥️
Updating the base branch will take some extra work due to some changes in the language data.
#1079 specifically has caused some issues since it changed the CLDR data instead of the supplementary yaml. I can open a separate PR to fix that.
All languages should be working again now.
Note: The fix in #1090 is also incorporated in this PR.
A code diff overview has been added to the PR description to make it easier to review.
Thank you.