dateparser icon indicating copy to clipboard operation
dateparser copied to clipboard

Added support for fractional units

Open DenverCoder1 opened this issue 4 years ago • 16 comments

Fixes #753

Added support for fractional units by making the regex pattern support digits before a decimal point and converting num to a float instead of an int on line 150.

What this changes

Before this fix, parse("in 0.5 hours") would return the time 5 hours in the future instead of 0.5 hours in the future.

With this fix, parsing fractional units are no longer parsed incorrectly.

Tests

>>> parse("now")
datetime.datetime(2021, 2, 7, 18, 50, 10, 107501)
>>> parse("in 0.5 hours")
datetime.datetime(2021, 2, 7, 19, 20, 15, 773800)
>>> parse("in 3.75 hours") 
datetime.datetime(2021, 2, 7, 22, 35, 20, 463238)
>>> parse("2.5 hours ago")  
datetime.datetime(2021, 2, 7, 16, 20, 29, 614937)

What was changed

To make it easier for future reviewers, the change is much smaller than it appears to be. The reason there are 125 files changed is because the regex in write_complete_data.py was updated and ran.

Only 3 lines of code were changed in this PR:

  • 2 lines changed in freshness_date_parser.py (14, 150)
  • 1 line changed in write_complete_data.py (34)
  • Test cases for data were updated (68)
  • Test cases for fractional units added (file)

See these comments for shortcuts to view the changes

DenverCoder1 avatar Feb 07 '21 15:02 DenverCoder1

HI @DenverCoder1, thank you for this PR.

Could you add those tests you mention in the description in the project tests? Thanks!

noviluni avatar Feb 08 '21 12:02 noviluni

@Gallaecio Added some tests for fractional units. Also added some support for numbers with commas instead of periods. Adding comma support was not as simple because of the way the Dictionary splits the strings.

DenverCoder1 avatar Feb 08 '21 17:02 DenverCoder1

I see you’ve had to modify the per-language Python files. Those are not meant to be modified manually, and are instead generated from other files in the repository. See https://dateparser.readthedocs.io/en/latest/contributing.html#guidelines-for-editing-translation-data . On the bright side, it may mean that you just need to edit the script that generates them.

Gallaecio avatar Feb 08 '21 18:02 Gallaecio

@Gallaecio

I see. So I should update write_complete_data.py and the yaml files then rerun write_complete_data.py?

DenverCoder1 avatar Feb 08 '21 18:02 DenverCoder1

I am not yet familiar with that process, but I would assume you would have to edit either write_complete_data.py (hopefully) or the yaml files (if they include the \d pattern), and then rerun write_complete_data.py.

This is also assuming that you can get the desired result editing one of those two files. If the patterns came directly from CLDR (unlikely but possible), it may get messier.

Please have a look at en.yml and write_complete_data.py, see if you can figure out where the change should go. Otherwise, @noviluni is the person most familiar with the process, so he probably knows.

Gallaecio avatar Feb 08 '21 20:02 Gallaecio

It seems to me that some of the data in the .py files is coming from the date_translation_data JSON files and some parts (such as decades) are coming from the supplementary YAML files. I made changes in both places to get the .py files to fully update.

DenverCoder1 avatar Feb 08 '21 20:02 DenverCoder1

The JSON files are not meant to be modified (they come from http://cldr.unicode.org/), so this may be not as trivial as I had hoped.

Gallaecio avatar Feb 08 '21 21:02 Gallaecio

The JSON files are not meant to be modified (they come from http://cldr.unicode.org), so this may be not as trivial as I had hoped.

I didn't need to modify the JSON, only had to change the way it is converted in write_complete_data.py.

DenverCoder1 avatar Feb 08 '21 21:02 DenverCoder1

Looking at https://github.com/scrapinghub/dateparser/pull/859 maybe it’s just a matter or reopening.

Gallaecio avatar Feb 09 '21 08:02 Gallaecio

Codecov Report

Base: 98.23% // Head: 98.23% // Increases project coverage by +0.00% :tada:

Coverage data is based on head (5c5b3f1) compared to base (d052d3e). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #876   +/-   ##
=======================================
  Coverage   98.23%   98.23%           
=======================================
  Files         232      232           
  Lines        2602     2604    +2     
=======================================
+ Hits         2556     2558    +2     
  Misses         46       46           
Impacted Files Coverage Δ
dateparser/data/date_translation_data/af.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/am.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/ar.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/ast.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/az-Latn.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/az.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/be.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/bg.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/bn.py 100.00% <ø> (ø)
dateparser/data/date_translation_data/br.py 100.00% <ø> (ø)
... and 105 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Feb 10 '21 08:02 codecov[bot]

Hi @DenverCoder1 @Gallaecio Thank you for the PR.

I like the approach, however, I'm don't want to merge this as-is, because not all the locale/regions accept both (period and comma) as decimal separator. This could lead to some errors.

For example, if we have 1.500 s we don't know if it's 1 point 5 seconds or 1 thousand 500 seconds. In UK and US, they normally use comma (,) as a thousand separator, while in Spanish we usually use the period (.) as a thousand separator.

I tried with this code, and this is what I got:

>>> datetime.now() - dateparser.parse('1,5000 s', languages=['en'])
datetime.timedelta(0, 4999, 995131)

>>> datetime.now() - dateparser.parse('1.5000 s', languages=['en'])
datetime.timedelta(0, 1, 496419)
>>> datetime.now() - dateparser.parse('1,5000 s', languages=['es'])
datetime.timedelta(0, 4999, 996924)

>>> datetime.now() - dateparser.parse('1.5000 s', languages=['es'])
datetime.timedelta(0, 1, 496970)

I personally think that the decimal/comma separator should be configured by default by locale and we should probably be able to configure it manually using the settings, so this approach wouldn't probably work.

Let me know what you think / your ideas :)

noviluni avatar Feb 10 '21 08:02 noviluni

I think you have a point, but I also imagine that those examples are interpreted as 5000 with master. If so, I imagine this change would not break anything that is parsed fine at the moment, but only extend support for fractional numbers without thousand separators.

If so, I would not mind merging as is and creating 2 separate tickets (1) for restricting supported decimal separators based on locale (we can hopefully use data from CLDR to choose the right ones) and (2) for supporting thousand separators.

But yes, we could also aim to solve all of that in this patch. I’m just worried about the extra time that could require.

Gallaecio avatar Feb 10 '21 10:02 Gallaecio

@Gallaecio @noviluni What's the status on this? I don't think I have the time now to try making it locale-based as suggested. I agree that it could be made a separate PR.

As of now, it will make things such as "2.5 hours", etc. parse correctly when right now they don't ("2.5" gets treated as "5").

To make it work with periods only required a couple simple changes. Making it work with commas as well was a bit more complicated due to the way the parser ignores commas, but it should also be working properly with these changes.

If there's a worry that parsing commas as decimal points could break things, that part could be removed for now until the locale settings are in place, or if not, it could be merged as is.

Let me know if there's anything else needed from me.

DenverCoder1 avatar Mar 21 '21 15:03 DenverCoder1

Could I ask the status of this PR? It's approved almost a year ago but still not get merged?

repl-samuel avatar Feb 09 '22 15:02 repl-samuel

We usually follow a 2-maintainer-approval policy before merging. The next step is for another maintainer to have a look. However, most (all?) dateparser maintainers are also maintainers of many other open source libraries, and our time is limited.

Gallaecio avatar Feb 09 '22 19:02 Gallaecio

Would love to get some more visibility on this! ♥️

jgayfer avatar Jun 13 '22 20:06 jgayfer

Updating the base branch will take some extra work due to some changes in the language data.

#1079 specifically has caused some issues since it changed the CLDR data instead of the supplementary yaml. I can open a separate PR to fix that.

DenverCoder1 avatar Oct 23 '22 21:10 DenverCoder1

All languages should be working again now.

Note: The fix in #1090 is also incorporated in this PR.

A code diff overview has been added to the PR description to make it easier to review.

DenverCoder1 avatar Oct 24 '22 00:10 DenverCoder1

Thank you.

serhii73 avatar Oct 31 '22 16:10 serhii73