dateparser icon indicating copy to clipboard operation
dateparser copied to clipboard

WIP: Reimplementing `search_dates`

Open gavishpoddar opened this issue 2 years ago • 19 comments

Reimplementing and simplifying search_dates

A reimplemented and simplified search_dates which more directly uses dateparser.parse, improves accuracy and fixes many bugs

New Feature:

  • search_first_date - searches and returns the first date from the given text.

NOTE: This PR is inspired by the previous implementation of search_dates and #931.

TODO

  • [x] Gather feedback on the API and it's functions.
  • [x] Write docs.
  • [x] Write tests.
  • [x] Chack search translation joints.
  • [x] Fix DATE_ORDER

gavishpoddar avatar Jul 16 '21 22:07 gavishpoddar

This PR will also solve many issues with the search_dates.

  • #924
  • #894
  • #918
  • #521
  • #582
  • #681
  • #774
  • #706 (Correct datetime object but wrong string) (related issue 2)
  • #856 (Correct datetime object but wrong string) (related issue 2)

Somewhat address.

  • #507 (Now returns date object but not range)
  • #689 (Now returnes datetime.datetime(2019, 2, 10, 0, 0) but the issue is not clear)
  • #695
  • #787 (Now returnes datetime object but the issue is not clear)

Some other issue fixed:

2021-08-04T14:21:37+05:30

would parse 05:30 only after this PR

2021-08-04T14:21:37 and 05:30

NOTE: THIS LIST IS NO LONGER ACCURATE ANYMORE

gavishpoddar avatar Jul 16 '21 23:07 gavishpoddar

Hi, I need a suggestion should I use translated chunks or the original chunks to further parse the data objects. I have currently used translated chunks instead of original chunks as this increased accuracy in some basic tests.

Thanks, and please suggest.

gavishpoddar avatar Jul 17 '21 11:07 gavishpoddar

Hi, I need a suggestion should I use translated chunks or the original chunks to further parse the data objects. I have currently used translated chunks instead of original chunks as this increased accuracy in some basic tests.

Thanks, and please suggest.

Hi @gavishpoddar, using translated chunks makes sense. :+1:

noviluni avatar Jul 19 '21 07:07 noviluni

Codecov Report

Base: 98.23% // Head: 98.10% // Decreases project coverage by -0.12% :warning:

Coverage data is based on head (99e66c6) compared to base (0f5d9c5). Patch coverage: 96.12% of modified lines in pull request are covered.

:exclamation: Current head 99e66c6 differs from pull request most recent head 2935aae. Consider uploading reports for the commit 2935aae to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #945      +/-   ##
==========================================
- Coverage   98.23%   98.10%   -0.13%     
==========================================
  Files         232      235       +3     
  Lines        2604     2692      +88     
==========================================
+ Hits         2558     2641      +83     
- Misses         46       51       +5     
Impacted Files Coverage Δ
dateparser/search/search.py 94.73% <94.33%> (-4.64%) :arrow_down:
dateparser/search/__init__.py 100.00% <100.00%> (ø)
dateparser/search/languages.py 100.00% <100.00%> (ø)
dateparser/date_parser.py 93.75% <0.00%> (-0.19%) :arrow_down:
dateparser/date.py 99.24% <0.00%> (-0.04%) :arrow_down:
dateparser/parser.py 99.01% <0.00%> (-0.01%) :arrow_down:
dateparser/utils/__init__.py 95.27% <0.00%> (ø)
dateparser/languages/locale.py 98.69% <0.00%> (ø)
dateparser/conf.py 100.00% <0.00%> (ø)
dateparser/__init__.py 100.00% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Jul 19 '21 07:07 codecov[bot]

Hi @noviluni, I have made the changes for tests to work. Can you please approve the workflow approval?

gavishpoddar avatar Jul 21 '21 18:07 gavishpoddar

@gavishpoddar workflows approved 👍

lopuhin avatar Jul 22 '21 06:07 lopuhin

Hi @noviluni, @lopuhin, @kishan3 The new search_dates implementation is nearing its completion.

Code reviews would be great to receive.

gavishpoddar avatar Aug 03 '21 17:08 gavishpoddar

I need help with one test

StaticTzInfo 'UTC' is expected. UTC is received from new search_dates

gavishpoddar avatar Aug 03 '21 17:08 gavishpoddar

Hi after some changes the codes are very much compatible with the old search_dates. Some tests are modified as the working of the search_dates has been changed.

Currently, this PR is left with some docs changes (just replacing the current with new docs and search_first_date).

gavishpoddar avatar Aug 06 '21 10:08 gavishpoddar

Some tests are modified as the working of the search_dates has been changed.

What are the differences?

Also I can see that right now both implementations are present, do we plan to have them both, or to remove the old?

What is the performance of the new approach compared to the old one?

lopuhin avatar Aug 16 '21 07:08 lopuhin

Please use this link to see changes to tests.

gavishpoddar avatar Aug 16 '21 15:08 gavishpoddar

Speed

New Implementation (This PR) - 44.380413369 seconds Old Implementation - 44.226806061 seconds

After removing dates

New Implementation (This PR) - 42.456919767seconds Old Implementation - 40.198545251 seconds

Output

New Implementation (This PR)

[('23 September 63', datetime.datetime(2063, 9, 23, 0, 0)), ('19 August AD 14', datetime.datetime(2014, 8, 19, 0, 0)), ('the "Year the Four', datetime.datetime(2063, 4, 23, 0, 0))]

Old Implementation

[('23 September 63', datetime.datetime(2063, 9, 23, 0, 0)), ('19 August AD 14', datetime.datetime(2014, 8, 19, 0, 0)), ('a', datetime.datetime(2014, 1, 19, 0, 0)), ('4] The', datetime.datetime(2014, 4, 19, 0, 0)), ('42', datetime.datetime(2042, 4, 19, 0, 0)), ('of 75', datetime.datetime(1975, 4, 19, 0, 0))]

Code

article = """

Caesar Augustus (23 September 63 BC – 19 August AD 14), also known as Octavian (Latin: Octavianus) when referring to his early career, was the first Roman emperor, reigning from 27 BC until his death in AD 14.[a] His status as the founder of the Roman Principate (the first phase of the Roman Empire) has consolidated a legacy as one of the most effective leaders in human history.[4] The reign of Augustus initiated an era of relative peace known as the Pax Romana. The Roman world was largely free from large-scale conflict for more than two centuries, despite continuous wars of imperial expansion on the Empire's frontiers and the year-long civil war known as the "Year of the Four Emperors" over the imperial succession.
Originally named Gaius Octavius, he was born into an old and wealthy equestrian branch of the plebeian gens Octavia. His maternal great-uncle Julius Caesar was assassinated in 44 BC and Octavius was named in Caesar's will as his adopted son and heir; as a result, he inherited Caesar's name, estate, and the loyalty of his legions. He, Mark Antony and Marcus Lepidus formed the Second Triumvirate to defeat the assassins of Caesar. Following their victory at the Battle of Philippi (42 BC), the Triumvirate divided the Roman Republic among themselves and ruled as de facto dictators. The Triumvirate was eventually torn apart by the competing ambitions of its members; Lepidus was exiled in 36 BC and Antony was defeated by Octavian at the Battle of Actium in 31 BC.
After the demise of the Second Triumvirate, Augustus restored the outward façade of the free Republic, with governmental power vested in the Roman Senate, the executive magistrates and the legislative assemblies, yet maintained autocratic authority by having the Senate grant him lifetime tenure as supreme military command, tribune and censor. A similar ambiguity is seen in his chosen names, the implied rejection of monarchical titles whereby he called himself Princeps Civitatis (First Citizen) juxtaposed with his adoption of the ancient title Augustus.
Augustus dramatically enlarged the Empire, annexing Egypt, Dalmatia, Pannonia, Noricum and Raetia, expanding possessions in Africa, and completing the conquest of Hispania, but suffered a major setback in Germania. Beyond the frontiers, he secured the Empire with a buffer region of client states and made peace with the Parthian Empire through diplomacy. He reformed the Roman system of taxation, developed networks of roads with an official courier system, established a standing army, established the Praetorian Guard, official police and fire-fighting services for Rome, and rebuilt much of the city during his reign. Augustus died in AD 14 at the age of 75, probably from natural causes. Persistent rumors, substantiated somewhat by deaths in the imperial family, have claimed his wife Livia poisoned him. He was succeeded as emperor by his adopted son Tiberius, Livia's son and also former husband of Augustus' only biological daughter Julia.
 """ * 100

import time
start = time.process_time()

search_dates(article)

print(time.process_time() - start)

@noviluni @lopuhin @kishan3

gavishpoddar avatar Aug 16 '21 21:08 gavishpoddar

Please use this link to see changes to tests.

Thanks, that looks acceptable to me, given amount of things this fixes.

Speed

@gavishpoddar thanks, that's interesting - didn't expect it to be faster on long inputs. Would you mind also testing on other inputs, including shorter ones? But given results so far, I'd say it makes sense to remove the old implementation.

lopuhin avatar Aug 18 '21 10:08 lopuhin

Hi @gavishpoddar , could you do a similar test you did but using another language like French, Hindi, Spanish, or whatever, just to see the differences? Thanks!

noviluni avatar Aug 18 '21 18:08 noviluni

New speed test

New Implementation (This PR) - 44.228506362999994 seconds Old Implementation - 44.339635815 seconds

New speed test with the Spanish language

Note: Only one date object was found in both the new and old search_dates

New Implementation (This PR) - 0.450311745 seconds Old Implementation - 0.5171154470000001 seconds

gavishpoddar avatar Aug 19 '21 16:08 gavishpoddar

Hi @lopuhin, @noviluni, @kishan3

Please suggest should I go ahead and replace the search_dates, some more tests are required or keeping both the implementation.

gavishpoddar avatar Aug 21 '21 13:08 gavishpoddar

Replaced previous search_dates with the new implimentaion

gavishpoddar avatar Aug 24 '21 14:08 gavishpoddar

Hey, this PR is updated with the #932

gavishpoddar avatar Sep 07 '21 16:09 gavishpoddar

Hi @gavishpoddar,

I found your PR to improve the behavior of search_dates. There has been no further progress for almost a year. What is the current status of this PR?

Is there anything I can do to help get this PR merged?

Eckii24 avatar Aug 06 '22 14:08 Eckii24