dateparser
dateparser copied to clipboard
WIP: Reimplementing `search_dates`
Reimplementing and simplifying search_dates
A reimplemented and simplified search_dates
which more directly uses dateparser.parse
, improves accuracy and fixes many bugs
New Feature:
-
search_first_date
- searches and returns the first date from the given text.
NOTE: This PR is inspired by the previous implementation of search_dates and #931.
TODO
- [x] Gather feedback on the API and it's functions.
- [x] Write docs.
- [x] Write tests.
- [x] Chack search translation joints.
- [x] Fix
DATE_ORDER
This PR will also solve many issues with the search_dates
.
- #924
- #894
- #918
- #521
- #582
- #681
- #774
- #706 (Correct
datetime
object but wrong string) (related issue 2) - #856 (Correct
datetime
object but wrong string) (related issue 2)
Somewhat address.
- #507 (Now returns date object but not range)
- #689 (Now returnes datetime.datetime(2019, 2, 10, 0, 0) but the issue is not clear)
- #695
- #787 (Now returnes
datetime
object but the issue is not clear)
Some other issue fixed:
2021-08-04T14:21:37+05:30
would parse 05:30
only after this PR
2021-08-04T14:21:37
and 05:30
NOTE: THIS LIST IS NO LONGER ACCURATE ANYMORE
Hi, I need a suggestion should I use translated chunks or the original chunks to further parse the data objects. I have currently used translated chunks instead of original chunks as this increased accuracy in some basic tests.
Thanks, and please suggest.
Hi, I need a suggestion should I use translated chunks or the original chunks to further parse the data objects. I have currently used translated chunks instead of original chunks as this increased accuracy in some basic tests.
Thanks, and please suggest.
Hi @gavishpoddar, using translated chunks
makes sense. :+1:
Codecov Report
Base: 98.23% // Head: 98.10% // Decreases project coverage by -0.12%
:warning:
Coverage data is based on head (
99e66c6
) compared to base (0f5d9c5
). Patch coverage: 96.12% of modified lines in pull request are covered.
:exclamation: Current head 99e66c6 differs from pull request most recent head 2935aae. Consider uploading reports for the commit 2935aae to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## master #945 +/- ##
==========================================
- Coverage 98.23% 98.10% -0.13%
==========================================
Files 232 235 +3
Lines 2604 2692 +88
==========================================
+ Hits 2558 2641 +83
- Misses 46 51 +5
Impacted Files | Coverage Δ | |
---|---|---|
dateparser/search/search.py | 94.73% <94.33%> (-4.64%) |
:arrow_down: |
dateparser/search/__init__.py | 100.00% <100.00%> (ø) |
|
dateparser/search/languages.py | 100.00% <100.00%> (ø) |
|
dateparser/date_parser.py | 93.75% <0.00%> (-0.19%) |
:arrow_down: |
dateparser/date.py | 99.24% <0.00%> (-0.04%) |
:arrow_down: |
dateparser/parser.py | 99.01% <0.00%> (-0.01%) |
:arrow_down: |
dateparser/utils/__init__.py | 95.27% <0.00%> (ø) |
|
dateparser/languages/locale.py | 98.69% <0.00%> (ø) |
|
dateparser/conf.py | 100.00% <0.00%> (ø) |
|
dateparser/__init__.py | 100.00% <0.00%> (ø) |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
Hi @noviluni, I have made the changes for tests to work. Can you please approve the workflow approval?
@gavishpoddar workflows approved 👍
Hi @noviluni, @lopuhin, @kishan3
The new search_dates
implementation is nearing its completion.
Code reviews would be great to receive.
I need help with one test
StaticTzInfo 'UTC'
is expected.
UTC
is received from new search_dates
Hi after some changes the codes are very much compatible with the old search_dates
. Some tests are modified as the working of the search_dates
has been changed.
Currently, this PR is left with some docs changes (just replacing the current with new docs and search_first_date
).
Some tests are modified as the working of the search_dates has been changed.
What are the differences?
Also I can see that right now both implementations are present, do we plan to have them both, or to remove the old?
What is the performance of the new approach compared to the old one?
Please use this link to see changes to tests.
Speed
New Implementation (This PR) - 44.380413369 seconds Old Implementation - 44.226806061 seconds
After removing dates
New Implementation (This PR) - 42.456919767seconds Old Implementation - 40.198545251 seconds
Output
New Implementation (This PR)
[('23 September 63', datetime.datetime(2063, 9, 23, 0, 0)), ('19 August AD 14', datetime.datetime(2014, 8, 19, 0, 0)), ('the "Year the Four', datetime.datetime(2063, 4, 23, 0, 0))]
Old Implementation
[('23 September 63', datetime.datetime(2063, 9, 23, 0, 0)), ('19 August AD 14', datetime.datetime(2014, 8, 19, 0, 0)), ('a', datetime.datetime(2014, 1, 19, 0, 0)), ('4] The', datetime.datetime(2014, 4, 19, 0, 0)), ('42', datetime.datetime(2042, 4, 19, 0, 0)), ('of 75', datetime.datetime(1975, 4, 19, 0, 0))]
Code
article = """
Caesar Augustus (23 September 63 BC – 19 August AD 14), also known as Octavian (Latin: Octavianus) when referring to his early career, was the first Roman emperor, reigning from 27 BC until his death in AD 14.[a] His status as the founder of the Roman Principate (the first phase of the Roman Empire) has consolidated a legacy as one of the most effective leaders in human history.[4] The reign of Augustus initiated an era of relative peace known as the Pax Romana. The Roman world was largely free from large-scale conflict for more than two centuries, despite continuous wars of imperial expansion on the Empire's frontiers and the year-long civil war known as the "Year of the Four Emperors" over the imperial succession.
Originally named Gaius Octavius, he was born into an old and wealthy equestrian branch of the plebeian gens Octavia. His maternal great-uncle Julius Caesar was assassinated in 44 BC and Octavius was named in Caesar's will as his adopted son and heir; as a result, he inherited Caesar's name, estate, and the loyalty of his legions. He, Mark Antony and Marcus Lepidus formed the Second Triumvirate to defeat the assassins of Caesar. Following their victory at the Battle of Philippi (42 BC), the Triumvirate divided the Roman Republic among themselves and ruled as de facto dictators. The Triumvirate was eventually torn apart by the competing ambitions of its members; Lepidus was exiled in 36 BC and Antony was defeated by Octavian at the Battle of Actium in 31 BC.
After the demise of the Second Triumvirate, Augustus restored the outward façade of the free Republic, with governmental power vested in the Roman Senate, the executive magistrates and the legislative assemblies, yet maintained autocratic authority by having the Senate grant him lifetime tenure as supreme military command, tribune and censor. A similar ambiguity is seen in his chosen names, the implied rejection of monarchical titles whereby he called himself Princeps Civitatis (First Citizen) juxtaposed with his adoption of the ancient title Augustus.
Augustus dramatically enlarged the Empire, annexing Egypt, Dalmatia, Pannonia, Noricum and Raetia, expanding possessions in Africa, and completing the conquest of Hispania, but suffered a major setback in Germania. Beyond the frontiers, he secured the Empire with a buffer region of client states and made peace with the Parthian Empire through diplomacy. He reformed the Roman system of taxation, developed networks of roads with an official courier system, established a standing army, established the Praetorian Guard, official police and fire-fighting services for Rome, and rebuilt much of the city during his reign. Augustus died in AD 14 at the age of 75, probably from natural causes. Persistent rumors, substantiated somewhat by deaths in the imperial family, have claimed his wife Livia poisoned him. He was succeeded as emperor by his adopted son Tiberius, Livia's son and also former husband of Augustus' only biological daughter Julia.
""" * 100
import time
start = time.process_time()
search_dates(article)
print(time.process_time() - start)
@noviluni @lopuhin @kishan3
Please use this link to see changes to tests.
Thanks, that looks acceptable to me, given amount of things this fixes.
Speed
@gavishpoddar thanks, that's interesting - didn't expect it to be faster on long inputs. Would you mind also testing on other inputs, including shorter ones? But given results so far, I'd say it makes sense to remove the old implementation.
Hi @gavishpoddar , could you do a similar test you did but using another language like French, Hindi, Spanish, or whatever, just to see the differences? Thanks!
New speed test
New Implementation (This PR) - 44.228506362999994 seconds Old Implementation - 44.339635815 seconds
New speed test with the Spanish language
Note: Only one date object was found in both the new and old search_dates
New Implementation (This PR) - 0.450311745 seconds Old Implementation - 0.5171154470000001 seconds
Hi @lopuhin, @noviluni, @kishan3
Please suggest should I go ahead and replace the search_dates
, some more tests are required or keeping both the implementation.
Replaced previous search_dates
with the new implimentaion
Hey, this PR is updated with the #932
Hi @gavishpoddar,
I found your PR to improve the behavior of search_dates
.
There has been no further progress for almost a year. What is the current status of this PR?
Is there anything I can do to help get this PR merged?