dateparser
False positives when searching dates
OS: Windows 10.0.17763.805
dateparser version: 0.7.2
When using the search_dates() function, some combinations of digits and punctuation marks that don't resemble any date format I've ever seen get picked up as dates.
To reproduce, run the code below, replacing <false positive> with any one of the following:
- 8: 100M2_
- 100M
- 10,00 2
- 19,60 5
- 73 20
from dateparser.search import search_dates
search_dates("The following isn't a correct date <false positive>")
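To check all five strings in one run instead of substituting them by hand, a small loop along these lines may help. This is only a sketch: `CANDIDATES`, `TEMPLATE` and `collect_matches` are names made up for illustration, and the search function is passed in as an argument so the helper itself does not depend on any particular dateparser version. It assumes that `search_dates` returns `None` when nothing is found, which is why missing results are normalised to an empty list.

```python
# Hypothetical helper for this report; not part of dateparser.
CANDIDATES = ["8: 100M2_", "100M", "10,00 2", "19,60 5", "73 20"]
TEMPLATE = "The following isn't a correct date {}"


def collect_matches(candidates, search):
    """Run `search` on each candidate sentence and record what it returns.

    `search` is expected to behave like dateparser's search_dates: it takes a
    string and returns a list of (matched_text, datetime) tuples, or None.
    """
    report = {}
    for candidate in candidates:
        found = search(TEMPLATE.format(candidate))
        # Normalise "nothing found" (None) to an empty list.
        report[candidate] = found or []
    return report
```

With `search_dates` imported as in the snippet above, `collect_matches(CANDIDATES, search_dates)` gives a dictionary keyed by candidate string. Any non-empty value is a false positive in the sense of this report.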
Same here on OSX 10.15 with version 0.7.2
Here are some examples of results that should not be dates:
search_dates(text, languages=['en'], settings={'STRICT_PARSING': True, 'PREFER_DATES_FROM': 'past', 'DATE_ORDER': 'DMY'}, add_detected_language=True)
-- Clearly wrong:
('32° 34’S', datetime.datetime(2013, 10, 16, 23, 59, 7), 'en')
('123°', datetime.datetime(1900, 1, 1, 1, 2, 3), 'en')
('6005', datetime.datetime(2000, 6, 5, 0, 0), 'en')
('000', datetime.datetime(1900, 1, 1, 0, 0), 'en')
('of 629', datetime.datetime(1900, 1, 1, 6, 2, 9), 'en')
('>21', datetime.datetime(1900, 1, 1, 2, 1), 'en')
-- I can kind of see where it is getting this, but I think it is wrong to do it:
('3533', datetime.datetime(2033, 5, 3, 0, 0), 'en')
-- I have lots of numbers in these docs. It should not pick them up and 'make' a date from them:
('538400', datetime.datetime(8400, 3, 5, 0, 0), 'en')
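Until this is fixed in the library, one possible stopgap is to post-filter the tuples that search_dates() returns and keep only those whose matched text has some recognisable date shape. The sketch below is an assumption-laden heuristic, not anything dateparser provides: `looks_like_date` and `drop_unlikely_matches` are hypothetical names, and the filter only accepts English month names or three numeric groups separated by `/`, `.` or `-`. It will therefore also discard legitimate matches such as relative dates ("yesterday") or bare years, so it only suits documents like the ones described here, which are full of plain numbers.

```python
import re

# English month names and their usual abbreviations (an assumption: the
# documents are in English, matching languages=['en'] in the call above).
MONTH_RE = re.compile(
    r"\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|"
    r"aug(ust)?|sep(t(ember)?)?|oct(ober)?|nov(ember)?|dec(ember)?)\b",
    re.IGNORECASE,
)
# Three numeric groups joined by /, . or -, e.g. 12/05/2019 or 2019-05-12.
NUMERIC_DATE_RE = re.compile(r"\b\d{1,4}[/.-]\d{1,2}[/.-]\d{1,4}\b")


def looks_like_date(fragment):
    """Return True if the matched text contains a month name or a d/m/y-style pattern."""
    return bool(MONTH_RE.search(fragment) or NUMERIC_DATE_RE.search(fragment))


def drop_unlikely_matches(results):
    """Filter search_dates-style tuples, keeping those whose first item looks like a date.

    Accepts None (no matches) and returns an empty list in that case.
    """
    return [item for item in (results or []) if looks_like_date(item[0])]
```

Applied to the output above, every listed tuple ('32° 34’S', '6005', 'of 629', '538400' and so on) would be dropped, while a match such as '12/05/2019' or '5 March 2019' would be kept.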
FYI, some of these cases will be fixed in the next version, once this is merged: https://github.com/scrapinghub/dateparser/pull/786
It seems #786 has been merged. Can you please close this issue?
Does https://github.com/scrapinghub/dateparser/pull/786 fix all cases reported here? Otherwise, it makes sense to keep this open.