False positives when searching dates

Open McWillie opened this issue 4 years ago • 4 comments

OS: Windows 10.0.17763.805
dateparser version: 0.7.2

When using the search_dates() function, some combinations of digits and punctuation marks that don't resemble any date format I've ever seen get picked up as dates.

To reproduce, run the code below with <false positive> replaced by any one of these strings:

  • 8: 100M2_
  • 100M
  • 10,00 2
  • 19,60 5
  • 73 20

from dateparser.search import search_dates

search_dates("The following isn't a correct date <false positive>")
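To try all of the strings in one go, a small loop along these lines may help. This is only a sketch: `collect_false_positives`, `CANDIDATES` and `TEMPLATE` are hypothetical names that are not part of dateparser, and the search function is passed in as an argument so the helper does not depend on any particular dateparser version.

```python
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# The non-date strings listed in the report above.
CANDIDATES = ["8: 100M2_", "100M", "10,00 2", "19,60 5", "73 20"]

# Sentence used in the report, with a slot for the candidate string.
TEMPLATE = "The following isn't a correct date {}"

Match = Tuple[str, datetime]
SearchFn = Callable[[str], Optional[List[Match]]]


def collect_false_positives(
    search_fn: SearchFn, candidates: Iterable[str] = CANDIDATES
) -> Dict[str, List[Match]]:
    """Return {candidate: matches} for each candidate where search_fn found a date."""
    flagged = {}
    for candidate in candidates:
        matches = search_fn(TEMPLATE.format(candidate))
        # Treat both None and an empty list as "nothing found".
        if matches:
            flagged[candidate] = matches
    return flagged
```

Calling `collect_false_positives(search_dates)` with `search_dates` imported from `dateparser.search` should then list whichever of the strings are still detected as dates in the installed version.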

McWillie, Nov 01 '19 08:11

Same here on OSX 10.15 with version 0.7.2.

Here are some examples of results that should not have been detected as dates:

search_dates(text, languages=['en'], settings={'STRICT_PARSING': True, 'PREFER_DATES_FROM': 'past', 'DATE_ORDER': 'DMY'}, add_detected_language=True)

-- Clearly wrong
('32° 34’S', datetime.datetime(2013, 10, 16, 23, 59, 7), 'en')
('123°', datetime.datetime(1900, 1, 1, 1, 2, 3), 'en')
('6005', datetime.datetime(2000, 6, 5, 0, 0), 'en')
('000', datetime.datetime(1900, 1, 1, 0, 0), 'en')
('of 629', datetime.datetime(1900, 1, 1, 6, 2, 9), 'en')
('>21', datetime.datetime(1900, 1, 1, 2, 1), 'en')

-- I can kind of see where it is getting this, but I think it is wrong to do it
('3533', datetime.datetime(2033, 5, 3, 0, 0), 'en')

-- I have lots of numbers in these docs. It should not pick them up and 'make' a date from them
('538400', datetime.datetime(8400, 3, 5, 0, 0), 'en')
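Until this is fixed in the library, one possible stopgap is to post-filter the results. The sketch below is an assumption, not dateparser behaviour: `looks_like_date` and `drop_unlikely_dates` are hypothetical helpers, and the heuristic (keep a match only if it has a separator between two digits or a word of at least three letters) is deliberately crude and may discard some genuine dates.

```python
import re
from datetime import datetime

# Hypothetical heuristics, not part of dateparser.
# A separator directly between two digits, as in 2019-11-01 or 08:11.
_SEPARATOR = re.compile(r"\d[/\-.:]\d")
# A run of three or more letters, such as a month or weekday name.
_WORD = re.compile(r"[^\W\d_]{3,}")


def looks_like_date(matched_text: str) -> bool:
    """Return True if the matched substring has at least one date-like cue."""
    return bool(_SEPARATOR.search(matched_text) or _WORD.search(matched_text))


def drop_unlikely_dates(results):
    """Filter tuples shaped like search_dates() output; the first item is the matched text."""
    return [r for r in (results or []) if looks_like_date(r[0])]


sample = [
    ("538400", datetime(8400, 3, 5, 0, 0), "en"),        # reported above
    ("Nov 01 2019", datetime(2019, 11, 1, 0, 0), "en"),  # hypothetical genuine match
]
# Only the second tuple survives the filter.
print(drop_unlikely_dates(sample))
```

With this filter, every tuple listed above would be dropped, since none of the matched strings contains a digit-separator-digit sequence or a word of three or more letters.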

murray-minito, Nov 20 '19 05:11

FYI, some cases will be fixed in the next version (after merging this: https://github.com/scrapinghub/dateparser/pull/786).

noviluni, Sep 21 '20 13:09

It seems #786 has been merged. Can you please close this issue?

gavishpoddar, Apr 01 '21 09:04

Does https://github.com/scrapinghub/dateparser/pull/786 fix all cases reported here? Otherwise, it makes sense to keep this open.

Gallaecio, Apr 05 '21 10:04