dateparser not able to parse things like next tuesday.

Open saroup opened this issue 4 years ago • 19 comments

It parses "Tuesday" as the Tuesday of the current week, but when the input is "next Tuesday" it returns None.

saroup avatar Oct 09 '19 14:10 saroup

I'm having the same issue except it's returning Tuesday of the previous week:

>>> parse('now').strftime('%a %Y-%m-%d')
'Mon 2019-10-21'
>>> parse('tuesday').strftime('%a %Y-%m-%d')
'Tue 2019-10-15'
>>> parse('next tuesday').strftime('%a %Y-%m-%d')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'strftime'

tannercollin avatar Oct 21 '19 21:10 tannercollin

Hi, I am Gargi Vyas. I am a GSoC 2020 candidate and would like to work on this bug.

GargiVyas31 avatar Mar 02 '20 03:03 GargiVyas31

That'd be awesome, Gargi. Thanks!

tannercollin avatar Mar 02 '20 06:03 tannercollin

@Gallaecio, @noviluni I have looked through the code and I think I understand the gist of it at this point. What would be the recommended way to tackle this? I would appreciate some suggestions to get started. A separate function in FreshnessDateDataParser, maybe?

aditya-hari avatar Mar 11 '20 14:03 aditya-hari

@aditya-hari Go ahead and propose an approach in a pull request. It’s easier to discuss over code :slightly_smiling_face:

Gallaecio avatar Mar 12 '20 18:03 Gallaecio

@Gallaecio I haven't really come up with anything concrete in code yet, so I can't open a pull request.

Strings like 'next tuesday' aren't identified with any locale, so some changes have to be made in the locale info to sort that out. I am not entirely sure how to do that, though.

I thought about just changing the date_string to something standard like "in x days/months" but that will obviously only work for English if implemented in that way.
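
For illustration, the English-only rewriting idea could look roughly like the sketch below. The function name `rewrite_next_weekday` and the `WEEKDAYS` list are hypothetical and not part of dateparser, and it assumes "next <weekday>" means the first such weekday strictly after the base date, which is only one possible reading.

```python
import re
from datetime import date

# English weekday names, indexed like date.weekday() (Monday == 0).
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
_NEXT_WEEKDAY = re.compile(
    r"\bnext\s+(" + "|".join(WEEKDAYS) + r")\b", re.IGNORECASE
)


def rewrite_next_weekday(date_string, base):
    """Replace 'next <weekday>' with 'in N day(s)' relative to `base`.

    Hypothetical helper: English only, and 'next' is assumed to mean the
    first matching weekday strictly after `base`.
    """
    def _replace(match):
        target = WEEKDAYS.index(match.group(1).lower())
        # Number of days (1..7) until the first `target` weekday after base.
        days = (target - base.weekday() - 1) % 7 + 1
        return "in {} day{}".format(days, "" if days == 1 else "s")

    return _NEXT_WEEKDAY.sub(_replace, date_string)


# 2019-10-21 was a Monday, so the following Tuesday is one day away.
print(rewrite_next_weekday("next tuesday", date(2019, 10, 21)))
```

As noted above, this only helps for English input; other languages would need their own word lists.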

aditya-hari avatar Mar 14 '20 07:03 aditya-hari

@aditya-hari I suggest you start from FreshnessDateDataParser.parse, go through what the code does keeping the target strings in mind (e.g. “Next Tuesday”), and make the required changes as you go. I see, for example, that ago and in are hardcoded in some parts; I guess you will need to add next there.

You could add a test for “Next Tuesday”, extend FreshnessDateDataParser.parse as needed until it is parsed successfully, and then make sure no other tests are broken after your changes.

Gallaecio avatar Mar 14 '20 10:03 Gallaecio

@Gallaecio

Sorry it is taking me this long. I have something sort of working and will hopefully open a PR soon. However, the way I am doing this won't be able to handle the "after 15 days" situation mentioned in #635.

aditya-hari avatar Mar 15 '20 18:03 aditya-hari

Not a problem. It’s OK to just fix “Next ” for now, we can improve things later with additional, separate changes.

Gallaecio avatar Mar 16 '20 18:03 Gallaecio

There are a lot of time-related translations available in the unicode-cldr xml or json files that could definitely be used to augment dateparser.py so that it handles all sorts of variations like 'Next Tuesday'. Of course, I'd also like to see something that covers 'Next Weekend' or 'on the weekend'... but it doesn't look like that's been defined as yet.

Anyway, what would it take to pull in the cldr datefields for each language and incorporate them?

https://github.com/unicode-cldr/cldr-dates-full/blob/master/main/ru/dateFields.json

jgtimestuff avatar Mar 20 '20 10:03 jgtimestuff

Okay, I just saw this entry from 2 years ago: "cldr_language_data | move data directory | 2 years ago". So, is it just that the cldr_language_data needs updating to include more variations of 'next'?

My apologies... it seems there is a script to do just this already in the code: #487 CLDR script update - https://github.com/scrapinghub/dateparser/compare/cldr-script-update

Is the current dictionary up to date or is it just that the existing code isn't calling things like 'next' that already exist in the code?

jgtimestuff avatar Mar 20 '20 10:03 jgtimestuff

In freshness_date_parser, I think we need to add something from calendar to get the right day of the week?

    td = relativedelta(**kwargs)
	

For 'next' + day of week, the relativedelta arguments need to add a day first and then check the calendar for the next matching weekday?

>>> today = datetime.datetime.now()  # happens to be a Friday
>>> today + rld.relativedelta(weekday=calendar.FRIDAY)
datetime.datetime(2020, 3, 20, 8, 55, 7, 615746)  [today, instead of next Friday]

so, we have to add a day to today, then look for next Friday:

>>> today + rld.relativedelta(days=+1)
datetime.datetime(2020, 3, 21, 8, 55, 7, 615746)
>>> today = today + rld.relativedelta(days=+1)
>>> today + rld.relativedelta(weekday=calendar.TUESDAY)
datetime.datetime(2020, 3, 24, 8, 55, 7, 615746)  [ Next Tuesday ]
>>> today + rld.relativedelta(weekday=calendar.FRIDAY)
datetime.datetime(2020, 3, 27, 8, 55, 7, 615746)  [ Next Friday ]
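
The same "add a day, then roll forward to the weekday" idea can be sketched with only the standard library, which may make the arithmetic easier to follow. The function name `next_weekday` is hypothetical and not part of dateparser or dateutil; the date is the Friday from the session above.

```python
import calendar
from datetime import datetime, timedelta


def next_weekday(base, weekday):
    """Return the first `weekday` strictly after `base` (hypothetical helper).

    `weekday` uses the calendar module's numbering (calendar.MONDAY == 0).
    """
    # Step to tomorrow first so that "next Friday" on a Friday is a week away.
    start = base + timedelta(days=1)
    # Roll forward 0..6 days until the weekday matches.
    days_ahead = (weekday - start.weekday()) % 7
    return start + timedelta(days=days_ahead)


today = datetime(2020, 3, 20, 8, 55, 7)  # a Friday
print(next_weekday(today, calendar.FRIDAY))   # 2020-03-27 08:55:07
print(next_weekday(today, calendar.TUESDAY))  # 2020-03-24 08:55:07
```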

jgtimestuff avatar Mar 20 '20 14:03 jgtimestuff

https://dateparser.readthedocs.io/en/latest/contributing.html#guidelines-for-editing-translation-data might shed some light

Gallaecio avatar Mar 20 '20 14:03 Gallaecio

Thanks. After reading through that link, it seems to be about extending linguistic terms beyond what is provided by the CLDR json files. It looks to me like the json files from CLDR were last imported to dateparser in 2018, and they seem to have far fewer options for relative terms (in English as well as all other languages) than what is currently available. This might aid in fixing the 'next weekday' issue...

Although, supplementing that data with 'weekend' would definitely fall under extending the terms, as the files don't seem to cover terms like 'weekend' or perhaps even 'fortnight' as used by Aussies etc...

jgtimestuff avatar Mar 23 '20 12:03 jgtimestuff

Ahh wait, now I see that this script looks at the CLDR but only chooses a subset of the available relative terms to transition to dateparser.

https://github.com/scrapinghub/dateparser/blob/master/scripts/get_cldr_data.py

jgtimestuff avatar Mar 23 '20 13:03 jgtimestuff

So we might want to extend that subset as needed by changing the download script, and re-running.

Gallaecio avatar Apr 09 '20 17:04 Gallaecio

Hi everyone, thanks for looking at this issue. There have been no updates for more than a year. Is there any workaround we could use?

thinow avatar Sep 21 '21 11:09 thinow

my workaround is to load both dateparser and parsedatetime and use the latter when the former fails. :)
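
A minimal sketch of that fallback pattern, written generically so it does not depend on either library being installed. The helper `parse_with_fallback` is a hypothetical name; the commented wiring at the bottom assumes `dateparser.parse` returns None on failure and that `parsedatetime.Calendar().parseDT` returns a `(datetime, status)` pair with a zero status on failure.

```python
def parse_with_fallback(text, parsers):
    """Try each parser in order and return the first non-None result.

    Hypothetical helper: each parser is a callable taking a string and
    returning a datetime, or None when it cannot parse the input.
    """
    for parser in parsers:
        result = parser(text)
        if result is not None:
            return result
    return None


# Possible wiring, assuming both libraries are installed:
#
#     import dateparser
#     import parsedatetime
#
#     _cal = parsedatetime.Calendar()
#
#     def parse_pdt(text):
#         dt, status = _cal.parseDT(text)
#         return dt if status else None
#
#     parse_with_fallback("next tuesday", [dateparser.parse, parse_pdt])
```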

anarcat avatar Mar 24 '22 15:03 anarcat

This is my workaround:

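import re
from datetime import datetime, timedelta

# `time` is the input string and `daystr` is a date-formatting helper;
# both are defined elsewhere and are not shown here.
days_long = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']
for day in days_long:
    print('trying to find:', day)
    if day in time:
        print('found', day)
        # Count forward from tomorrow until the weekday name matches.
        delta = 1
        while day not in (datetime.now() + timedelta(days=delta)).strftime('%A').lower():
            delta += 1
            print('delta:', delta)
            if delta > 14:  # just to make sure
                raise RuntimeError('weekday not found within 14 days')
        if re.findall(r'\d|noon|midnight', time):
            # The string contains a digit, 'noon' or 'midnight': use an ISO date.
            date = (datetime.now() + timedelta(days=delta)).strftime('%Y-%m-%d')
        else:
            date = daystr((datetime.now() + timedelta(days=delta)))
        print('date:', date)
        # Swap the weekday name for the computed date and drop 'next'.
        time = time.replace(day, date).replace('next', '')
        print('time:', time)
        break  # only first match
    else:
        print('not found')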

It's janky, but it works.

tannercollin avatar Apr 04 '22 02:04 tannercollin