
Improve input date parsing

Open jmaupetit opened this issue 9 years ago • 18 comments

When Watson fails to parse an input date, we must raise a clear error message.
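
A minimal sketch of what a clear failure could look like, using only the standard library. The function name `parse_input_date` and the `ACCEPTED_FORMATS` tuple are hypothetical and are not taken from Watson's code base; the point is only that the error names the offending input and the formats that would have worked.

```python
from datetime import datetime

# Hypothetical list of accepted formats; Watson's real parser may differ.
ACCEPTED_FORMATS = ("%Y-%m-%d %H:%M", "%Y-%m-%d")


def parse_input_date(text):
    """Return a datetime, or raise ValueError with a readable message."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # This format did not match; try the next one.
            continue
    raise ValueError(
        "Could not parse date {!r}. "
        "Expected YYYY-MM-DD or YYYY-MM-DD HH:mm.".format(text)
    )
```

A CLI wrapper would catch the `ValueError` and print its message instead of showing a traceback.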

jmaupetit avatar Mar 30 '15 09:03 jmaupetit

Actually I don't think that our current format (YYYY-MM-DD, which is arrow's default one) is very relevant. Maybe we should think about something more flexible or natural?

k4nar avatar Sep 15 '15 17:09 k4nar

Maybe we should think about something more flexible or natural?

Any examples maybe?

willdurand avatar Sep 15 '15 17:09 willdurand

For example, git is pretty versatile: it accepts things like "yesterday", "1 week ago" or "15/04/2015" for git log --since. I can't find the reference anywhere though.

k4nar avatar Sep 15 '15 18:09 k4nar

parsedatetime might be a good option (thx @jmaupetit).

k4nar avatar Sep 16 '15 09:09 k4nar

While parsedatetime is great, it's not a perfect fit for Watson. For example it seems to always be expecting MM/DD dates instead of DD/MM. Also it guesses dates in the future (monday is equivalent to next monday, not last monday) but this is not what's expected in the context of Watson.

Right now our system is more restrictive and cumbersome, but at least it's not bug prone.

FTR we must set YearParseStyle to 0 in order to parse Jan 1st as the current year and not the next.
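
To illustrate the "last monday, not next monday" expectation, here is a standard-library sketch that resolves a weekday name to its most recent occurrence. It does not use parsedatetime; `last_weekday` and `WEEKDAYS` are hypothetical names, and treating "today" as a valid match is an assumption.

```python
from datetime import datetime, timedelta

WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday")


def last_weekday(name, now):
    """Return midnight of the most recent `name`, never a future date.

    Assumption: if today is already that weekday, today is returned.
    """
    target = WEEKDAYS.index(name.strip().lower())
    # The modulo keeps the offset in the range 0..6, so we only look back.
    days_back = (now.weekday() - target) % 7
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=days_back)
```

Passing `now` explicitly keeps the function easy to test and avoids surprises around midnight.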

k4nar avatar Sep 17 '15 13:09 k4nar

There's a new (natural language) date parsing library here:

https://github.com/scrapinghub/dateparser

(Haven't tested it yet.)

SpotlightKid avatar Nov 12 '15 20:11 SpotlightKid

:+1: let's give it a try!

jmaupetit avatar Nov 12 '15 20:11 jmaupetit

I experimented with dateparser a bit. I'm not so sure I like it. It promises more than it can keep. All of the inputs below are valid German date specifications. In German, for numbers up to 12, we generally use the word form, not the numeral, in a sentence. Also, "12 Uhr" normally means midday, not midnight; for midnight we use "Null Uhr" or "0 Uhr" to make it clear.

Not sure if it's worth adding such a heavy dependency when in effect it doesn't add much over dateutil.

In [1]: import dateparser

In [2]: dateparser.parse('Vor zwei stunden')

In [3]: dateparser.parse('Vor 2 stunden')
Out[3]: datetime.datetime(2015, 11, 14, 0, 4, 24, 713893)

In [4]: dateparser.parse('In 1 Tag')

In [5]: dateparser.parse('Morgen')

In [6]: dateparser.parse('morgen')

In [7]: dateparser.parse('morgen früh')

In [8]: dateparser.parse('morgen mittag')

In [9]: dateparser.parse('morgen mittag', languages=['de', 'en'])

In [10]: dateparser.parse('morgen 12 uhr', languages=['de', 'en'])

In [11]: dateparser.parse('morgen, 12 uhr', languages=['de', 'en'])

In [12]: dateparser.parse('gestern, 12 Uhr', languages=['de', 'en'])
Out[12]: datetime.datetime(2015, 11, 13, 0, 0)

In [13]: dateparser.parse('gestern, zwölf Uhr', languages=['de', 'en'])

In [14]: dateparser.parse('vorgestern, 12 Uhr', languages=['de', 'en'])
Out[14]: datetime.datetime(2015, 11, 12, 0, 0)

In [15]: dateparser.parse('vorgestern, null Uhr', languages=['de', 'en'])

In [16]: dateparser.parse('vorgestern, 0 Uhr', languages=['de', 'en'])
Out[16]: datetime.datetime(2015, 11, 12, 2, 4, 24, 713893)

SpotlightKid avatar Nov 14 '15 02:11 SpotlightKid

Thank you for your feedback. Shouldn't we focus on English first (and only English) for a CLI?

jmaupetit avatar Nov 16 '15 07:11 jmaupetit

I don't know if you have considered dateutil but I find it useful for fuzzy human date parsing. Example from the doc:

>>> from dateutil.parser import parse
>>> parse("Today is January 1, 2047 at 8:21:00AM", fuzzy_with_tokens=True)
(datetime.datetime(2047, 1, 1, 8, 21), (u'Today is ', u' ', u'at '))

yloiseau avatar Jan 24 '16 16:01 yloiseau

I'd second a switch to dateutil. It's used successfully in the gcalcli project.

Thanks! Scott

firecat53 avatar Mar 02 '16 16:03 firecat53

In my experience, dateutil.parser.parse(s, fuzzy=True) often guesses wrong. If we're going to use it, at the very least we should make options like dayfirst and yearfirst configurable. And it also has the problem of defaulting to dates in the future, e.g. parse("Monday", fuzzy=True) == datetime.datetime(2016, 3, 7, 0, 0).
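
As a sketch of what a configurable day-first option means in practice, here is a standard-library version. The parameter name `dayfirst` mirrors the dateutil option mentioned above, but this code does not call dateutil, and `parse_numeric_date` is a hypothetical helper.

```python
from datetime import datetime


def parse_numeric_date(text, dayfirst=False):
    """Parse 'A/B/YYYY', reading A as the day when dayfirst is True."""
    fmt = "%d/%m/%Y" if dayfirst else "%m/%d/%Y"
    # strptime raises ValueError when the text does not fit the format,
    # e.g. a "month" of 15 when dayfirst is False.
    return datetime.strptime(text, fmt)
```

The same string gives two different dates depending on the setting, which is why guessing silently is risky: `parse_numeric_date("09/01/2016")` is September 1st, while `parse_numeric_date("09/01/2016", dayfirst=True)` is January 9th.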

SpotlightKid avatar Mar 02 '16 17:03 SpotlightKid

To decide whether we should use dateutil or dateparser, I will crunch a test dataset and post the results here later.

jmaupetit avatar Sep 30 '16 16:09 jmaupetit

So, I wrote a quick and dirty script to compare dateutil vs dateparser parse methods:

#!/usr/bin/env python3
"""Compare (fuzzy) dateutil vs dateparser `parse` methods"""

import sys

from dateparser import parse as dp_parse
from datetime import datetime, timedelta
from dateutil.parser import parse as du_parse

NOW = datetime.now()
DP_SETTINGS = {
    'RELATIVE_BASE': NOW,
}
EXPECTED_DATETIME = datetime(year=2016, month=9, day=1)
DATASET = (
    # (query, expected)
    ('2016/09/01', EXPECTED_DATETIME),
    ('2016-09-01', EXPECTED_DATETIME),
    ('09/01/2016', EXPECTED_DATETIME),
    ('09-01-2016', EXPECTED_DATETIME),
    ('09012016', EXPECTED_DATETIME),
    ('09/01/2016 15:20', EXPECTED_DATETIME.replace(hour=15, minute=20)),
    ('09/01/2016 at 15h20', EXPECTED_DATETIME.replace(hour=15, minute=20)),
    ('15 min ago', NOW - timedelta(minutes=15)),
    ('two hours ago', NOW - timedelta(hours=2)),
    ('a day ago', NOW - timedelta(days=1)),
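    ('tuesday', (
        # Most recent Tuesday (today if it is Tuesday); the modulo keeps
        # the offset non-negative when the script runs on a Monday.
        NOW.replace(hour=0, minute=0, second=0, microsecond=0) -
        timedelta(days=(NOW.weekday() - 1) % 7))),
    ('monday at noon', (
        NOW.replace(hour=12, minute=0, second=0, microsecond=0) -
        timedelta(days=NOW.weekday()))),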
)


def is_equal(time1, time2):
    return time1 == time2


def parse(parser, query, expected, **options):
    try:
        result = parser(query, **options)
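    except Exception:
        # Any parser error counts as a miss; a bare `except:` would also
        # swallow KeyboardInterrupt and SystemExit.
        return 0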
    if result and is_equal(result, expected):
        return 1
    return 0


def bench(dataset):
    du_scores = []
    dp_scores = []
    template = '| {:25} | {:>10} | {:>10} |'
    separator = template.format('-' * 25, '-' * 10, '-' * 10)

    print(template.format('query', 'dateutil', 'dateparser'))
    print(separator)

    for query, expected in dataset:
        du_score = parse(du_parse, query, expected, fuzzy=True)
        dp_score = parse(dp_parse, query, expected, settings=DP_SETTINGS)
        du_scores.append(du_score)
        dp_scores.append(dp_score)

        print(template.format(query, du_score, dp_score))

    print(separator)
    print(template.format(
        'total ({})'.format(len(du_scores)),
        sum(du_scores),
        sum(dp_scores))
    )


def main():
    bench(DATASET)
    return 0


if __name__ == '__main__':
    sys.exit(main() or 0)

And here are the results:

| query                     |   dateutil | dateparser |
| ------------------------- | ---------- | ---------- |
| 2016/09/01                |          1 |          1 |
| 2016-09-01                |          1 |          1 |
| 09/01/2016                |          1 |          1 |
| 09-01-2016                |          1 |          1 |
| 09012016                  |          0 |          1 |
| 09/01/2016 15:20          |          1 |          1 |
| 09/01/2016 at 15h20       |          1 |          1 |
| 15 min ago                |          0 |          1 |
| two hours ago             |          0 |          1 |
| a day ago                 |          0 |          1 |
| tuesday                   |          0 |          1 |
| monday at noon            |          0 |          1 |
| ------------------------- | ---------- | ---------- |
| total (12)                |          6 |         12 |

If my test data set is representative of what we expect from Watson's date parser, my conclusion is that we must use dateparser. WDYT?

jmaupetit avatar Oct 06 '16 15:10 jmaupetit

My attitude would be to first hit the big easy ones with a quick lookup, then find a more thorough NLP/i18n approach.

i.e. today -> datetime.datetime.today().strftime('%Y-%m-%d')

The big easy:

  • today
  • yesterday
  • week
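
The quick lookup for the three shortcuts above could be sketched like this. `resolve_shortcut` is a hypothetical name, and reading "week" as "Monday of the current week" is an assumption, since the list does not say what it should resolve to.

```python
from datetime import datetime, timedelta


def resolve_shortcut(word, now):
    """Map a shortcut word to a datetime at midnight, or return None."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    shortcuts = {
        "today": midnight,
        "yesterday": midnight - timedelta(days=1),
        # Assumption: "week" means the Monday of the current week.
        "week": midnight - timedelta(days=now.weekday()),
    }
    # Unknown words return None so the caller can fall back to a
    # stricter or more thorough parser.
    return shortcuts.get(word.strip().lower())
```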

whilei avatar Feb 04 '17 10:02 whilei

Just found Watson and I'm trying it out instead of timewarrior. This is a big pain point for me right now. Adding or editing past items is quite annoying with the required YYYY-MM-DD HH:mm format.

@davidag Is what is discussed in this thread, e.g. shortcuts for today and yesterday, being addressed in #328?

jessebett avatar Oct 22 '19 16:10 jessebett

@jessebett Humanized dates are not supported in #328, but adding by time is (e.g. watson add -f 10:00 -t 11:00).
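
For readers wondering how a time-only value can become a full timestamp, here is a rough sketch. It assumes a bare `HH:MM` refers to a given reference day (for example, today); that is an assumption for illustration and not a description of what #328 implements, and `combine_time_with_date` is a hypothetical helper.

```python
from datetime import datetime


def combine_time_with_date(time_text, day):
    """Parse 'HH:MM' and attach it to the calendar date of `day`."""
    parsed = datetime.strptime(time_text, "%H:%M")
    # Keep year/month/day from `day`, take hour/minute from the input.
    return day.replace(hour=parsed.hour, minute=parsed.minute,
                       second=0, microsecond=0)
```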

I'd planned to improve date input, but Watson's development has stagnated a bit lately, so I'm looking for alternatives.

davidag avatar Oct 22 '19 18:10 davidag

Thank you for hacking on watson, I'm a daily user and find it really useful!

The ability to adjust the date format for the add command would make usability even better for me (German dates like 04.02.2022 feel most natural to me).

Is there anything I can do to help push this forward? I could test and I know how to read code, but I haven't written much Python code myself yet.

teutat3s avatar Feb 04 '22 11:02 teutat3s