dateparser icon indicating copy to clipboard operation
dateparser copied to clipboard

Unneeded patterns/rules influence the result of the parsing

Open robin-xyzt-ai opened this issue 2 years ago • 0 comments

This might be an issue with retree rather than with the dateparser though.

The following test (which you cannot execute via the public API) fails:

    @Test
    public void parserWithLimitedPatterns(){
        List<String> rules = Arrays.asList(
          "(?<year>\\d{4})\\W{1}(?<month>\\d{1,2})\\W{1}(?<day>\\d{1,2})[^\\d]?",
          "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?",
          " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)"
        );

        DateParser dateParser = new DateParser(rules, new HashSet<>(rules), Collections.emptyMap(), true, false);
        String input = "2022-08-09 19:04:31.600000+00:00";
        Date date = dateParser.parseDate(input);
        assertEquals(parser.parseDate(input), date);
    }

Note how those 3 rules should be sufficient to parse the date.

  • There is a rule for the year-month-day part
  • There is a rule for the hours:minutes:seconds.ns part
  • There is a rule for the zone offset part

However, during parsing the zoneoffset rule is never used. Instead, it uses the rule for the hours twice.

The weird thing is that when I add a rule that should not be used (`" ?(?\d{4})$"), the test suddenly succeeds:

    @Test
    public void parserWithLimitedPatterns(){
        List<String> rules = Arrays.asList(
          "(?<year>\\d{4})\\W{1}(?<month>\\d{1,2})\\W{1}(?<day>\\d{1,2})[^\\d]?",
          " ?(?<year>\\\\d{4})$",
          "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?",
          " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)"
        );

        DateParser dateParser = new DateParser(rules, new HashSet<>(rules), Collections.emptyMap(), true, false);
        String input = "2022-08-09 19:04:31.600000+00:00";
        Date date = dateParser.parseDate(input);
        assertEquals(parser.parseDate(input), date);
    }

The position where I add that additional rule is important. For example adding it at the end of the list instead of at index 1 makes the test fail again.

I bumped into this issue for PR https://github.com/sisyphsu/dateparser/pull/28 , where I try to reduce the number of rules that are used for parsing to improve the performance.

robin-xyzt-ai avatar Feb 17 '23 17:02 robin-xyzt-ai