chrono Parsing numeric literals can cause TooShort/Invalid

For a format/input string item that starts with a numeric literal, the various parse_from_str methods can return a ParseError error result of TooShort or Invalid if that literal comes directly after a numeric specifier/value with variable parsing width (PW=∞).

In other words, this problem can occur when, for a consecutive pair of format items, a and b, a is a specifier value of variable width and b is a literal that starts with a digit:

%s1 timestamp followed by literal digit
%Y1 year followed by literal digit

Of particular note is that this same sequence of (specifier, literal) manifests as errors of two different variants:

`TooShort` Example:

#[test]
fn test_timestamp() {
    use chrono::{DateTime, format::ParseErrorKind::TooShort};

    let literals = ['1', 'a'];
    for literal in literals {
        let input = format!("1{literal}");
        let fmt = format!("%s{literal}");

        let parsed = DateTime::parse_from_str(&input, &fmt);

        if literal.is_ascii_digit() {
            assert_eq!(parsed.unwrap_err().kind(), TooShort);
        } else {
            assert!(parsed.is_ok());
        }
    }
}

This same behavior exists for the other (PW=∞) specifiers, such as Year (e.g., NaiveDate::parse_from_str("01-01-+100001", "%m-%d-%Y1").

`Invalid` Example

Note that this case contains the same specifier-literal pair as above, but the error is different. Probably because the pair is not at the end of the strings.

#[test]
fn test_year2() {
    use chrono::{NaiveDate, format::ParseErrorKind::Invalid};

    let literals = ['1', 'a'];
    for literal in literals {
        let input = format!("+10000{literal}-01-01");
        let fmt = format!("%Y{literal}-%m-%d");

        let parsed = NaiveDate::parse_from_str(&input, &fmt);

        if literal.is_ascii_digit() {
            assert_eq!(parsed.unwrap_err().kind(), Invalid);
        } else {
            assert!(parsed.is_ok());
        }
    }
}

Cause?

I believe this has to do with the fact that scan::number is greedy against any next digit(s), even if that may consume a next literal. In general, the parse algorithm does not appear to backtrack to find another possible match location. This is an understandable compromise. Allowing for this case would essentially be implementing a regex engine.

I do concede that this kind of input is pathological and would rarely occur in practical usage. But the current implementation feels wrong that it can do this just depending on the "numericality" of a literal, at least without it being documented. And, while there are a few mentions that the parser is greedy, it seems to only consider the cases where the input is missing obvious text from its format. (On the other hand, maybe it's just ok to say the parser is greedy as we currently do, and call it a day. Not sure.)

Options

If this finding is valid, (and indeed, if we do not want to implement a regex-engine-style approach) here are a few ways we could handle it:

During input parsing, reject numerically-prefixed literals after variable-length specifiers (Year, IsoYear, and Timestamp) by returning Invalid or some new error variant (e.g. GreedyValue). We could determine this with minimal (?) overhead by something like:
```
// pseudocode
// after parsing a last item's text and before parsing the current one...
if last_item_parsed_with_variable_width && cur_item_is_literal && cur_item_literal_starts_with_number:
  return Err(GreedyValue);
```
Note that Year only exhibits this behavior when parsed with a leading +/-, so regular 4-digit years would never error this way.
The same as above, but during format parsing (i.e., StrftimeItems::new(fmt)). This would short-circuit on formats where the input could exhibit this behavior, but would be a false positive in the 4-digit Year case (no +/-), which we can only know once we start parsing the input.
Or, just call this out in the documentation: Update TooShort and Invalid's documentation to include this possibility, and maybe in some other high-level documentation places as well. This would be easiest and obviously cause no performance overhead, but maybe not as correct as we could be.

Versions

rustc: rustc 1.77.0-nightly (5518eaa94 2024-01-29)
chrono: 0.4.33

Feb 01 '24 19:02 t-mart

Wow, nice report here @t-mart !

I recommend providing a PR (assuming @pitdicker @djc are also interested and would like to see a PR first). Pick the option that you feel would be best to implement.

I recommend at least two commits to the PR to demonstrate the behavioral change

add any necessary new test cases (the test cases probably already exist somewhere in src/format/parse.rs)
code change + test case adjustments

Chrono maintainers may decide on a different option but tangible code is a great starting point for a longer discussion.

Overall, I agree that the errors from the various parse methods are broad. When implementing a parse function call, I often have to "throw spaghetti at the wall" until I get something sticking because the error offers little insight.

Feb 02 '24 22:02 jtmoon79

@t-mart Thank you for the report. We had an issue and closed PR related to the problem of some specifies accepting a variable number of digits without look-ahead: #560 and #561.

The main difficulty is that we support iterating over the format string while iterating over the input string. So we currently don't know the next formatting specifier while parsing some input.

I'll have a better look at this issue later, maybe tomorrow. Would you be willing to work on it?

Feb 03 '24 06:02 pitdicker

In general, the parse algorithm does not appear to backtrack to find another possible match location. This is an understandable compromise. Allowing for this case would essentially be implementing a regex engine.

The linked old PR shows a compromise. If backtracking can be limited to only one or a few functions that do the integer parsing I don't think it is a bad solution.

Feb 03 '24 06:02 pitdicker

The main difficulty is that we support iterating over the format string while iterating over the input string. So we currently don't know the next formatting specifier while parsing some input.

One option is to add some complexity to the strftime parser to read ahead when it encounters a variable-with formatting specifier. If it is followed by a fixed number of digits (either other formatting specifiers or plain digits), it stores this number of fixed digits as extra information in the formatting item. If a variable-with specifier is followed by another variable-with specifier it is an error.

Then the parser can use this information to leave some digits remaining when parsing a variable-with specifier.

Disadvantage is the double parsing of parts of the format string. And that doing so is unnecessary for the other use: formatting. But that may not turn out to be much of a problem in practice. The double parsing of a few characters in the format string would only happen in cases where we now fail to parse an input string at all.

Feb 03 '24 06:02 pitdicker

Thank you for your consideration and guidance, everyone. I intend to PR on this. Give me a few days to get to some free time.

Feb 05 '24 20:02 t-mart

@t-mart thanks for leading effort on this issue.

Can you please share if you had time to look into it?

vector.dev community uses chrono for datetime parsing and interest in this feature is growing with those cases: https://github.com/vectordotdev/vrl/issues/117 https://github.com/vectordotdev/vrl/issues/790 which is mentioned by vector.dev maintainer here: https://github.com/chronotope/chrono/issues/560 as you seem to be aware.

No pressure, just curios on whether slipped through the cracks or any help is needed. Thanks again.

Apr 12 '24 11:04 DimDroll

Can you please share if you had time to look into it?

Thanks for the update/reminder, @DimDroll.

Regrettably, I have not had time to start work here. Please jump in!

Apr 15 '24 12:04 t-mart

Fox expectations: this is something that is not an easy fix that can be done in a couple of hours of work (at least it is not easy for me :smile:). Not even sure if it can even be done in a backwards-compatible way.

But feel free to dive in and come up with a plan.

Apr 15 '24 13:04 pitdicker

chrono chrono copied to clipboard

Parsing numeric literals can cause TooShort/Invalid

TooShort Example:

Invalid Example

Cause?

Options

Versions

chrono
chrono copied to clipboard

`TooShort` Example:

`Invalid` Example