chrono
chrono copied to clipboard
Parsing numeric literals can cause TooShort/Invalid
For a format/input string item that starts with a numeric literal, the various parse_from_str
methods can return a ParseError
error result of TooShort
or Invalid
if that literal comes directly after a numeric specifier/value with variable parsing width (PW=∞).
In other words, this problem can occur when, for a consecutive pair of format items, a
and b
, a
is a specifier value of variable width and b
is a literal that starts with a digit:
-
%s1
timestamp followed by literal digit -
%Y1
year followed by literal digit
Of particular note is that this same sequence of (specifier, literal) manifests as errors of two different variants:
TooShort
Example:
#[test]
fn test_timestamp() {
use chrono::{DateTime, format::ParseErrorKind::TooShort};
let literals = ['1', 'a'];
for literal in literals {
let input = format!("1{literal}");
let fmt = format!("%s{literal}");
let parsed = DateTime::parse_from_str(&input, &fmt);
if literal.is_ascii_digit() {
assert_eq!(parsed.unwrap_err().kind(), TooShort);
} else {
assert!(parsed.is_ok());
}
}
}
This same behavior exists for the other (PW=∞) specifiers, such as Year (e.g., NaiveDate::parse_from_str("01-01-+100001", "%m-%d-%Y1")
.
Invalid
Example
Note that this case contains the same specifier-literal pair as above, but the error is different. Probably because the pair is not at the end of the strings.
#[test]
fn test_year2() {
use chrono::{NaiveDate, format::ParseErrorKind::Invalid};
let literals = ['1', 'a'];
for literal in literals {
let input = format!("+10000{literal}-01-01");
let fmt = format!("%Y{literal}-%m-%d");
let parsed = NaiveDate::parse_from_str(&input, &fmt);
if literal.is_ascii_digit() {
assert_eq!(parsed.unwrap_err().kind(), Invalid);
} else {
assert!(parsed.is_ok());
}
}
}
Cause?
I believe this has to do with the fact that scan::number
is greedy against any next digit(s), even if that may consume a next literal. In general, the parse algorithm does not appear to backtrack to find another possible match location. This is an understandable compromise. Allowing for this case would essentially be implementing a regex engine.
I do concede that this kind of input is pathological and would rarely occur in practical usage. But the current implementation feels wrong that it can do this just depending on the "numericality" of a literal, at least without it being documented. And, while there are a few mentions that the parser is greedy, it seems to only consider the cases where the input is missing obvious text from its format. (On the other hand, maybe it's just ok to say the parser is greedy as we currently do, and call it a day. Not sure.)
Options
If this finding is valid, (and indeed, if we do not want to implement a regex-engine-style approach) here are a few ways we could handle it:
-
During input parsing, reject numerically-prefixed literals after variable-length specifiers (
Year
,IsoYear
, andTimestamp
) by returningInvalid
or some new error variant (e.g.GreedyValue
). We could determine this with minimal (?) overhead by something like:// pseudocode // after parsing a last item's text and before parsing the current one... if last_item_parsed_with_variable_width && cur_item_is_literal && cur_item_literal_starts_with_number: return Err(GreedyValue);
Note that
Year
only exhibits this behavior when parsed with a leading+
/-
, so regular 4-digit years would never error this way. -
The same as above, but during format parsing (i.e.,
StrftimeItems::new(fmt)
). This would short-circuit on formats where the input could exhibit this behavior, but would be a false positive in the 4-digit Year case (no+
/-
), which we can only know once we start parsing the input. -
Or, just call this out in the documentation: Update
TooShort
andInvalid
's documentation to include this possibility, and maybe in some other high-level documentation places as well. This would be easiest and obviously cause no performance overhead, but maybe not as correct as we could be.
Versions
- rustc:
rustc 1.77.0-nightly (5518eaa94 2024-01-29)
- chrono:
0.4.33
Wow, nice report here @t-mart !
I recommend providing a PR (assuming @pitdicker @djc are also interested and would like to see a PR first). Pick the option that you feel would be best to implement.
I recommend at least two commits to the PR to demonstrate the behavioral change
- add any necessary new test cases (the test cases probably already exist somewhere in
src/format/parse.rs
) - code change + test case adjustments
Chrono maintainers may decide on a different option but tangible code is a great starting point for a longer discussion.
Overall, I agree that the errors from the various parse methods are broad. When implementing a parse function call, I often have to "throw spaghetti at the wall" until I get something sticking because the error offers little insight.
@t-mart Thank you for the report. We had an issue and closed PR related to the problem of some specifies accepting a variable number of digits without look-ahead: #560 and #561.
The main difficulty is that we support iterating over the format string while iterating over the input string. So we currently don't know the next formatting specifier while parsing some input.
I'll have a better look at this issue later, maybe tomorrow. Would you be willing to work on it?
In general, the parse algorithm does not appear to backtrack to find another possible match location. This is an understandable compromise. Allowing for this case would essentially be implementing a regex engine.
The linked old PR shows a compromise. If backtracking can be limited to only one or a few functions that do the integer parsing I don't think it is a bad solution.
The main difficulty is that we support iterating over the format string while iterating over the input string. So we currently don't know the next formatting specifier while parsing some input.
One option is to add some complexity to the strftime parser to read ahead when it encounters a variable-with formatting specifier. If it is followed by a fixed number of digits (either other formatting specifiers or plain digits), it stores this number of fixed digits as extra information in the formatting item. If a variable-with specifier is followed by another variable-with specifier it is an error.
Then the parser can use this information to leave some digits remaining when parsing a variable-with specifier.
Disadvantage is the double parsing of parts of the format string. And that doing so is unnecessary for the other use: formatting. But that may not turn out to be much of a problem in practice. The double parsing of a few characters in the format string would only happen in cases where we now fail to parse an input string at all.
Thank you for your consideration and guidance, everyone. I intend to PR on this. Give me a few days to get to some free time.
@t-mart thanks for leading effort on this issue.
Can you please share if you had time to look into it?
vector.dev community uses chrono for datetime parsing and interest in this feature is growing with those cases: https://github.com/vectordotdev/vrl/issues/117 https://github.com/vectordotdev/vrl/issues/790 which is mentioned by vector.dev maintainer here: https://github.com/chronotope/chrono/issues/560 as you seem to be aware.
No pressure, just curios on whether slipped through the cracks or any help is needed. Thanks again.
Can you please share if you had time to look into it?
Thanks for the update/reminder, @DimDroll.
Regrettably, I have not had time to start work here. Please jump in!
Fox expectations: this is something that is not an easy fix that can be done in a couple of hours of work (at least it is not easy for me :smile:). Not even sure if it can even be done in a backwards-compatible way.
But feel free to dive in and come up with a plan.