improve ISO8601 support
Hello!
I created a script doing DB dumps in YAML and doing that I manipulated dates extensively. I noticed that in a rare case, yaml.stringify could produce a valid ISO8601 date that would not be parsed as a Date by yaml.parse(). This made me decide to improve the RegExp used by yaml so that it supports at least the formats it can output. But since I was tinkling with those, I improved ISO8601 support overall.
The case that could not work: when a date had no seconds nor milliseconds, it was trimmed by this function:
stringify: ({ value }) =>
(value as Date).toISOString().replace(/((T00:00)?:00)?\.000Z$/, '')
Which resulted in valid dates like: 2024-10-02T09:00. Sadly, once in .parse(value, { customTags: ["timestamp"] }) provide a value of "2024-10-02T09:00" instead of the expected Date instance.
I also added support for:
- optional minutes
- optional
-characters - optional
:characters - limit the number of characters for ms to max 3
Note: though 20241002 is a valid ISO8601 date, yaml.parse() still gives priority to the number type above timestamp. Though both are testing true. Could that become an issue? The YAML spec is not very explicit about what timestamp formats should be allowed and how they should be disambiguated.
This one's tricky, because for YAML 1.1 !!timestamp has a rather explicit definition: https://yaml.org/type/timestamp.html
Could it be that you're attempting to parse serialized timestamps without overriding the YAML version from its default of 1.2, which does not support !!timestamp? Or is there some date which is serialized in a way that does not match the expectations of the parser?
Separately, there does seem to be a bug to fix in the timestamp serialiser, as this ought to include an explicit tag:
> YAML.stringify(new Date(), { customTags: ["timestamp"] })
'2024-10-02T11:22:59.726Z\n'
This does work for other custom tags, such as:
> YAML.stringify(new Set(), { customTags: ["set"] })
'!!set {}\n'
Thanks @eemeli, I missed this page about the timestamp definition. It was exactly the one I was searching for but could not put my hands on! Then the RegExp currently in use is more broad than the one officially supported but even so does not support the output format of stringify that is still not compliant to the timestamp spec.
2024-01-02T02:00 is a possible output of the current stringify timestamp's method. It does not parse as a Date since it does not satisfy the current regexp.
What would like me to do? I see 3 possible things:
- support any ISO8601 in its extended form (that is the most compliant to the current timestamp spec, but still broader than the spec and a little broader than the current implementation)
- support all ISO8801 dates (even without - and :) : that corresponds to this MR
- restrict the current RegExp to the one provided in the spec and change stringify to output only valid dates (I think removing the
replace()is the best thing to do: it will improve performances and only output ISO8601 complete dates that will be more similar across the board)
The problem with 3. is that it might prevent yaml from correctly parsing files it issued before (even so as I stated before, it can already happen in some rare instances). But I think being more compliant to the YAML spec is the better outcome longterm to prevent possible issues with other implementations.
Buggy example just for the sake of being complete and well understood:
const yaml = require("yaml");
const res = yaml.parse(
yaml.stringify(
{ date: new Date("2024-04-01T02:00:00Z") },
{ customTags: ["timestamp"] }
),
{ customTags: ["timestamp"] }
);
console.log(res.date, res.date instanceof Date)
// 2024-04-01T02:00 false
Ok, so a couple of things are wrong here:
- The test regexp indeed does not match the spec exactly, as it allows single-digit minutes and seconds, does not allow a trailing
.without digits as the fractional part, and does not support time offsets of 30 hours or more. - Stripping the integer seconds from the time on serialization is creating output that won't re-parse as a timestamp.
- When serialising with
customTags: ["timestamp"], the value is not explicitly tagged.
I think the test regexp should not be changed, as that's liable to be a breaking change either way, and the deviations aren't too bad.
The second-stripping during serialization should be dropped.
The !!timestamp tag should be explicitly included when it's used as a custom tag, rather than as a part of the YAML 1.1 core schema.
I mostly agree about everything you just said. The only thing I wonder is about the test RegExp. I think it could be expanded to support everything that the official doc supports but keep support of the small things it parses that are not in spec (if they are any). The idea would be to be at least compliant. It would not be too complicated no? Maybe it's already the case. Hard to compare them…
As far as I can tell, the only features that the current test regexp does not support and the one in the spec does are:
- Seconds ending in a dot, as in
2001-12-15T02:59:43. - Time offsets of 30 or more hours, as in
2001-12-14T21:59:43.10-35:00
Both of those look like mistakes in the original spec, tbh. I'd be happy to consider adding support for them if there's a real-world case that is impacted, but that seems unlikely.
I think you are right.
Looking at the codebase, it seems to me that custom tags use default: true to know if they should emit an explicit tag. Since we need to conditionalize this to the schema used, I think it requires to support a function here right? If so, fixing this is a little more involved.
I propose to change this MR to fix 2.. That would take care of the biggest issue (emitting values in a format that do not parse back into a similar data type).
Once that is merged, we can do another one to tackle 3..
I think you are correct about the currect regexp.
Looking at the codebase, it seems to me that custom tags use
default: trueto know if they should emit an explicit tag. Since we need to conditionalize this to the schema used, I think it requires to support a function here right? If so, fixing this is a little more involved.
There's an existing pattern used here that should probably work for this use case as well: https://github.com/eemeli/yaml/blob/5adbb605b64094b57bc95cbc24587e6f36b3f9a8/src/compose/compose-scalar.ts#L75-L77
@eemeli I just updated this MR with the fix for 2. Stripping the integer seconds from the time on serialization is creating output that won't re-parse as a timestamp. I think 3. should be fixed in a follow-up.