v
v copied to clipboard
time: improve performance of parse_rfc3339/1
Before
SPENT 738.004 ms in decoder2.decode[time.Time]('2022-03-11T13:54:25')!
Now
SPENT 41.814 ms in decoder2.decode[time.Time]('2022-03-11T13:54:25')!
NOTE: I will base myself in this code to create a Parser struct then it can be useful in others standard like above ones
https://ijmacd.github.io/rfc3339-iso8601/
Wow, huge increase in performance, great job!
The failure in TOML seems to happen for values like this:
import time
t := time.parse_rfc3339('1979-05-27T07:32:00-08:00')!
dump(t)
On master, that prints:
[a.v:4] t: 1979-05-27 15:32:00
On this PR, that prints:
[a.v:4] t: 1979-05-26 23:32:00
Other examples, that work on master, but do not here:
import time
dump(time.parse_rfc3339('1979-05-27T07:32:00-08:00')!)
dump(time.parse_rfc3339('22:47:08Z')!)
dump(time.parse_rfc3339('01:47:08.981+03:00')!)
dump(time.parse_rfc3339('2024-10-19T22:47:08-00:00')!)
dump(time.parse_rfc3339('2024-10-19T22:47:08.9+00:00')!)
dump(time.parse_rfc3339('2024-10-20T01:47:08+03:00')!)
dump(time.parse_rfc3339('2024-10-20T01:47:08.981+03:00')!)
On master:
[a.v:3] time.parse_rfc3339('1979-05-27T07:32:00-08:00')!: 1979-05-27 15:32:00
[a.v:4] time.parse_rfc3339('22:47:08Z')!: 0000-11-30 22:47:08
[a.v:5] time.parse_rfc3339('01:47:08.981+03:00')!: 0000-11-29 22:47:08
[a.v:6] time.parse_rfc3339('2024-10-19T22:47:08-00:00')!: 2024-10-19 22:47:08
[a.v:7] time.parse_rfc3339('2024-10-19T22:47:08.9+00:00')!: 2024-10-19 22:47:08
[a.v:8] time.parse_rfc3339('2024-10-20T01:47:08+03:00')!: 2024-10-19 22:47:08
[a.v:9] time.parse_rfc3339('2024-10-20T01:47:08.981+03:00')!: 2024-10-19 22:47:08
On the PR:
[a.v:3] time.parse_rfc3339('1979-05-27T07:32:00-08:00')!: 1979-05-26 23:32:00
================ V panic ================
module: main
function: main()
message: Invalid time format code: 1, error: datetime string is too short
file: a.v:4
I will keep parsing with is_local: false for performance reason.
I think that code, should work as a first priority, and then it can be optimised.
Breaking users of that parser should not happen, unless the parser has a bug, that has to be fixed.
You can get more time examples from: https://ijmacd.github.io/rfc3339-iso8601/ .
@spytheman I think master is broken for 22:47:08Z
- code
import time
fn main() {
dump(time.parse_rfc3339('22:47:08Z')!)
}
- result
time.parse_rfc3339('22:47:08Z')!: 0000-11-30 22:47:08
What do you think should be the result? 0000-00-00 22:47:08 looks great
Only the time part should be set in this case, since the date is not specified.
Note however, that on master, it does not return an error for this case, while it does here, which later leads to problems for the toml tests.
If only a time is specified, and a date is needed, then use the current date. This seems like a logical, reasonable alternative to trying to decide if 0000-00-00 is good or not.
If only a time is specified, and a date is needed, then use the current date. This seems like a logical, reasonable alternative to trying to decide if 0000-00-00 is good or not.
No, it is not, since that is ambiguous and time dependent. If a system sends you a timestamp, or records it in a log, you can not know what the time was, when that timestamp was generated by the system.
Consider also the case, where you have just a date: 2024-10-20 . If you receive such a time, you will not set the time part of it to the arbitrary current hour:minute:second.
In contrast, setting the unknown fields to 0, is predictable, simple and unambiguous.
toml will need some fix
- If date and time are given but offset is None, Datetime corresponds to a Local Date-Time.
from https://docs.rs/toml/latest/toml/value/struct.Datetime.html
from https://toml.io/en/
The toml module has very extensive tests, including external ones (see .github/workflows/download_full_toml_test_suites.sh and .github/workflows/toml_ci.yml that runs it).
I suspect (but can be wrong of course), that the bug is more likely to be in the new parser here, since that is the most recent thing that is changed.
Here is the actual spec: https://toml.io/en/v1.0.0#local-date-time
If only a time is specified, and a date is needed, then use the current date. This seems like a logical, reasonable alternative to trying to decide if 0000-00-00 is good or not.
No, it is not, since that is ambiguous and time dependent. If a system sends you a timestamp, or records it in a log, you can not know what the time was, when that timestamp was generated by the system.
Consider also the case, where you have just a date:
2024-10-20. If you receive such a time, you will not set the time part of it to the arbitrary current hour:minute:second.In contrast, setting the unknown fields to 0, is predictable, simple and unambiguous.
So you're saying the time would be set to 00:00:00... which is exactly midnight.
00:00:00 is a valid time. 0000-00-00 is an invalid date (at least in any computer system I've ever seen. I suppose you could argue it was the exact instant of the Big Bang, but...).
It doesn't make sense to set one to a valid value, and the other to an invalid value.
Is 0000-01-01 better for you?
It has valid year, valid month, and valid day.
For that particular format, any fixed date part will work. We only have information about the time part.
00:00:00 is a valid time. 0000-00-00 is an invalid date (at least in any computer system I've ever seen. I suppose you could argue it was the exact instant of the Big Bang, but...).
To me, 00:00:00 and 0000-00-00 make sense, once Time not specifically is date, time or data-time. There is the possibility to implement a field with a enum signing if it is .date, .time or .date_time; or change fields hour, minute, second [...] to option, if do you prefer
No, the result should be a valid date, @JalonSolov has a point. time.Time{} is an invalid one (having all fields be 0s).
Having a separate field to make it modal will lead to a lot more edge cases and bugs.
On master, the fixed date is 0000-11-30, which is also valid, it is just not intuitive.
It is interesting, that Go for example, refuses to parse 22:47:08Z:
package main
import (
"fmt"
"time"
)
func main() {
t, err := time.Parse(time.RFC3339, "22:47:08Z")
if err != nil {
fmt.Println(err)
}
fmt.Println(t)
}
produces:
parsing time "22:47:08Z" as "2006-01-02T15:04:05Z07:00": cannot parse "22:47:08Z" as "2006"
0001-01-01 00:00:00 +0000 UTC
Rust also errors out for it:
use chrono::format::ParseError;
use chrono::{DateTime};
fn main() -> Result<(), ParseError> {
let rfc3339 = DateTime::parse_from_rfc3339("2016-06-20T12:41:45.14Z")?;
println!("{}", rfc3339);
let rfc3339 = DateTime::parse_from_rfc3339("22:47:08Z")?;
println!("{}", rfc3339);
Ok(())
}
produces:
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
Running `target/debug/smalltime`
2016-06-20 12:41:45.140 +00:00
Error: ParseError(Invalid)
I could not find a good way to test for python (its standard library does not appear to have a rfc3339 parser), or for JS.
This https://it-tools.tech/date-converter could handle it, but not as RFC3339, but as something called "Mongo ObjectID", whatever it is 🤔 .
For that format, it produces:
For RFC3339:
from https://www.rfc-editor.org/rfc/rfc3339#section-4.3
4. Local Time
4.1. Coordinated Universal Time (UTC)
Because the daylight saving rules for local time zones are so
convoluted and can change based on local law at unpredictable times,
true interoperability is best achieved by using Coordinated Universal
Time (UTC). This specification does not cater to local time zone
rules.
I do not see, how that affects what the result of 22:47:08Z should be?
In reading rfc 3339 over (and over and over and...), there is no clear statement that a date needs to be included. HOWEVER, if you look at the ABNF for the format, you see
full-date = date-fullyear "-" date-month "-" date-mday
full-time = partial-time time-offset
date-time = full-date "T" full-time
which at least implies that if you don't have ALL the digits, plus the T plus the time offset, then it is not a valid rfc 3339 format date. This would explain why Go and Rust fail with an error when asked to parse a time only.
This also means that TOML does not support rfc 3339, whether it says so or not. rfc 3339 if very explicit that the time offset must be supplied, as well as the date, there the last 2 example copied from the TOML docs in the screenshot above show just times, and without offsets.
@JalonSolov from https://toml.io/en/v1.0.0#offset-date-time
[Offset Date-Time](https://toml.io/en/v1.0.0#offset-date-time)
To unambiguously represent a specific instant in time,
you *may* use an [RFC 3339](https://tools.ietf.org/html/rfc3339)
formatted date-time with offset.
...
[Local Date-Time](https://toml.io/en/v1.0.0#local-date-time)
If you omit the offset from an [RFC 3339](https://tools.ietf.org/html/rfc3339)
formatted date-time, it will represent the given date-time *without any relation
to an offset or timezone*. It cannot be converted to an instant in time without
additional information. Conversion to an instant, if required, is *implementation-specific*.
...
[Local Date](https://toml.io/en/v1.0.0#local-date)
If you include only the date portion of an [RFC 3339](https://tools.ietf.org/html/rfc3339)
formatted date-time, it will represent that entire day without any relation to an offset
or timezone.
...
[Local Time](https://toml.io/en/v1.0.0#local-time)
If you include *only the time portion* of an [RFC 3339](https://tools.ietf.org/html/rfc3339)
formatted date-time, it will represent that time of day *without any relation to a
specific day* or any offset or timezone.
i.e. in my interpretation, the TOML spec does not require a full RFC3339 date-time
but it allows it. It also allows for time from that spec too.
The shown examples in the linked spec, and the test files in the official
suits also confirm it:
I.e. TOML is very liberal in what it allows, while RFC3339 on the other hand, requires the time zone.
I also found this specialized Python parser module: https://pypi.org/project/pyRFC3339/ .
It also rejects 22:47:08Z :
I've just changed it, so that time.parse_rfc3339 now returns an error, for incomplete time (missing date), like all of the other existing rfc3339 parsers do.
The TOML module now supplies the missing date part, when needed, so that it can continue to be conforming to its own relaxed spec.
This C implementation, also errors on 22:47:08Z:
https://github.com/Bnz-0/rfc3339timestamp/blob/master/rfc3339.h#L52