v icon indicating copy to clipboard operation
v copied to clipboard

time: improve performance of parse_rfc3339/1

Open enghitalo opened this issue 1 year ago • 13 comments

Before

 SPENT   738.004 ms in decoder2.decode[time.Time]('2022-03-11T13:54:25')!

Now

 SPENT    41.814 ms in decoder2.decode[time.Time]('2022-03-11T13:54:25')!

NOTE: I will base myself in this code to create a Parser struct then it can be useful in others standard like above ones https://ijmacd.github.io/rfc3339-iso8601/

enghitalo avatar Oct 19 '24 14:10 enghitalo

Wow, huge increase in performance, great job!

esquerbatua avatar Oct 19 '24 14:10 esquerbatua

The failure in TOML seems to happen for values like this:

import time

t := time.parse_rfc3339('1979-05-27T07:32:00-08:00')!
dump(t)

On master, that prints: [a.v:4] t: 1979-05-27 15:32:00

On this PR, that prints: [a.v:4] t: 1979-05-26 23:32:00

spytheman avatar Oct 19 '24 22:10 spytheman

Other examples, that work on master, but do not here:

import time

dump(time.parse_rfc3339('1979-05-27T07:32:00-08:00')!)
dump(time.parse_rfc3339('22:47:08Z')!)
dump(time.parse_rfc3339('01:47:08.981+03:00')!)
dump(time.parse_rfc3339('2024-10-19T22:47:08-00:00')!)
dump(time.parse_rfc3339('2024-10-19T22:47:08.9+00:00')!)
dump(time.parse_rfc3339('2024-10-20T01:47:08+03:00')!)
dump(time.parse_rfc3339('2024-10-20T01:47:08.981+03:00')!)

On master:

[a.v:3] time.parse_rfc3339('1979-05-27T07:32:00-08:00')!: 1979-05-27 15:32:00
[a.v:4] time.parse_rfc3339('22:47:08Z')!: 0000-11-30 22:47:08
[a.v:5] time.parse_rfc3339('01:47:08.981+03:00')!: 0000-11-29 22:47:08
[a.v:6] time.parse_rfc3339('2024-10-19T22:47:08-00:00')!: 2024-10-19 22:47:08
[a.v:7] time.parse_rfc3339('2024-10-19T22:47:08.9+00:00')!: 2024-10-19 22:47:08
[a.v:8] time.parse_rfc3339('2024-10-20T01:47:08+03:00')!: 2024-10-19 22:47:08
[a.v:9] time.parse_rfc3339('2024-10-20T01:47:08.981+03:00')!: 2024-10-19 22:47:08

On the PR:

[a.v:3] time.parse_rfc3339('1979-05-27T07:32:00-08:00')!: 1979-05-26 23:32:00
================ V panic ================
   module: main
 function: main()
  message: Invalid time format code: 1, error: datetime string is too short
     file: a.v:4

spytheman avatar Oct 19 '24 22:10 spytheman

I will keep parsing with is_local: false for performance reason.

I think that code, should work as a first priority, and then it can be optimised.

Breaking users of that parser should not happen, unless the parser has a bug, that has to be fixed.

spytheman avatar Oct 19 '24 22:10 spytheman

You can get more time examples from: https://ijmacd.github.io/rfc3339-iso8601/ .

spytheman avatar Oct 19 '24 22:10 spytheman

@spytheman I think master is broken for 22:47:08Z

  • code
import time

fn main() {
	dump(time.parse_rfc3339('22:47:08Z')!)
}
  • result
time.parse_rfc3339('22:47:08Z')!: 0000-11-30 22:47:08

What do you think should be the result? 0000-00-00 22:47:08 looks great

enghitalo avatar Oct 19 '24 23:10 enghitalo

Only the time part should be set in this case, since the date is not specified.

Note however, that on master, it does not return an error for this case, while it does here, which later leads to problems for the toml tests.

spytheman avatar Oct 19 '24 23:10 spytheman

If only a time is specified, and a date is needed, then use the current date. This seems like a logical, reasonable alternative to trying to decide if 0000-00-00 is good or not.

JalonSolov avatar Oct 20 '24 00:10 JalonSolov

If only a time is specified, and a date is needed, then use the current date. This seems like a logical, reasonable alternative to trying to decide if 0000-00-00 is good or not.

No, it is not, since that is ambiguous and time dependent. If a system sends you a timestamp, or records it in a log, you can not know what the time was, when that timestamp was generated by the system.

Consider also the case, where you have just a date: 2024-10-20 . If you receive such a time, you will not set the time part of it to the arbitrary current hour:minute:second.

In contrast, setting the unknown fields to 0, is predictable, simple and unambiguous.

spytheman avatar Oct 20 '24 00:10 spytheman

toml will need some fix

  • If date and time are given but offset is None, Datetime corresponds to a Local Date-Time.

from https://docs.rs/toml/latest/toml/value/struct.Datetime.html image

from https://toml.io/en/ image

enghitalo avatar Oct 20 '24 01:10 enghitalo

The toml module has very extensive tests, including external ones (see .github/workflows/download_full_toml_test_suites.sh and .github/workflows/toml_ci.yml that runs it).

I suspect (but can be wrong of course), that the bug is more likely to be in the new parser here, since that is the most recent thing that is changed.

spytheman avatar Oct 20 '24 01:10 spytheman

Here is the actual spec: https://toml.io/en/v1.0.0#local-date-time image

spytheman avatar Oct 20 '24 01:10 spytheman

If only a time is specified, and a date is needed, then use the current date. This seems like a logical, reasonable alternative to trying to decide if 0000-00-00 is good or not.

No, it is not, since that is ambiguous and time dependent. If a system sends you a timestamp, or records it in a log, you can not know what the time was, when that timestamp was generated by the system.

Consider also the case, where you have just a date: 2024-10-20 . If you receive such a time, you will not set the time part of it to the arbitrary current hour:minute:second.

In contrast, setting the unknown fields to 0, is predictable, simple and unambiguous.

So you're saying the time would be set to 00:00:00... which is exactly midnight.

00:00:00 is a valid time. 0000-00-00 is an invalid date (at least in any computer system I've ever seen. I suppose you could argue it was the exact instant of the Big Bang, but...).

It doesn't make sense to set one to a valid value, and the other to an invalid value.

JalonSolov avatar Oct 20 '24 03:10 JalonSolov

Is 0000-01-01 better for you? It has valid year, valid month, and valid day. For that particular format, any fixed date part will work. We only have information about the time part.

spytheman avatar Oct 20 '24 06:10 spytheman

00:00:00 is a valid time. 0000-00-00 is an invalid date (at least in any computer system I've ever seen. I suppose you could argue it was the exact instant of the Big Bang, but...).

To me, 00:00:00 and 0000-00-00 make sense, once Time not specifically is date, time or data-time. There is the possibility to implement a field with a enum signing if it is .date, .time or .date_time; or change fields hour, minute, second [...] to option, if do you prefer

enghitalo avatar Oct 20 '24 06:10 enghitalo

No, the result should be a valid date, @JalonSolov has a point. time.Time{} is an invalid one (having all fields be 0s). Having a separate field to make it modal will lead to a lot more edge cases and bugs.

spytheman avatar Oct 20 '24 07:10 spytheman

On master, the fixed date is 0000-11-30, which is also valid, it is just not intuitive.

spytheman avatar Oct 20 '24 07:10 spytheman

It is interesting, that Go for example, refuses to parse 22:47:08Z:

package main

import (
	"fmt"
	"time"
)

func main() {
	t, err := time.Parse(time.RFC3339, "22:47:08Z")
	if err != nil {
		fmt.Println(err)
	}
	fmt.Println(t)
}

produces:

parsing time "22:47:08Z" as "2006-01-02T15:04:05Z07:00": cannot parse "22:47:08Z" as "2006"
0001-01-01 00:00:00 +0000 UTC

Rust also errors out for it:

use chrono::format::ParseError;
use chrono::{DateTime};

fn main() -> Result<(), ParseError> {
    let rfc3339 = DateTime::parse_from_rfc3339("2016-06-20T12:41:45.14Z")?;
    println!("{}", rfc3339);
    let rfc3339 = DateTime::parse_from_rfc3339("22:47:08Z")?;
    println!("{}", rfc3339);
    Ok(())
}

produces:

    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/smalltime`
2016-06-20 12:41:45.140 +00:00
Error: ParseError(Invalid)

spytheman avatar Oct 20 '24 08:10 spytheman

I could not find a good way to test for python (its standard library does not appear to have a rfc3339 parser), or for JS.

spytheman avatar Oct 20 '24 08:10 spytheman

This https://it-tools.tech/date-converter could handle it, but not as RFC3339, but as something called "Mongo ObjectID", whatever it is 🤔 . For that format, it produces: image

For RFC3339: image

spytheman avatar Oct 20 '24 08:10 spytheman

from https://www.rfc-editor.org/rfc/rfc3339#section-4.3

4. Local Time

4.1. Coordinated Universal Time (UTC)

   Because the daylight saving rules for local time zones are so
   convoluted and can change based on local law at unpredictable times,
   true interoperability is best achieved by using Coordinated Universal
   Time (UTC).  This specification does not cater to local time zone
   rules.

image

enghitalo avatar Oct 20 '24 09:10 enghitalo

I do not see, how that affects what the result of 22:47:08Z should be?

spytheman avatar Oct 20 '24 10:10 spytheman

In reading rfc 3339 over (and over and over and...), there is no clear statement that a date needs to be included. HOWEVER, if you look at the ABNF for the format, you see

   full-date       = date-fullyear "-" date-month "-" date-mday
   full-time       = partial-time time-offset

   date-time       = full-date "T" full-time

which at least implies that if you don't have ALL the digits, plus the T plus the time offset, then it is not a valid rfc 3339 format date. This would explain why Go and Rust fail with an error when asked to parse a time only.

This also means that TOML does not support rfc 3339, whether it says so or not. rfc 3339 if very explicit that the time offset must be supplied, as well as the date, there the last 2 example copied from the TOML docs in the screenshot above show just times, and without offsets.

JalonSolov avatar Oct 21 '24 00:10 JalonSolov

@JalonSolov from https://toml.io/en/v1.0.0#offset-date-time

[Offset Date-Time](https://toml.io/en/v1.0.0#offset-date-time)

To unambiguously represent a specific instant in time, 
you *may* use an [RFC 3339](https://tools.ietf.org/html/rfc3339) 
formatted date-time with offset.

...
[Local Date-Time](https://toml.io/en/v1.0.0#local-date-time)

If you omit the offset from an [RFC 3339](https://tools.ietf.org/html/rfc3339)
formatted date-time, it will represent the given date-time *without any relation
to an offset or timezone*. It cannot be converted to an instant in time without
additional information. Conversion to an instant, if required, is *implementation-specific*.

...
[Local Date](https://toml.io/en/v1.0.0#local-date)

If you include only the date portion of an [RFC 3339](https://tools.ietf.org/html/rfc3339)
formatted date-time, it will represent that entire day without any relation to an offset
or timezone.

...
[Local Time](https://toml.io/en/v1.0.0#local-time)

If you include *only the time portion* of an [RFC 3339](https://tools.ietf.org/html/rfc3339)
formatted date-time, it will represent that time of day *without any relation to a 
specific day* or any offset or timezone.

i.e. in my interpretation, the TOML spec does not require a full RFC3339 date-time but it allows it. It also allows for time from that spec too.

The shown examples in the linked spec, and the test files in the official suits also confirm it: image

I.e. TOML is very liberal in what it allows, while RFC3339 on the other hand, requires the time zone.

spytheman avatar Oct 21 '24 05:10 spytheman

I also found this specialized Python parser module: https://pypi.org/project/pyRFC3339/ . It also rejects 22:47:08Z : image

spytheman avatar Oct 21 '24 06:10 spytheman

I've just changed it, so that time.parse_rfc3339 now returns an error, for incomplete time (missing date), like all of the other existing rfc3339 parsers do.

The TOML module now supplies the missing date part, when needed, so that it can continue to be conforming to its own relaxed spec.

spytheman avatar Oct 21 '24 08:10 spytheman

This C implementation, also errors on 22:47:08Z: https://github.com/Bnz-0/rfc3339timestamp/blob/master/rfc3339.h#L52

spytheman avatar Oct 21 '24 08:10 spytheman