chrono Add niche optimization for `NaiveTime`

Add niche optimization for `NaiveTime`

Open Kijewski opened this issue 1 year ago • 12 comments

NaiveTime contains two integer fields, both with a restricted range:

secs, the second of the day, is less than 86,400.
frac is used for two things. To store the nanosecond of the second, and a leap second, so its highest value is 1,999,999,999.

So, in order to allow niche optimization, one or both values could be implemented in a type that does not allow higher values. Because frac is stored later in the struct, it has the better option to yield optimized code, e.g. when used with Result<NaiveTime, E>. The compiler has a longer consecutive sequence of bytes to store non-NaiveTime its data.

Sep 06 '22 08:09 Kijewski

Also, why not go with the ux crate or the arbitrary-int crate?

(IMO the awesome version of this is a crate that we can use at test time to generate code into our src directory specifically for the types that we need, without adding a compile-time dependency for chrono. This would allow the upstream maintainers to improve the code over time without us having to take responsibility for a particular implementation.)

Sep 07 '22 10:09 djc

Also, why not go with the ux crate or the arbitrary-int crate?

Neither crate uses niche optimization. I have an open PR for ux to implement it there: https://github.com/kjetilkjeka/uX/pull/49.

My system is far from crappy, but it takes me 17 seconds to compile ux in release mode. I don't think this additional compile time would appropriate to add to chrono.

Sep 07 '22 10:09 Kijewski

If we're going to do this, why not take the 15-bit niche from secs instead of 1 bit from frac?

In my current implementation, if the last bit of the struct is set, then the whole thing it not a NaiveTime. So sec can be every value anyway.

NaiveTime is used as the last member in NaiveDateTime, and DateTime, so it better to have the one bit at the very end of the structs, then somewhere in the middle. Otherwise you make the usable byte sequence a little bit smaller.

Sep 07 '22 10:09 Kijewski

My system is far from crappy, but it takes me 17 seconds to compile ux in release mode. I don't think this additional compile time would appropriate to add to chrono.

Fair point!

If we're going to do this, why not take the 15-bit niche from secs instead of 1 bit from frac?

In my current implementation, if the last bit of the struct is set, then the whole thing it not a NaiveTime. So sec can be every value anyway.

NaiveTime is used as the last member in NaiveDateTime, and DateTime, so it better to have the one bit at the very end of the structs, then somewhere in the middle. Otherwise you make the usable byte sequence a little bit smaller.

The compiler is allowed to reorder fields at will, right? Why wouldn't it do so here? And if not, we can trivially reorder secs and frac within NaiveTime.

Sep 07 '22 11:09 djc

The compiler is allowed to reorder fields at will, right? Why wouldn't it do so here? And if not, we can trivially reorder secs and frac within NaiveTime.

I don't think about that. I don't know if the compiler reorders fields to make use of niches. I only know that it will reorder structs to minimize padding because of alignment issues.

It would be actually kind of trivial to copy the file and have a U17 type, also. But that would probably be overkill. :)

Sep 07 '22 11:09 Kijewski

Sorry to jump in kind of late here, but could we use something like: nonmax to hold the nanos?

Sep 07 '22 12:09 esheppa

nonmax generates terrible byte code, because it has to negate the value on every access. Most likely twice: once for reading, once for writing. This is enough to make the optimizer not understand what is going on.

Otherwise, how often will you access the data? It is obviously great that nonmax does not need unsafe operations to access the data. Nonmax could be "good enough".

Sep 07 '22 12:09 Kijewski

Is it worth us benchmarking both methods so that we have some empirical evidence to inform the direction we choose?

Sep 07 '22 12:09 esheppa

In general, these questions haven't been answered yet:

How have you tested this? Do you have an actual use case where this optimization has a substantial impact?

Sep 07 '22 12:09 djc

Is it worth us benchmarking both methods so that we have some empirical evidence to inform the direction we choose?

Old values use nonmax, new values use U31:

bench_datetime_parse_from_rfc2822                                                                            
                        time:   [196.87 ns 197.23 ns 197.60 ns]
                        change: [+5.3629% +5.6315% +5.8895%] (p = 0.00 < 0.05)
                        Performance has regressed.

bench_datetime_parse_from_rfc3339                                                                            
                        time:   [163.58 ns 164.23 ns 164.98 ns]
                        change: [+6.6111% +7.8651% +9.1036%] (p = 0.00 < 0.05)
                        Performance has regressed.

bench_datetime_from_str time:   [245.06 ns 246.12 ns 247.37 ns]                                    
                        change: [-4.6067% -3.6377% -2.7998%] (p = 0.00 < 0.05)
                        Performance has improved.

bench_datetime_to_rfc2822                                                                             
                        time:   [565.76 ns 568.01 ns 570.94 ns]
                        change: [-2.2735% -1.8146% -1.3809%] (p = 0.00 < 0.05)
                        Performance has improved.

bench_ser_naivedatetime_writer                                                                            
                        time:   [285.23 ns 287.71 ns 290.49 ns]
                        change: [-4.5103% -3.4291% -2.3967%] (p = 0.00 < 0.05)
                        Performance has improved.

bench_ser_naivedatetime_string                                                                            
                        time:   [301.43 ns 302.13 ns 302.88 ns]
                        change: [-6.1510% -5.6113% -5.0359%] (p = 0.00 < 0.05)
                        Performance has improved.

The other benches were within noise threshold. So, bench_datetime_parse_from_rfc2822 & bench_datetime_parse_from_rfc3339 are significantly faster with nonmax, and bench_ser_naivedatetime_writer & bench_ser_naivedatetime_string are faster with U31. I ran both benchmarks twice, so there's a chance that this results are actually meaningful.

This is a strong indication to throw this PR away and go with nonmax instead. Otherwise, there seem to be no benchmarks for arithmetic operations or am I overlooking them?

Sep 07 '22 12:09 Kijewski

How have you tested this?

Whether the optimization works at all you mean? https://github.com/chronotope/chrono/pull/811/commits/fb79395dfbba15f1d51bb47b4934fab5f9613643

Do you have an actual use case where this optimization has a substantial impact?

Any API that returns a fallible DateTime, or stores an Option<DateTime>. In my case I have the latter a lot.

Sep 07 '22 12:09 Kijewski

Interestingly if this works nicely with nonmax, potentially it could be used to replace the i32 of the DateImpl as well, allowing Option<NaiveDate> to be the same size as NaiveDate

Sep 07 '22 12:09 esheppa

chrono chrono copied to clipboard

Add niche optimization for `NaiveTime`

chrono
chrono copied to clipboard