arrow-julia
arrow-julia copied to clipboard
move timezone support to extension
When looking at @time_imports
on a transitive reverse dependency, I saw that TimeZones adds a substantial time to a package that doesn't need that part of Arrow. So I took a stab at moving time zone support to an extension. I've done this in a way that should still be Julia pre 1.9 compatible. There is one slightly hacky thing: the extension mechanism is based primarily on the idea of adding additional methods to existing functions and not for defining additional types. As such, it's a little bit weird (setglobal!
in 1.9) to get types defined in an extension available at the top-level module again.
This is almost assuredly a breaking change because deserialization behavior will change when TimeZones isn't loaded either directly or through some other package. For serialization, I feel like it's less of an issue because if you've writing ZonedDateTime
to disk, then you've already got TimeZones loaded. Still, I thought this was a nice experiment to show a potential speed up. I'm opening this PR as the starting point of a discussion, not because I expect it to be merged immediately.
current main
julia> @time @time_imports using Arrow
1.7 ms LoggingExtras
2.7 ms DataAPI
1.3 ms DataValueInterfaces
1.1 ms IteratorInterfaceExtensions
1.1 ms TableTraits
39.3 ms Tables
22.9 ms SentinelArrays
12.5 ms PooledArrays
1.0 ms Lz4_jll
2.5 ms TranscodingStreams
3.2 ms CodecLz4
0.7 ms Zstd_jll
1.8 ms CEnum
3.7 ms CodecZstd
0.6 ms Scratch
0.4 ms PrecompileTools
8.8 ms RecipesBase
19.8 ms Parsers
5.6 ms InlineStrings
0.7 ms Compat
0.5 ms Compat → CompatLinearAlgebraExt
0.4 ms ExprTools
0.8 ms Mocking
346.8 ms TimeZones 83.55% compilation time (40% recompilation)
11.2 ms BitIntegers
2.7 ms ConcurrentUtilities
0.8 ms EnumX
3.3 ms ArrowTypes
22.5 ms Arrow
0.606927 seconds (1.62 M allocations: 100.383 MiB, 4.84% gc time, 56.60% compilation time: 39% of which was recompilation)
as an extension
julia> @time @time_imports using Arrow
1.5 ms LoggingExtras
0.8 ms DataAPI
0.4 ms DataValueInterfaces
0.4 ms IteratorInterfaceExtensions
0.4 ms TableTraits
20.7 ms Tables
17.5 ms SentinelArrays
12.2 ms PooledArrays
0.8 ms Lz4_jll
2.0 ms TranscodingStreams
3.3 ms CodecLz4
1.0 ms Zstd_jll
3.0 ms CEnum
3.1 ms CodecZstd
12.2 ms BitIntegers
2.1 ms ConcurrentUtilities
0.6 ms EnumX
2.8 ms ArrowTypes
20.7 ms Arrow
0.130832 seconds (201.42 k allocations: 13.516 MiB, 6.01% compilation time)
julia> @time @time_imports using TimeZones
0.6 ms Scratch
0.4 ms PrecompileTools
9.2 ms RecipesBase
19.8 ms Parsers
5.2 ms InlineStrings
0.9 ms Compat
1.2 ms Compat → CompatLinearAlgebraExt
0.5 ms ExprTools
1.2 ms Mocking
341.1 ms TimeZones 82.49% compilation time (44% recompilation)
156.7 ms Arrow → ArrowTimeZonesExt
0.612256 seconds (1.42 M allocations: 87.153 MiB, 2.65% gc time, 54.20% compilation time: 43% of which was recompilation)
to do if this change is desired
- [ ] update documentation to reflect the change in behavior re TimeZones
This is almost assuredly a breaking change because deserialization behavior will change when TimeZones isn't loaded either directly or through some other package
Regardless of breakingness, I'm not sure this is a good idea, since the Arrow format itself has a notion of timezone
, I think it makes sense that Arrow.jl itself can deserialize a table with such types (e.g. coming from another language), without needing to load another package.
IMO it would be better to try to fix invalidation / other sources of re-compilation in TimeZones itself, or otherwise improve it's load time, rather than making it optional.
Codecov Report
Merging #482 (899bddb) into main (95efe95) will increase coverage by
0.01%
. The diff coverage is88.88%
.
@@ Coverage Diff @@
## main #482 +/- ##
==========================================
+ Coverage 87.45% 87.46% +0.01%
==========================================
Files 26 27 +1
Lines 3283 3286 +3
==========================================
+ Hits 2871 2874 +3
Misses 412 412
Files Changed | Coverage Δ | |
---|---|---|
src/Arrow.jl | 97.43% <ø> (ø) |
|
src/eltypes.jl | 85.37% <ø> (-0.39%) |
:arrow_down: |
ext/ArrowTimeZonesExt.jl | 88.88% <88.88%> (ø) |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
I also wasn't sure that this was the best path -- I really did it just to see what would happen. And I also agree that the (re)compilation overhead in TimeZones is a good target for optimization.
That said, there is a different path forward (which I'm not saying is a good idea): define a very restricted Arrow.ArrowTimeZone
type and and make it easy to convert to ZonedDateTime
when TimeZones is loaded. If you don't need to manipulate the columns with timezones (or have none), then you can skip a dependency, while still being able to deserialize a file.
Ugh, we just really need a cleaned up/nice TimeZones.jl package where people don't feel like they need to work-around/avoid it. That's possibly the problem of the Dates stdlib making it easier for TimeZones.jl to integrate cleanly.
Perhaps this is no longer needed when these TimeZones load time improvements are merged: https://github.com/JuliaTime/TimeZones.jl/pull/457.
With the release of TimeZones.jl 1.16 I decide to re-run the timings
Julia 1.9.4 with TimeZones 1.15:
(Arrow) pkg> st
Project Arrow v2.7.2
Status `~/.julia/dev/Arrow/Project.toml`
[31f734f8] ArrowTypes v2.3.0 `src/ArrowTypes`
[c3b6d118] BitIntegers v0.3.1
[5ba52731] CodecLz4 v0.4.3
[6b39b394] CodecZstd v0.8.2
[f0e56b4a] ConcurrentUtilities v2.4.1
[9a962f9c] DataAPI v1.16.0
[4e289a0a] EnumX v1.0.4
[e6f89c97] LoggingExtras v1.0.3
[2dfb63ee] PooledArrays v1.4.3
[91c51154] SentinelArrays v1.4.3
[bd369af6] Tables v1.11.1
⌃ [f269a46b] TimeZones v1.15.0
[3bb67fe8] TranscodingStreams v0.10.8
[ade2ca70] Dates
[a63ad114] Mmap
[cf7118a7] UUIDs
Info Packages marked with ⌃ have new versions available and may be upgradable.
julia> @time @time_imports using Arrow
1.6 ms LoggingExtras
0.8 ms DataAPI
0.4 ms DataValueInterfaces
0.4 ms IteratorInterfaceExtensions
0.3 ms TableTraits
4.2 ms OrderedCollections
17.0 ms Tables
17.1 ms SentinelArrays
10.9 ms PooledArrays
7.7 ms Preferences
0.5 ms JLLWrappers
27.5 ms Lz4_jll 93.79% compilation time
1.4 ms TranscodingStreams
0.4 ms TranscodingStreams → TestExt
3.9 ms CodecLz4
0.8 ms Zstd_jll
2.6 ms CodecZstd
0.4 ms Scratch
0.3 ms PrecompileTools
16.6 ms Parsers
4.3 ms InlineStrings
0.4 ms TZJData
0.7 ms Compat
0.4 ms Compat → CompatLinearAlgebraExt
0.5 ms ExprTools
0.8 ms Mocking
253.5 ms TimeZones 55.54% compilation time (79% recompilation)
10.9 ms BitIntegers
1.8 ms ConcurrentUtilities
0.5 ms EnumX
2.8 ms ArrowTypes
152.1 ms Arrow
0.621405 seconds (1.35 M allocations: 83.477 MiB, 4.91% gc time, 31.84% compilation time: 56% of which was recompilation)
Julia 1.9.4 and TimeZones 1.16:
(Arrow) pkg> st
Project Arrow v2.7.2
Status `~/.julia/dev/Arrow/Project.toml`
[31f734f8] ArrowTypes v2.3.0 `src/ArrowTypes`
[c3b6d118] BitIntegers v0.3.1
[5ba52731] CodecLz4 v0.4.3
[6b39b394] CodecZstd v0.8.2
[f0e56b4a] ConcurrentUtilities v2.4.1
[9a962f9c] DataAPI v1.16.0
[4e289a0a] EnumX v1.0.4
[e6f89c97] LoggingExtras v1.0.3
[2dfb63ee] PooledArrays v1.4.3
[91c51154] SentinelArrays v1.4.3
[bd369af6] Tables v1.11.1
[f269a46b] TimeZones v1.16.0
[3bb67fe8] TranscodingStreams v0.10.8
[ade2ca70] Dates
[a63ad114] Mmap
[cf7118a7] UUIDs
julia> @time @time_imports using Arrow
1.7 ms LoggingExtras
0.7 ms DataAPI
0.4 ms DataValueInterfaces
0.3 ms IteratorInterfaceExtensions
0.3 ms TableTraits
4.3 ms OrderedCollections
16.6 ms Tables
17.3 ms SentinelArrays
11.0 ms PooledArrays
7.4 ms Preferences
0.4 ms JLLWrappers
27.8 ms Lz4_jll 92.37% compilation time
1.4 ms TranscodingStreams
0.4 ms TranscodingStreams → TestExt
3.6 ms CodecLz4
0.8 ms Zstd_jll
2.7 ms CodecZstd
0.4 ms Scratch
0.3 ms PrecompileTools
16.2 ms Parsers
4.3 ms InlineStrings
0.5 ms TZJData
0.8 ms Compat
1.8 ms Compat → CompatLinearAlgebraExt
0.5 ms ExprTools
1.1 ms Mocking
28.9 ms TimeZones
25.9 ms BitIntegers
1.7 ms ConcurrentUtilities
0.6 ms EnumX
2.9 ms ArrowTypes
259.1 ms Arrow
0.523419 seconds (574.55 k allocations: 40.298 MiB, 2.85% gc time, 11.04% compilation time)
The increase in import time for Arrow is most likely to is due to having Arrow.jl code triggering the loading of time zone data.
It looks like the use of @tz_str
is the issue at the moment. I updated to use TimeZone(...)
and got this:
julia> @time @time_imports using Arrow
...
25.9 ms TimeZones
...
2.8 ms ArrowTypes
19.3 ms Arrow
0.284661 seconds (574.62 k allocations: 41.319 MiB, 14.86% gc time, 20.92% compilation time)
I'll attempt to address this in TimeZones.jl