arrow-julia icon indicating copy to clipboard operation
arrow-julia copied to clipboard

move timezone support to extension

Open palday opened this issue 1 year ago • 7 comments

When looking at @time_imports on a transitive reverse dependency, I saw that TimeZones adds a substantial time to a package that doesn't need that part of Arrow. So I took a stab at moving time zone support to an extension. I've done this in a way that should still be Julia pre 1.9 compatible. There is one slightly hacky thing: the extension mechanism is based primarily on the idea of adding additional methods to existing functions and not for defining additional types. As such, it's a little bit weird (setglobal! in 1.9) to get types defined in an extension available at the top-level module again.

This is almost assuredly a breaking change because deserialization behavior will change when TimeZones isn't loaded either directly or through some other package. For serialization, I feel like it's less of an issue because if you've writing ZonedDateTime to disk, then you've already got TimeZones loaded. Still, I thought this was a nice experiment to show a potential speed up. I'm opening this PR as the starting point of a discussion, not because I expect it to be merged immediately.

current main

julia> @time @time_imports using Arrow
      1.7 ms  LoggingExtras
      2.7 ms  DataAPI
      1.3 ms  DataValueInterfaces
      1.1 ms  IteratorInterfaceExtensions
      1.1 ms  TableTraits
     39.3 ms  Tables
     22.9 ms  SentinelArrays
     12.5 ms  PooledArrays
      1.0 ms  Lz4_jll
      2.5 ms  TranscodingStreams
      3.2 ms  CodecLz4
      0.7 ms  Zstd_jll
      1.8 ms  CEnum
      3.7 ms  CodecZstd
      0.6 ms  Scratch
      0.4 ms  PrecompileTools
      8.8 ms  RecipesBase
     19.8 ms  Parsers
      5.6 ms  InlineStrings
      0.7 ms  Compat
      0.5 ms  Compat → CompatLinearAlgebraExt
      0.4 ms  ExprTools
      0.8 ms  Mocking
    346.8 ms  TimeZones 83.55% compilation time (40% recompilation)
     11.2 ms  BitIntegers
      2.7 ms  ConcurrentUtilities
      0.8 ms  EnumX
      3.3 ms  ArrowTypes
     22.5 ms  Arrow
  0.606927 seconds (1.62 M allocations: 100.383 MiB, 4.84% gc time, 56.60% compilation time: 39% of which was recompilation)

as an extension

julia> @time @time_imports using Arrow
      1.5 ms  LoggingExtras
      0.8 ms  DataAPI
      0.4 ms  DataValueInterfaces
      0.4 ms  IteratorInterfaceExtensions
      0.4 ms  TableTraits
     20.7 ms  Tables
     17.5 ms  SentinelArrays
     12.2 ms  PooledArrays
      0.8 ms  Lz4_jll
      2.0 ms  TranscodingStreams
      3.3 ms  CodecLz4
      1.0 ms  Zstd_jll
      3.0 ms  CEnum
      3.1 ms  CodecZstd
     12.2 ms  BitIntegers
      2.1 ms  ConcurrentUtilities
      0.6 ms  EnumX
      2.8 ms  ArrowTypes
     20.7 ms  Arrow
  0.130832 seconds (201.42 k allocations: 13.516 MiB, 6.01% compilation time)

julia> @time @time_imports using TimeZones
      0.6 ms  Scratch
      0.4 ms  PrecompileTools
      9.2 ms  RecipesBase
     19.8 ms  Parsers
      5.2 ms  InlineStrings
      0.9 ms  Compat
      1.2 ms  Compat → CompatLinearAlgebraExt
      0.5 ms  ExprTools
      1.2 ms  Mocking
    341.1 ms  TimeZones 82.49% compilation time (44% recompilation)
    156.7 ms  Arrow → ArrowTimeZonesExt
  0.612256 seconds (1.42 M allocations: 87.153 MiB, 2.65% gc time, 54.20% compilation time: 43% of which was recompilation)

to do if this change is desired

  • [ ] update documentation to reflect the change in behavior re TimeZones

palday avatar Aug 14 '23 19:08 palday

This is almost assuredly a breaking change because deserialization behavior will change when TimeZones isn't loaded either directly or through some other package

Regardless of breakingness, I'm not sure this is a good idea, since the Arrow format itself has a notion of timezone, I think it makes sense that Arrow.jl itself can deserialize a table with such types (e.g. coming from another language), without needing to load another package.

IMO it would be better to try to fix invalidation / other sources of re-compilation in TimeZones itself, or otherwise improve it's load time, rather than making it optional.

ericphanson avatar Aug 16 '23 10:08 ericphanson

Codecov Report

Merging #482 (899bddb) into main (95efe95) will increase coverage by 0.01%. The diff coverage is 88.88%.

@@            Coverage Diff             @@
##             main     #482      +/-   ##
==========================================
+ Coverage   87.45%   87.46%   +0.01%     
==========================================
  Files          26       27       +1     
  Lines        3283     3286       +3     
==========================================
+ Hits         2871     2874       +3     
  Misses        412      412              
Files Changed Coverage Δ
src/Arrow.jl 97.43% <ø> (ø)
src/eltypes.jl 85.37% <ø> (-0.39%) :arrow_down:
ext/ArrowTimeZonesExt.jl 88.88% <88.88%> (ø)

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Aug 16 '23 10:08 codecov-commenter

I also wasn't sure that this was the best path -- I really did it just to see what would happen. And I also agree that the (re)compilation overhead in TimeZones is a good target for optimization.

That said, there is a different path forward (which I'm not saying is a good idea): define a very restricted Arrow.ArrowTimeZone type and and make it easy to convert to ZonedDateTime when TimeZones is loaded. If you don't need to manipulate the columns with timezones (or have none), then you can skip a dependency, while still being able to deserialize a file.

palday avatar Aug 16 '23 15:08 palday

Ugh, we just really need a cleaned up/nice TimeZones.jl package where people don't feel like they need to work-around/avoid it. That's possibly the problem of the Dates stdlib making it easier for TimeZones.jl to integrate cleanly.

quinnj avatar Aug 16 '23 16:08 quinnj

Perhaps this is no longer needed when these TimeZones load time improvements are merged: https://github.com/JuliaTime/TimeZones.jl/pull/457.

visr avatar Apr 26 '24 21:04 visr

With the release of TimeZones.jl 1.16 I decide to re-run the timings

Julia 1.9.4 with TimeZones 1.15:

(Arrow) pkg> st
Project Arrow v2.7.2
Status `~/.julia/dev/Arrow/Project.toml`
  [31f734f8] ArrowTypes v2.3.0 `src/ArrowTypes`
  [c3b6d118] BitIntegers v0.3.1
  [5ba52731] CodecLz4 v0.4.3
  [6b39b394] CodecZstd v0.8.2
  [f0e56b4a] ConcurrentUtilities v2.4.1
  [9a962f9c] DataAPI v1.16.0
  [4e289a0a] EnumX v1.0.4
  [e6f89c97] LoggingExtras v1.0.3
  [2dfb63ee] PooledArrays v1.4.3
  [91c51154] SentinelArrays v1.4.3
  [bd369af6] Tables v1.11.1
⌃ [f269a46b] TimeZones v1.15.0
  [3bb67fe8] TranscodingStreams v0.10.8
  [ade2ca70] Dates
  [a63ad114] Mmap
  [cf7118a7] UUIDs
Info Packages marked with ⌃ have new versions available and may be upgradable.

julia> @time @time_imports using Arrow
      1.6 ms  LoggingExtras
      0.8 ms  DataAPI
      0.4 ms  DataValueInterfaces
      0.4 ms  IteratorInterfaceExtensions
      0.3 ms  TableTraits
      4.2 ms  OrderedCollections
     17.0 ms  Tables
     17.1 ms  SentinelArrays
     10.9 ms  PooledArrays
      7.7 ms  Preferences
      0.5 ms  JLLWrappers
     27.5 ms  Lz4_jll 93.79% compilation time
      1.4 ms  TranscodingStreams
      0.4 ms  TranscodingStreams → TestExt
      3.9 ms  CodecLz4
      0.8 ms  Zstd_jll
      2.6 ms  CodecZstd
      0.4 ms  Scratch
      0.3 ms  PrecompileTools
     16.6 ms  Parsers
      4.3 ms  InlineStrings
      0.4 ms  TZJData
      0.7 ms  Compat
      0.4 ms  Compat → CompatLinearAlgebraExt
      0.5 ms  ExprTools
      0.8 ms  Mocking
    253.5 ms  TimeZones 55.54% compilation time (79% recompilation)
     10.9 ms  BitIntegers
      1.8 ms  ConcurrentUtilities
      0.5 ms  EnumX
      2.8 ms  ArrowTypes
    152.1 ms  Arrow
  0.621405 seconds (1.35 M allocations: 83.477 MiB, 4.91% gc time, 31.84% compilation time: 56% of which was recompilation)

Julia 1.9.4 and TimeZones 1.16:

(Arrow) pkg> st
Project Arrow v2.7.2
Status `~/.julia/dev/Arrow/Project.toml`
  [31f734f8] ArrowTypes v2.3.0 `src/ArrowTypes`
  [c3b6d118] BitIntegers v0.3.1
  [5ba52731] CodecLz4 v0.4.3
  [6b39b394] CodecZstd v0.8.2
  [f0e56b4a] ConcurrentUtilities v2.4.1
  [9a962f9c] DataAPI v1.16.0
  [4e289a0a] EnumX v1.0.4
  [e6f89c97] LoggingExtras v1.0.3
  [2dfb63ee] PooledArrays v1.4.3
  [91c51154] SentinelArrays v1.4.3
  [bd369af6] Tables v1.11.1
  [f269a46b] TimeZones v1.16.0
  [3bb67fe8] TranscodingStreams v0.10.8
  [ade2ca70] Dates
  [a63ad114] Mmap
  [cf7118a7] UUIDs

julia> @time @time_imports using Arrow
      1.7 ms  LoggingExtras
      0.7 ms  DataAPI
      0.4 ms  DataValueInterfaces
      0.3 ms  IteratorInterfaceExtensions
      0.3 ms  TableTraits
      4.3 ms  OrderedCollections
     16.6 ms  Tables
     17.3 ms  SentinelArrays
     11.0 ms  PooledArrays
      7.4 ms  Preferences
      0.4 ms  JLLWrappers
     27.8 ms  Lz4_jll 92.37% compilation time
      1.4 ms  TranscodingStreams
      0.4 ms  TranscodingStreams → TestExt
      3.6 ms  CodecLz4
      0.8 ms  Zstd_jll
      2.7 ms  CodecZstd
      0.4 ms  Scratch
      0.3 ms  PrecompileTools
     16.2 ms  Parsers
      4.3 ms  InlineStrings
      0.5 ms  TZJData
      0.8 ms  Compat
      1.8 ms  Compat → CompatLinearAlgebraExt
      0.5 ms  ExprTools
      1.1 ms  Mocking
     28.9 ms  TimeZones
     25.9 ms  BitIntegers
      1.7 ms  ConcurrentUtilities
      0.6 ms  EnumX
      2.9 ms  ArrowTypes
    259.1 ms  Arrow
  0.523419 seconds (574.55 k allocations: 40.298 MiB, 2.85% gc time, 11.04% compilation time)

The increase in import time for Arrow is most likely to is due to having Arrow.jl code triggering the loading of time zone data.

omus avatar May 21 '24 20:05 omus

It looks like the use of @tz_str is the issue at the moment. I updated to use TimeZone(...) and got this:

julia> @time @time_imports using Arrow
...
     25.9 ms  TimeZones
...
      2.8 ms  ArrowTypes
     19.3 ms  Arrow
  0.284661 seconds (574.62 k allocations: 41.319 MiB, 14.86% gc time, 20.92% compilation time)

I'll attempt to address this in TimeZones.jl

omus avatar May 21 '24 20:05 omus