arrow-julia
Question on `Date` encoding
I'm trying to use Arrow to send data between a Julia (Arrow.jl) app and a Rust (Polars) app. However, when I write a table containing a Date column, Polars reads it as Extension("JuliaLang.Date", Date32, Some("")) and complains with:
Cannot create polars series from Extension("JuliaLang.Date", Date32, Some("")) type
I would have expected one of the following things would happen:
- When writing the Arrow file, the type is converted to a suitable Arrow type (e.g. Date32), disguising its origin.
- When reading the Arrow file, Polars would ignore the origin (JuliaLang.Date, which seems to just be a name) and recognize it as Date32 or whatever.
Instead, what seems to happen is that it is encoded as an extension type and Polars does not know what to do with it. Is this expected behavior?
I am noticing something similar. Using polars.read_ipc() works but polars.scan_ipc() with a list of arrow files as the argument fails on parsing Julia DateTime format.
I tried both methods in Python and neither worked. `read_ipc` fails with
Cannot create polars series from Extension("JuliaLang.Date", Date32, Some("")) type
and `scan_ipc` fails with
Arrow datatype Extension("JuliaLang.Date", Date32, Some("")) not supported by Polars
(paraphrasing my response from Julia Slack)
I guess the Julia implementation could be considered "overly aggressive" as a producer here by other Arrow consumers, since it encodes an extension for such a simple Julia type that maps 1:1 to an Arrow type. Alternatively, the consumer in this case could be considered "overly aggressive" in its attempt to resolve an extension that it's unaware of. Or maybe both 😁
IMO the latter interpretation (that the consumer should loosen its assumptions) probably makes more sense, unless there's guidance otherwise from core Arrow on how implementations should negotiate extension/metadata usage? A motivating example: if a producer writes out a column of structs and includes extension metadata to map the struct back into some application-layer type, I'd still want my extension-agnostic consumer to gracefully read the data as a column of "plain structs" instead of failing.
(it's worth checking that the Julia implementation here doesn't suffer from the same problem - e.g. gracefully consumes Arrow data with unknown extensions. that'd be a different issue though)
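To make the "extension-agnostic consumer" idea concrete, here is a minimal sketch in plain Python. It is a toy model, not how Polars or Arrow.jl is implemented. The two metadata keys are the ones the Arrow columnar format uses to mark a field as an extension type; the registry, the `example.uuid` entry, and the `resolve_type` function are hypothetical names made up for illustration.

```python
# Toy model of how an Arrow consumer might resolve a field's type.
# In the Arrow columnar format, an extension type is an ordinary storage type
# plus two custom-metadata keys on the field.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical registry of the extensions this consumer understands.
KNOWN_EXTENSIONS = {"example.uuid": "uuid"}


def resolve_type(storage_type, field_metadata):
    """Return the logical type to use for a field.

    Unknown extension names are ignored and the storage type is used,
    so the column stays readable.
    """
    name = field_metadata.get(EXT_NAME_KEY)
    if name is None:
        return storage_type
    return KNOWN_EXTENSIONS.get(name, storage_type)


julia_date_field = {EXT_NAME_KEY: "JuliaLang.Date", EXT_META_KEY: ""}
print(resolve_type("date32", julia_date_field))  # prints "date32"
```

Under this policy the `JuliaLang.Date` column would simply be read as a plain date32 column, which is the behavior hoped for in the original question.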
What is the rationale for using an extension in this case, given that there is a standard for dates in Arrow? Shouldn't Arrow.jl just convert Dates.Date to that, and disguise its origin, such that consumers do not get tripped up? Conversely, Arrow.jl should be able to read the Arrow date formats and convert them to the native Julia formats without relying on extension metadata.
I'd have to check the implementation more deeply but I assume it's just because Arrow Date <-> Julia Date translation reuses the same generic mechanisms/interface provided by ArrowTypes.jl instead of "special-casing" dates as a "built-in" type.
I'm assuming we could add such special casing pretty easily (might as well), but I'm not sure it solves the underlying problem here - that consumers probably should be able to gracefully read any arrow types they support and probably remain agnostic to extension metadata that isn't within a "namespace" they're aware of/explicitly support (would be interesting to have the notion of namespaced metadata be formalized if it isn't already). this guidance might be worth upstreaming somewhere for consumer implementations if other folks find it sensible
> Conversely, Arrow.jl should be able to read the Arrow date formats and convert them to the native Julia formats
yup, this should definitely already work (if it doesn't, please file a ticket!)
In terms of raw numbers, the Arrow Date (Date32) specification is the number of days since 1970-01-01 (the UNIX epoch), while a Julia Date seems to store the number of days since 0001-01-01. So the underlying values are indeed different. When I read from a table like this:
t=Arrow.Table("data/flat2.arrow", convert=false)
I do get Arrow.Date{Arrow.Flatbuf.DateUnitModule.DAY, Int32} as the datatype indeed.
However, there appears to be no similar convert flag when writing this same table with Arrow.write, as I still get the same error when reading it in Rust: Cannot create polars series from Extension("JuliaLang.Date", Date32, Some("")) type. I think there should be a way to serialize to the original Arrow format, without custom extension types, in this case.
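The difference between the two day counts is a fixed offset, which can be sketched with the Python standard library alone. This assumes that Julia's internal Date value counts 0001-01-01 as day 1, the same convention as Python's `date.toordinal()`; the two helper function names are hypothetical.

```python
from datetime import date

# Python's proleptic Gregorian ordinal counts 0001-01-01 as day 1, which is
# assumed here to match the integer stored inside a Julia Date.
UNIX_EPOCH_ORDINAL = date(1970, 1, 1).toordinal()  # 719163


def rata_die_to_date32(rata_die_days):
    """Convert a day count from 0001-01-01 (day 1) to days since 1970-01-01."""
    return rata_die_days - UNIX_EPOCH_ORDINAL


def date32_to_rata_die(days_since_epoch):
    """Convert days since 1970-01-01 back to a day count from 0001-01-01."""
    return days_since_epoch + UNIX_EPOCH_ORDINAL


d = date(2023, 2, 14)
print(rata_die_to_date32(d.toordinal()))  # prints 19402
```

So a writer that targets the plain Arrow date type only has to subtract this constant, and a reader only has to add it back.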
> yup, this should definitely already work (if it doesn't, please file a ticket!)
I believe that this works, but if it indeed does, using the extension when writing Dates from Julia seems superfluous.
We just did an experiment. When we write the following arrow file from Julia
using Arrow, Dates
# Convert each Julia Date to the raw Arrow date type before writing
arrow_dt = [convert(Arrow.Date{Arrow.Flatbuf.DateUnits.DAY, Int32}, Date(2022) + Day(i-1)) for i in 1:10]
t = (; dt = arrow_dt)
Arrow.write("tmp.arrow", t)
and then open it using polars, it is read just fine. Presumably this bypasses the normal encoding interface mentioned above.
I reviewed the Arrow.jl source code and it seems to be fine. I believe the issue is that polars does not know anything about extension types. Doing pl.read_ipc("tmp.arrow", use_pyarrow = True) in polars (Python) works fine.
> Presumably this bypasses the normal encoding interface mentioned above.
Indeed it does. Here is a slightly simplified conversion example, including one for DateTime:
using Arrow, Dates
convert(Arrow.DATE, today()) # -> Arrow.Date{Arrow.Flatbuf.DateUnits.DAY, Int32}(19402)
convert(Arrow.DATETIME, now()) # -> Arrow.Timestamp{Arrow.Flatbuf.TimeUnits.MILLISECOND, nothing}(1676386953375)