arrow-julia
arrow-julia copied to clipboard
Serializing `Vector{Array}` flattens elements
julia> tab2 = (x = [rand(2,2) for _ in 1:10], );
julia> io2 = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)
julia> Arrow.write(io2, tab2)
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=736, maxsize=Inf, ptr=737, mark=-1)
julia> seekstart(io2)
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=736, maxsize=Inf, ptr=1, mark=-1)
julia> tab2_in = Arrow.Table(io2)
Arrow.Table: (x = [[0.0854794818573692, 0.6156692467610922, 0.4816953048461168, 0.10732732583276494], [0.2699150559983352, 0.7190134247010207, 0.4620847068434688, 0.8086969500721466], [0.32599895190448236, 0.5913226835708243, 0.914412039189433, 0.04825070891100114], [0.19800356830304522, 0.30969840980302754, 0.7586621051179192, 0.8383737966971678], [0.46353328096692525, 0.7946259468169095, 0.9205855777866754, 0.30143398878132244], [0.7980760486814706, 0.3204359773634282, 0.5253557751574975, 0.8140218988518082], [0.9662048857122691, 0.17369565835410516, 0.18328845062159216, 0.5595773503180002], [0.5423562251958653, 0.6285041382249899, 0.31820822802763304, 0.42185178416765723], [0.6985239393699472, 0.8842247274247905, 0.5466899110422125, 0.2058087506887174], [0.352616487112392, 0.6636744200136235, 0.9140196029380305, 0.2594771305442407]],)
julia> tab2_in.x
10-element Arrow.List{Vector{Float64}, Int32, Arrow.Primitive{Float64, Vector{Float64}}}:
[0.0854794818573692, 0.6156692467610922, 0.4816953048461168, 0.10732732583276494]
[0.2699150559983352, 0.7190134247010207, 0.4620847068434688, 0.8086969500721466]
[0.32599895190448236, 0.5913226835708243, 0.914412039189433, 0.04825070891100114]
[0.19800356830304522, 0.30969840980302754, 0.7586621051179192, 0.8383737966971678]
[0.46353328096692525, 0.7946259468169095, 0.9205855777866754, 0.30143398878132244]
[0.7980760486814706, 0.3204359773634282, 0.5253557751574975, 0.8140218988518082]
[0.9662048857122691, 0.17369565835410516, 0.18328845062159216, 0.5595773503180002]
[0.5423562251958653, 0.6285041382249899, 0.31820822802763304, 0.42185178416765723]
[0.6985239393699472, 0.8842247274247905, 0.5466899110422125, 0.2058087506887174]
[0.352616487112392, 0.6636744200136235, 0.9140196029380305, 0.2594771305442407]
At first I thought this was just some cleverness with how Arrow was serializing a "column matrix" (e.g. a n-by-1 matrix). I guess it's probably not possible to store the metadata about the size of the elements if each element can potentially have its own size? Maybe using StaticArrays is called for?
Oof, yeah, this is tricky. Sorry for the delay in responding btw. We've had the same issue w/ JSON3 serializing; it's just tricky to know what kind of Matrix you want to deserialize. I have some ideas on introducing a Arrow.lower and Arrow.construct that would allow custom types to "hook" into the serialize/deserialize process; that would enable you do wrap a Matrix in a custom type and then on deserialize, you could reinterpret to the size you want, assuming you know the sizes.
Though perhaps the actual right way to support Matrix is do implement the Tensor serialization: https://arrow.apache.org/docs/format/Other.html
I ran into this with LegolasFlux; there I worked around it with a manual "FlatArray" type: https://github.com/beacon-biosignals/LegolasFlux.jl/blob/80569ab63a8248a8a063c76e0bbf701f4ada9bd4/src/LegolasFlux.jl#L15-L36
This is also coming up in Lighthouse.jl: https://github.com/beacon-biosignals/Lighthouse.jl/pull/45/files#diff-790bf7705c3c1ba479ac3f9931bc82ec801c90f412d218ab4f9f88020fc128beR1-R16
@ericphanson, what do you think, should we do a custom serialization for AbstractArray{T, N} where N > 1? We could serialize N and what size returns to be able to correctly deserialize?
The alternative, as I mentioned above, would be to try and use the official tensor message type.
Pros of tensor message:
- official arrow way of doing tensors Cons of tensor:
- Not all implementations support the tensor message
Pros of using extension type (metadata):
- Pretty trivial implementation (basically what @ericphanson linked to above w/ the FlatArray type) Cons:
- Would make non-vector AbstractArrays Julia-specific w/ their serialization
We could probably still support the tensor message type later if we go w/ the extension type method, but it would be an explicit thing where you wrap your matrix/array in a Arrow.Tensor and then write that out.
Hm, I think wrapping is inconvenient when you might have a nested structure you would need to traverse to wrap each leaf. Interop with pyarrow would be nice though, especially if one could do zero-copy IPC to numpy arrays or similar. I know @omus is looking into interop with Ray so maybe he has an opinion about what would work best for that.