arrow-julia icon indicating copy to clipboard operation
arrow-julia copied to clipboard

Serializing `Vector{Array}` flattens elements

Open kleinschmidt opened this issue 4 years ago • 5 comments

julia> tab2 = (x = [rand(2,2) for _ in 1:10], );

julia> io2 = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> Arrow.write(io2, tab2)
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=736, maxsize=Inf, ptr=737, mark=-1)

julia> seekstart(io2)
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=736, maxsize=Inf, ptr=1, mark=-1)

julia> tab2_in = Arrow.Table(io2)
Arrow.Table: (x = [[0.0854794818573692, 0.6156692467610922, 0.4816953048461168, 0.10732732583276494], [0.2699150559983352, 0.7190134247010207, 0.4620847068434688, 0.8086969500721466], [0.32599895190448236, 0.5913226835708243, 0.914412039189433, 0.04825070891100114], [0.19800356830304522, 0.30969840980302754, 0.7586621051179192, 0.8383737966971678], [0.46353328096692525, 0.7946259468169095, 0.9205855777866754, 0.30143398878132244], [0.7980760486814706, 0.3204359773634282, 0.5253557751574975, 0.8140218988518082], [0.9662048857122691, 0.17369565835410516, 0.18328845062159216, 0.5595773503180002], [0.5423562251958653, 0.6285041382249899, 0.31820822802763304, 0.42185178416765723], [0.6985239393699472, 0.8842247274247905, 0.5466899110422125, 0.2058087506887174], [0.352616487112392, 0.6636744200136235, 0.9140196029380305, 0.2594771305442407]],)

julia> tab2_in.x
10-element Arrow.List{Vector{Float64}, Int32, Arrow.Primitive{Float64, Vector{Float64}}}:
 [0.0854794818573692, 0.6156692467610922, 0.4816953048461168, 0.10732732583276494]
 [0.2699150559983352, 0.7190134247010207, 0.4620847068434688, 0.8086969500721466]
 [0.32599895190448236, 0.5913226835708243, 0.914412039189433, 0.04825070891100114]
 [0.19800356830304522, 0.30969840980302754, 0.7586621051179192, 0.8383737966971678]
 [0.46353328096692525, 0.7946259468169095, 0.9205855777866754, 0.30143398878132244]
 [0.7980760486814706, 0.3204359773634282, 0.5253557751574975, 0.8140218988518082]
 [0.9662048857122691, 0.17369565835410516, 0.18328845062159216, 0.5595773503180002]
 [0.5423562251958653, 0.6285041382249899, 0.31820822802763304, 0.42185178416765723]
 [0.6985239393699472, 0.8842247274247905, 0.5466899110422125, 0.2058087506887174]
 [0.352616487112392, 0.6636744200136235, 0.9140196029380305, 0.2594771305442407]

At first I thought this was just some cleverness with how Arrow was serializing a "column matrix" (e.g. a n-by-1 matrix). I guess it's probably not possible to store the metadata about the size of the elements if each element can potentially have its own size? Maybe using StaticArrays is called for?

kleinschmidt avatar Feb 05 '21 19:02 kleinschmidt

Oof, yeah, this is tricky. Sorry for the delay in responding btw. We've had the same issue w/ JSON3 serializing; it's just tricky to know what kind of Matrix you want to deserialize. I have some ideas on introducing a Arrow.lower and Arrow.construct that would allow custom types to "hook" into the serialize/deserialize process; that would enable you do wrap a Matrix in a custom type and then on deserialize, you could reinterpret to the size you want, assuming you know the sizes.

quinnj avatar Mar 06 '21 07:03 quinnj

Though perhaps the actual right way to support Matrix is do implement the Tensor serialization: https://arrow.apache.org/docs/format/Other.html

quinnj avatar Mar 06 '21 07:03 quinnj

I ran into this with LegolasFlux; there I worked around it with a manual "FlatArray" type: https://github.com/beacon-biosignals/LegolasFlux.jl/blob/80569ab63a8248a8a063c76e0bbf701f4ada9bd4/src/LegolasFlux.jl#L15-L36

This is also coming up in Lighthouse.jl: https://github.com/beacon-biosignals/Lighthouse.jl/pull/45/files#diff-790bf7705c3c1ba479ac3f9931bc82ec801c90f412d218ab4f9f88020fc128beR1-R16

ericphanson avatar Mar 17 '22 21:03 ericphanson

@ericphanson, what do you think, should we do a custom serialization for AbstractArray{T, N} where N > 1? We could serialize N and what size returns to be able to correctly deserialize?

The alternative, as I mentioned above, would be to try and use the official tensor message type.

Pros of tensor message:

  • official arrow way of doing tensors Cons of tensor:
  • Not all implementations support the tensor message

Pros of using extension type (metadata):

  • Pretty trivial implementation (basically what @ericphanson linked to above w/ the FlatArray type) Cons:
  • Would make non-vector AbstractArrays Julia-specific w/ their serialization

We could probably still support the tensor message type later if we go w/ the extension type method, but it would be an explicit thing where you wrap your matrix/array in a Arrow.Tensor and then write that out.

quinnj avatar Jun 14 '23 21:06 quinnj

Hm, I think wrapping is inconvenient when you might have a nested structure you would need to traverse to wrap each leaf. Interop with pyarrow would be nice though, especially if one could do zero-copy IPC to numpy arrays or similar. I know @omus is looking into interop with Ray so maybe he has an opinion about what would work best for that.

ericphanson avatar Jun 14 '23 22:06 ericphanson