NTuple with custom type and compression

Open poncito opened this issue 3 years ago • 2 comments

Hello,

I have a custom type defined this way:

struct Char8 <: AbstractChar
    x::UInt8
end
Char8(x::Integer) = Char8(UInt8(x))
Base.codepoint(c::Char8) = UInt32(c.x)

and serialized this way:

ArrowTypes.ArrowKind(::Type{Char8}) = ArrowTypes.PrimitiveKind()
ArrowTypes.ArrowType(::Type{Char8}) = UInt8
const CHAR8 = Symbol("JuliaLang.Char8")
ArrowTypes.arrowname(::Type{Char8}) = CHAR8
ArrowTypes.toarrow(x::Char8) = x.x
ArrowTypes.fromarrow(::Type{Char8}, x::UInt8) = Char8(x)
ArrowTypes.JuliaType(::Val{CHAR8}) = Char8

The following throws:

a = [(Char8(1), Char8(2))]
table = (col1 = a,)
io = IOBuffer()
Arrow.write(io, table; compress = :zstd)

but only when the compression is enabled. Is that expected?

I also noticed that the ArrowType for the tuple seems wrong, because it falls back to the identity function. So why not set the following default: ArrowTypes.ArrowType(::Type{NTuple{N, T}}) where {N, T} = NTuple{N, ArrowTypes.ArrowType(T)} ? This line solves the issue in my case.
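To make the idea concrete without depending on Arrow.jl internals, here is a minimal sketch in plain Julia. The function arrowtype is a hypothetical stand-in for ArrowTypes.ArrowType, not the library's own definition; it only illustrates how an identity fallback leaves a tuple of Char8 unmapped unless a tuple-aware method recurses into the element type.

```julia
struct Char8 <: AbstractChar
    x::UInt8
end

# Hypothetical stand-in for ArrowTypes.ArrowType (an assumption for illustration).
arrowtype(::Type{T}) where {T} = T        # identity fallback
arrowtype(::Type{Char8}) = UInt8          # custom element mapping

# With only the two methods above, the tuple type maps to itself:
before = arrowtype(Tuple{Char8, Char8})   # Tuple{Char8, Char8}

# The proposed default recurses into the element type:
arrowtype(::Type{NTuple{N, T}}) where {N, T} = NTuple{N, arrowtype(T)}

after = arrowtype(Tuple{Char8, Char8})    # Tuple{UInt8, UInt8}
```

Whether this matches what ArrowTypes actually does for tuples depends on the installed version, so treat it as an illustration of the proposal rather than a description of the library.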

Thanks,

poncito, Jan 31 '22 19:01

Sorry for the slow response; thanks for the report. Could you post the exact error you're seeing? Could you also explain what exactly you mean by

I also noticed that the ArrowType seems wrong, because it calls the identity function

I'm not sure of the context for the definition you provided that solves your issue.

quinnj, Feb 11 '22 22:02

Hi Jacob,

Sorry for the delay! This code reproduces the error:

using Arrow

struct Char8 <: AbstractChar
    x::UInt8
end
Char8(x::Integer) = Char8(UInt8(x))
Base.codepoint(c::Char8) = UInt32(c.x)

ArrowTypes.ArrowKind(::Type{Char8}) = ArrowTypes.PrimitiveKind()
ArrowTypes.ArrowType(::Type{Char8}) = UInt8
ArrowTypes.arrowname(::Type{Char8}) = Symbol("JuliaLang.Char8")
ArrowTypes.toarrow(x::Char8) = x.x

table = (col1 = [(Char8(1), Char8(2))],)
Arrow.write(IOBuffer(), table; compress = :zstd)

The stack trace is:

ERROR: LoadError: MethodError: Cannot `convert` an object of type Arrow.Compressed{Arrow.Flatbuf.CompressionTypeModule.ZSTD, Arrow.Primitive{UInt8, ArrowTypes.ToArrow{UInt8, Arrow.ToFixedSizeList{Char8, 2, Vector{Tuple{Char8, Char8}}}}}} to an object of type Arrow.CompressedBuffer
Closest candidates are:
  convert(::Type{T}, ::T) where T at essentials.jl:205
  Arrow.CompressedBuffer(::Any, ::Any) at /home/romain/.julia/packages/Arrow/x6smw/src/arraytypes/compressed.jl:18
Stacktrace:
  [1] push!(a::Vector{Arrow.CompressedBuffer}, item::Arrow.Compressed{Arrow.Flatbuf.CompressionTypeModule.ZSTD, Arrow.Primitive{UInt8, ArrowTypes.ToArrow{UInt8, Arrow.ToFixedSizeList{Char8, 2, Vector{Tuple{Char8, Char8}}}}}})
    @ Base ./array.jl:928
  [2] compress(Z::Arrow.Flatbuf.CompressionTypeModule.CompressionType, comp::CodecZstd.ZstdCompressor, x::Arrow.FixedSizeList{Tuple{UInt8, UInt8}, Arrow.Primitive{UInt8, ArrowTypes.ToArrow{UInt8, Arrow.ToFixedSizeList{Char8, 2, Vector{Tuple{Char8, Char8}}}}}})
    @ Arrow ~/.julia/packages/Arrow/x6smw/src/arraytypes/fixedsizelist.jl:131
  [3] toarrowvector(x::Vector{Tuple{Char8, Char8}}, i::Int64, de::Dict{Int64, Any}, ded::Vector{Arrow.DictEncoding}, meta::Nothing; compression::Vector{CodecZstd.ZstdCompressor}, kw::Base.Iterators.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:largelists, :denseunions, :dictencode, :dictencodenested, :maxdepth), Tuple{Bool, Bool, Bool, Bool, Int64}}})
    @ Arrow ~/.julia/packages/Arrow/x6smw/src/arraytypes/arraytypes.jl:44
  [4] (::Arrow.var"#113#114"{Dict{Int64, Any}, Bool, Vector{CodecZstd.ZstdCompressor}, Bool, Bool, Bool, Int64, Nothing, Vector{Arrow.DictEncoding}, Vector{Type}, Vector{Any}})(col::Vector{Tuple{Char8, Char8}}, i::Int64, nm::Symbol)
    @ Arrow ~/.julia/packages/Arrow/x6smw/src/write.jl:216
  [5] eachcolumn
    @ ~/.julia/packages/Tables/OWzlh/src/utils.jl:70 [inlined]
  [6] toarrowtable(cols::NamedTuple{(:col1,), Tuple{Vector{Tuple{Char8, Char8}}}}, dictencodings::Dict{Int64, Any}, largelists::Bool, compress::Vector{CodecZstd.ZstdCompressor}, denseunions::Bool, dictencode::Bool, dictencodenested::Bool, maxdepth::Int64, meta::Nothing, colmeta::Nothing)
    @ Arrow ~/.julia/packages/Arrow/x6smw/src/write.jl:213
  [7] macro expansion
    @ ~/.julia/packages/Arrow/x6smw/src/write.jl:109 [inlined]
  [8] macro expansion
    @ ./task.jl:387 [inlined]
  [9] write(io::IOBuffer, source::NamedTuple{(:col1,), Tuple{Vector{Tuple{Char8, Char8}}}}, writetofile::Bool, largelists::Bool, compress::Symbol, denseunions::Bool, dictencode::Bool, dictencodenested::Bool, alignment::Int64, maxdepth::Int64, ntasks::Float64, meta::Nothing, colmeta::Nothing)
    @ Arrow ~/.julia/packages/Arrow/x6smw/src/write.jl:101
 [10] #write#102
    @ ~/.julia/packages/Arrow/x6smw/src/write.jl:64 [inlined]
 [11] top-level scope
    @ Untitled-3:15
in expression starting at Untitled-3:15

What I understand from the stack trace is that compress for a primitive returns a Compressed, so we cannot push it directly into buffers here, which expects elements of type CompressedBuffer. I believe removing this branch makes it work.
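The mismatch can be reproduced in isolation with simplified stand-in types. The two structs below are assumptions that only mimic the shape of Arrow.CompressedBuffer and Arrow.Compressed (the real definitions have more fields and type parameters); the point is that push! onto a Vector{CompressedBuffer} tries to convert its argument and fails for a Compressed.

```julia
# Simplified stand-ins, not Arrow.jl's actual definitions.
struct CompressedBuffer
    data::Vector{UInt8}
    uncompressedlength::Int64
end

struct Compressed
    buffers::Vector{CompressedBuffer}
    children::Vector{Compressed}
end

# What compressing a primitive child would hand back in this sketch.
leaf = Compressed([CompressedBuffer(UInt8[0x01, 0x02], 2)], Compressed[])

buffers = CompressedBuffer[]
children = Compressed[]

# Pushing the Compressed into `buffers` needs convert(CompressedBuffer, leaf),
# which has no method, so a MethodError is thrown.
err = try
    push!(buffers, leaf)
    nothing
catch e
    e
end

# Pushing it into `children` matches the element type and succeeds.
push!(children, leaf)
```

In this sketch the child's result only fits in children, which is the same shape of failure as frame [1] of the trace.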

Out of curiosity, and for my own understanding: what is the point of that last branch? Why push into Compressed.buffers rather than Compressed.children?

poncito, Mar 03 '22 11:03