Feather.jl
What are the constraints on the column types of a `DataFrame` for `Feather.write` to work?
I have a DataFrame as follows:
julia> test_data
15×3 DataFrame
│ Row │ PetalLength │ PetalWidth │ Species │
│ │ Float64 │ Float64 │ Cat… │
├─────┼─────────────┼────────────┼────────────┤
│ 1 │ 1.6 │ 0.2 │ setosa │
│ 2 │ 1.7 │ 0.3 │ setosa │
│ 3 │ 1.6 │ 0.2 │ setosa │
│ 4 │ 1.5 │ 0.1 │ setosa │
│ 5 │ 1.4 │ 0.2 │ setosa │
│ 6 │ 1.3 │ 0.2 │ setosa │
│ 7 │ 1.5 │ 0.2 │ setosa │
│ 8 │ 4.5 │ 1.5 │ versicolor │
│ 9 │ 4.9 │ 1.5 │ versicolor │
│ 10 │ 4.4 │ 1.2 │ versicolor │
│ 11 │ 5.9 │ 2.1 │ virginica │
│ 12 │ 5.1 │ 2.0 │ virginica │
│ 13 │ 6.0 │ 1.8 │ virginica │
│ 14 │ 5.6 │ 2.4 │ virginica │
│ 15 │ 5.2 │ 2.3 │ virginica │
where the element type of :Species is CategoricalValue{String,UInt8}.
When I try to store it in Feather format, an error occurs:
julia> Feather.write("test_data.feather",test_data)
ERROR: type CategoricalPool has no field index
Stacktrace:
[1] getproperty(::CategoricalPool{String,UInt8,CategoricalValue{String,UInt8}}, ::Symbol) at .\Base.jl:33
[2] getlevels(::CategoricalArray{String,1,UInt8,String,CategoricalValue{String,UInt8},Union{}}) at C:\Users\dongjx\.julia\packages\Arrow\q3tEJ\src\dictencoding.jl:167
[3] Arrow.DictEncoding(::CategoricalArray{String,1,UInt8,String,CategoricalValue{String,UInt8},Union{}}) at C:\Users\dongjx\.julia\packages\Arrow\q3tEJ\src\dictencoding.jl:68
[4] arrowformat(::CategoricalArray{String,1,UInt8,String,CategoricalValue{String,UInt8},Union{}}) at C:\Users\dongjx\.julia\packages\Arrow\q3tEJ\src\arrowvectors.jl:242
[5] getarrow(::CategoricalArray{String,1,UInt8,String,CategoricalValue{String,UInt8},Union{}}) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:37
[6] write(::IOStream, ::DataFrame; description::String, metadata::String) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:18
[7] #20 at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:32 [inlined]
[8] open(::Feather.var"#20#21"{String,String,DataFrame}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at .\io.jl:298
[9] open at .\io.jl:296 [inlined]
[10] #write#19 at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:31 [inlined]
[11] write(::String, ::DataFrame) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:31
[12] top-level scope at REPL[36]:1
So I convert :Species to a Vector{Union{String,UInt8}}:
test_data[!,:Species]=convert(Vector{Union{String,UInt8}},test_data[!,:Species])
and try to store it again, resulting in another error:
julia> Feather.write("test_data.feather",test_data)
ERROR: ArgumentError: cannot reinterpret `Union{UInt8, String}` `UInt8`, type `Union{UInt8, String}` is not a bits type
Stacktrace:
[1] (::Base.var"#throwbits#203")(::Type{Union{UInt8, String}}, ::Type{UInt8}, ::Type{Union{UInt8, String}}) at .\reinterpretarray.jl:16
[2] reinterpret(::Type{UInt8}, ::Array{Union{UInt8, String},1}) at .\reinterpretarray.jl:34
[3] Arrow.Primitive(::Array{Union{UInt8, String},1}) at C:\Users\dongjx\.julia\packages\Arrow\q3tEJ\src\primitives.jl:48
[4] arrowformat(::Array{Union{UInt8, String},1}) at C:\Users\dongjx\.julia\packages\Arrow\q3tEJ\src\arrowvectors.jl:242
[5] getarrow(::Array{Union{UInt8, String},1}) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:37
[6] write(::IOStream, ::DataFrame; description::String, metadata::String) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:18
[7] #20 at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:32 [inlined]
[8] open(::Feather.var"#20#21"{String,String,DataFrame}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at .\io.jl:298
[9] open at .\io.jl:296 [inlined]
[10] #write#19 at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:31 [inlined]
[11] write(::String, ::DataFrame) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:31
[12] top-level scope at REPL[40]:1
So I try converting the column to a plain Vector{String}:
julia> test_data[!,:Species]=convert(Vector{String},test_data[!,:Species])
and try again:
julia> Feather.write("test_data.feather",test_data)
"test_data.feather"
And it works!
But I still have a question. Here, test_data comes from RDatasets.jl and is simple enough that :Species can be converted to an Array of String. But what if my data type is more complex and I can't do this conversion? I have now seen two cases in which a DataFrame cannot be written to a .feather file. So what are the general constraints on the column types of a DataFrame for Feather.write to work?
Thanks in advance.
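The second stack trace above shows Arrow.Primitive calling reinterpret and failing because Union{UInt8, String} is not a bits type. A rough way to spot such columns before writing is to inspect each column's element type. The sketch below uses only Base functions (nonmissingtype, isbitstype); column_report and its labels are hypothetical names for illustration, and the set of types Feather.write really accepts is defined by Feather.jl and Arrow.jl, not by this check.

```julia
# Hypothetical helper for illustration; it is not part of Feather.jl or Arrow.jl.
# It labels each column by its element type, ignoring `Missing`.
function column_report(columns)
    report = Pair{Symbol,String}[]
    for (name, col) in pairs(columns)
        T = nonmissingtype(eltype(col))
        status = if isbitstype(T)
            "bits type"
        elseif T <: AbstractString
            "string"
        else
            # Assumption: anything else may need converting before writing.
            "check or convert"
        end
        push!(report, name => status)
    end
    return report
end

# A NamedTuple of vectors stands in for the DataFrame columns here.
columns = (
    PetalLength = [1.6, 1.7, 4.5],
    Species     = ["setosa", "setosa", "versicolor"],
    Mixed       = Union{String,UInt8}["setosa", 0x01, "virginica"],
)

for (name, status) in column_report(columns)
    println(name, ": ", status)
end
```

The Mixed column mirrors the Vector{Union{String,UInt8}} from the failed attempt and falls into the last branch, while the plain String column that was written successfully does not.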
It appears that this issue arises because the Arrow package needs to be updated for changes in CategoricalArrays. I think the Arrow structure for a CategoricalArray or a PooledArray should be DictEncoding. The code in Arrow is trying to use getlevels to get the levels of the CategoricalArray, whereas now, according to DataAPI, I think it should use levels. @ExpandingMan Should this issue be transferred to the Arrow package?
It appears as if https://github.com/ExpandingMan/Arrow.jl/pull/52 already addresses this issue.
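For readers unfamiliar with the term, dictionary encoding stores a column as a short list of distinct levels plus an integer code per row. The following is a minimal Base-only sketch of that idea. The names dict_encode and dict_decode are hypothetical, the zero-based Int32 codes are an assumption made for illustration, and none of this is Arrow.jl's DictEncoding implementation.

```julia
# Hypothetical helpers illustrating dictionary encoding; not Arrow.jl code.
function dict_encode(values::AbstractVector)
    lvls = sort(unique(values))          # distinct values, in sorted order
    # indexin gives the 1-based position of each value in lvls;
    # subtracting 1 makes the codes zero-based (an assumption for this sketch).
    codes = Int32.(indexin(values, lvls)) .- Int32(1)
    return lvls, codes
end

# Rebuild the original values from levels and zero-based codes.
dict_decode(lvls, codes) = lvls[codes .+ 1]

species = ["setosa", "versicolor", "setosa", "virginica"]
lvls, codes = dict_encode(species)
println(lvls)
println(codes)
println(dict_decode(lvls, codes) == species)
```

A CategoricalArray already holds its levels and codes, which is why a writer only needs a stable way to ask for the levels, such as the levels function mentioned above.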