Missing type gets lost when writing partitions of DataFrame
This is an odd one and likely to be a PICNIC...
Problem: Missingness in a string column is lost after saving/loading an Arrow file
When it happens: When a column in my dataset has type Union{Missing,String}, I partition it, and the missing item appears only in the later partitions. It's easily reproducible (see below).
Debugging:
- It happens only to DataFrames (not Tables.rowtable when created from a namedtuple)
- Only when partitioned as Iterators.partition(Tables.rows(df), 2). If partitioned as Iterators.partition(df, 2) (available from version >1.5.0), it is fine
- If the missing type appears in the first partition, it's fine
- Validity bitmap is written correctly
- But field is marked as not-nullable (!)
┌ Debug: building field: name = x1, nullable = false, T = String, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
--- in correct cases, this appears
┌ Debug: building field: name = x1, nullable = true, T = Union{Missing, String}, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
MWE
using Arrow, Tables, Random, DataFramesMeta
using Logging
debuglogger = ConsoleLogger(stderr, Logging.Debug)
# Create dataset
fn = "test_types.arrow"
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
# Works okay
Arrow.write(fn, df; compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}
# Works okay
Arrow.write(fn, Iterators.partition(df,2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}:
# broken -- missingness is lost
Arrow.write(fn, Iterators.partition(Tables.rows(df), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{String, Arrow.List{String, Int32, Vector{UInt8}}}
# Works okay with Tables
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
Arrow.write(fn, Iterators.partition(Tables.rows(t), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}
Versioninfo:
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 6 on 6 virtual cores
Arrow: 2.4.3 on main branch
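Until the schema handling is sorted out, one possible way to sidestep the problem is to partition the row indices instead of the rows and hand Arrow a view of the DataFrame per chunk. This is only a sketch: it assumes that a SubDataFrame reports the parent's column eltypes (so the first partition already carries Union{Missing, String}), and the names `row_chunks` and `parts` are made up for illustration.

```julia
using Arrow, DataFrames, Tables

df = DataFrame(x1 = ["a", "b", missing, "c"], x2 = 1:4)
fn = tempname() * ".arrow"

# Partition the row indices rather than the rows themselves, then give
# Arrow one view (SubDataFrame) per chunk. Assumption: each view keeps
# the column eltypes of the parent DataFrame.
row_chunks = Iterators.partition(1:nrow(df), 2)
parts = Tables.partitioner(rows -> view(df, rows, :), row_chunks)

Arrow.write(fn, parts; compress = nothing)
t = Arrow.Table(fn)
eltype(t[:x1])  # check whether Missing is still part of the eltype
```

Tables.partitioner applies the function to each chunk lazily, so the views are only created while Arrow iterates the partitions.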
I think I know where it's coming from.
The issue happens here
- Only the first partition is scanned to determine the schema
- Unfortunately, the partition of DataFrameRows loses the parent schema when pushed through Tables.columns
- It does however keep the reference to the parent (and its schema)
In other words, we do partition |> Tables.columns |> Tables.schema, which loses the missingness.
I don't know enough about the Tables API/contract to know whether this is an Arrow problem, Tables problem, or DataFrames problem. Does this issue belong somewhere else?
It would be an easy fix to get schema info from the parent object, but are all Tables-compatible sources required to keep that?
Eg,
- change from partition |> Tables.columns |> Tables.schema
- to partition |> Tables.columns |> Base.Fix2(getfield,:parent) |> Tables.schema
Should I open a PR?
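A rough sketch of what a parent-aware schema lookup could look like, to make the idea concrete. `partition_schema` is a hypothetical helper, not something that exists in Arrow.jl or Tables.jl. It assumes the partition is a SubArray whose parent may know the full schema, and it falls back to the current behaviour when the parent has none, since sources may not be required to keep one.

```julia
using Tables

# Hypothetical helper, not part of Arrow.jl or Tables.jl.
# Prefer the schema of the partition's parent; fall back to inferring
# the schema from the partition itself.
function partition_schema(part)
    if part isa SubArray
        sch = Tables.schema(parent(part))
        # Tables.schema returns `nothing` when the schema is unknown
        sch === nothing || return sch
    end
    # Fallback: infer from the columns of this partition only
    return Tables.schema(Tables.columns(part))
end

t = Tables.rowtable((; x1 = ["a", "b", missing, "c"], x2 = 1:4))
for part in Iterators.partition(Tables.rows(t), 2)
    @info "Schema: $(partition_schema(part))"
end
```

The explicit fallback means a source without a usable parent schema behaves exactly as it does now, so the change would only affect partitions that are views of a typed row source.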
Illustration
# correct when working with Tables object
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
for part in Iterators.partition(Tables.rows(t), 2)
@info "Parent type: $(part.parent|>Tables.schema)"
@info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end
┌ Info: Parent type: Tables.Schema:
│ :x1 Union{Missing, String}
└ :x2 Int64
┌ Info: Columns type: Tables.Schema:
│ :x1 Union{Missing, String}
└ :x2 Int64
┌ Info: Parent type: Tables.Schema:
│ :x1 Union{Missing, String}
└ :x2 Int64
┌ Info: Columns type: Tables.Schema:
│ :x1 Union{Missing, String}
└ :x2 Int64
# incorrect when working with DataFrame
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
for part in Iterators.partition(Tables.rows(df), 2)
@info "Parent type: $(part.parent|>Tables.schema)"
@info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end
┌ Info: Parent type: Tables.Schema:
│ :x1 Union{Missing, String}
└ :x2 Int64
┌ Info: Columns type: Tables.Schema:
│ :x1 String
└ :x2 Int64
┌ Info: Parent type: Tables.Schema:
│ :x1 Union{Missing, String}
└ :x2 Int64
┌ Info: Columns type: Tables.Schema:
│ :x1 Union{Missing, String}
└ :x2 Int64
EDIT: I suspect this will affect other partitioners that rely on Iterators over Tables.rows(), eg, TableOperations.makepartitions()
At the moment, a similar thing is blocking #477.