Missing type gets lost when writing partitions of DataFrame

Open svilupp opened this issue 2 years ago • 2 comments

This is an odd one and likely to be a PICNIC...

Problem: Missingness in a string column is lost after saving and reloading an Arrow file

When it happens: When a column in my dataset has type Union{Missing,String}, I write the dataset in partitions, and the missing values appear only in later partitions (none in the first). It's easily reproducible (see below).

Debugging:

  • It happens only to DataFrames (not Tables.rowtable when created from a namedtuple)
  • Only when partitioned as Iterators.partition(Tables.rows(df), 2). If partitioned as Iterators.partition(df,2) available from version >1.5.0, it is fine
  • If missing type appears in the first partition, it's fine
  • Validity bitmap is written correctly
  • But field is marked as not-nullable (!)

┌ Debug: building field: name = x1, nullable = false, T = String, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
--- in correct cases, this appears
┌ Debug: building field: name = x1, nullable = true, T = Union{Missing, String}, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486

MWE

using Arrow, Tables, Random, DataFramesMeta
using Logging
debuglogger = ConsoleLogger(stderr, Logging.Debug)

# Create dataset
fn = "test_types.arrow"
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame

# Works okay
Arrow.write(fn, df; compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}

# Works okay
Arrow.write(fn, Iterators.partition(df,2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}:

# broken -- missingness is lost
Arrow.write(fn, Iterators.partition(Tables.rows(df), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{String, Arrow.List{String, Int32, Vector{UInt8}}}

# Works okay with Tables
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
Arrow.write(fn, Iterators.partition(Tables.rows(t), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}

Versioninfo:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 6 on 6 virtual cores

Arrow: 2.4.3 on main branch

svilupp, Mar 12 '23 18:03

I think I know where it's coming from.

The issue happens here

  • Only the first partition is scanned to determine the schema
  • Unfortunately, the partition of DataFrameRows loses the parent schema when pushed through Tables.columns
  • It does however keep the reference to the parent (and its schema)

In other words, we do partition |> Tables.columns |> Tables.schema, which loses the missingness.
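
A package-free sketch of why a schema taken from the first partition alone can drop Missing. It assumes, purely for illustration, that each partition's column type is narrowed to the values actually present in it. Here `identity.(...)` stands in for that narrowing; it is not what Arrow or Tables call internally, and the names `col`, `chunks`, `narrowed`, `first_only`, and `merged` are made up for this example.

```julia
# A column whose declared element type allows missing values.
col = Union{Missing,String}["a", "b", missing, "c"]

# Two partitions of two rows each; the only `missing` lands in the second one.
chunks = [col[1:2], col[3:4]]

# Stand-in for per-partition type narrowing: broadcasting `identity`
# rebuilds each chunk with the tightest element type that fits its values.
narrowed = [identity.(c) for c in chunks]

# Schema inferred from the first partition only: Missing is gone.
first_only = eltype(narrowed[1])

# Schema inferred from all partitions: Missing is kept.
merged = Union{map(eltype, narrowed)...}

@info "Inferred column types" first_only merged
```

With these inputs, `first_only` is `String` while `merged` is `Union{Missing, String}`, which mirrors the `nullable = false` field seen in the debug output.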

I don't know enough about the Tables API/contract to know whether this is an Arrow problem, Tables problem, or DataFrames problem. Does this issue belong somewhere else?

It would be an easy fix to get schema info from the parent object, but are all Tables-compatible sources required to keep that?

Eg,

  • change from partition |> Tables.columns |> Tables.schema
  • to partition |> Tables.columns |> Base.Fix2(getfield,:parent) |> Tables.schema
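
A rough sketch of that idea, not a tested patch. `partition_schema` is a hypothetical helper name, and the sketch assumes each partition is a `SubArray` view whose parent still reports the full schema; sources that partition differently simply fall through to the current behaviour.

```julia
using Tables

# Hypothetical helper: prefer the schema of the source a partition was cut from.
function partition_schema(part)
    # Assumption: partitions are SubArray views; anything else is used as-is.
    src = part isa SubArray ? parent(part) : part
    sch = Tables.schema(src)
    # Fall back to the per-partition schema if the source reports none.
    return sch === nothing ? Tables.schema(Tables.columns(part)) : sch
end

# Row table whose first partition contains no missing values.
t = Tables.rowtable((; x1 = ["a", "b", missing, "c"], x2 = 1:4))
parts = collect(Iterators.partition(Tables.rows(t), 2))

sch = partition_schema(first(parts))
@info "Schema used for the first partition" sch
```

Whether every Tables-compatible source exposes a parent like this is exactly the open question, so the fallback branch matters as much as the main one.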

Should I open a PR?

Illustration

# correct when working with Tables object
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
for part in Iterators.partition(Tables.rows(t), 2)
    @info "Parent type: $(part.parent|>Tables.schema)"
    @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end

  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64

# incorrect when working with DataFrame
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
for part in Iterators.partition(Tables.rows(df), 2)
    @info "Parent type: $(part.parent|>Tables.schema)"
    @info "Columns type: $(Tables.columns(part)|>Tables.schema)"
end

  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  String
  └  :x2  Int64
  ┌ Info: Parent type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64
  ┌ Info: Columns type: Tables.Schema:
  │  :x1  Union{Missing, String}
  └  :x2  Int64

EDIT: I suspect this will affect other partitioners that rely on Iterators over Tables.rows(), eg, TableOperations.makepartition()

svilupp, Mar 12 '23 19:03

At the moment, a similar thing is blocking #477.

evetion, Nov 10 '23 15:11