DataFrames.jl

Initializing a DataFrame from a vector of named tuples: missing values

Open lukas-weber opened this issue 1 year ago • 5 comments

When I initialize my dataframe using

df = DataFrame([(a=1,b=2), (a=3,b=4), (a=1,)])

I would expect to get

a  b
1  2
3  4
1  missing

Instead I get an error

ERROR: type NamedTuple has no field b
Stacktrace:
  [1] getproperty
    @ ./Base.jl:37 [inlined]
  [2] getcolumn
    @ ~/.julia/packages/Tables/AcRIE/src/Tables.jl:102 [inlined]

Is there a simple way to get the former behavior? Should it be what DataFrames does by default instead of throwing an error?

lukas-weber avatar Aug 15 '23 19:08 lukas-weber

> Is there a simple way to get the former behavior?

julia> using DataFrames, Tables

julia> DataFrame(Tables.dictrowtable([(a=1,b=2), (a=3,b=4), (a=1,)]))
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼────────────────
   1 │     1        2
   2 │     3        4
   3 │     1  missing

> Should it be what DataFrames does by default instead of throwing an error?

It is strict on purpose. Tables.dictrowtable was designed to handle such cases. Alternatively, you could use:

julia> reduce((df, x) -> push!(df, x; cols=:union), [(a=1,b=2), (a=3,b=4), (a=1,)], init=DataFrame())
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼────────────────
   1 │     1        2
   2 │     3        4
   3 │     1  missing

which is faster but more verbose.

Having said that, it may indeed make sense to allow what you ask for in the constructor, so I will keep the issue open.

bkamins avatar Aug 15 '23 19:08 bkamins

Thanks!

lukas-weber avatar Aug 15 '23 20:08 lukas-weber

@quinnj, @nalimilan - what do you think? Now that I think about it, maybe a lighter wrapper than Tables.dictrowtable could be introduced in Tables.jl that would produce missing when a row has no value for a column. The issue is that Tables.dictrowtable materializes Dict entries for each row, whereas a similar "lazy wrapper" would not materialize anything and would only inject missing where needed when getcolumn is called.
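
To make the "inject missing where needed" idea concrete, here is a minimal sketch. It is not the lazy wrapper proposed above and not an existing Tables.jl or DataFrames.jl function: `union_columns` is a hypothetical helper name, and it builds the columns eagerly. It only assumes that the input is a vector of named tuples and relies on `get` with a default for absent fields.

```julia
using DataFrames

# Hypothetical helper, not part of DataFrames.jl or Tables.jl.
# Collects the union of field names across all rows, then builds one
# column per name, filling in `missing` where a row lacks that field.
function union_columns(rows)
    colnames = Symbol[]
    for row in rows, name in keys(row)
        name in colnames || push!(colnames, name)
    end
    # get(row, name, missing) returns `missing` when the field is absent
    cols = [name => [get(row, name, missing) for row in rows] for name in colnames]
    return DataFrame(cols)
end

df = union_columns([(a=1, b=2), (a=3, b=4), (a=1,)])
```

Unlike a lazy wrapper, this sketch scans the rows once per column, so it mainly illustrates the intended result rather than an efficient implementation.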

bkamins avatar Aug 15 '23 20:08 bkamins

Tables.dictrowtable is indeed hard to discover. Maybe we could support cols=:union like in vcat?
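
For reference, this is roughly how `cols=:union` already behaves in `vcat` today. The sketch below turns each named tuple into a one-row data frame and concatenates them; the variable names `rows` and `parts` are only illustrative, and this is a workaround rather than the constructor support being discussed.

```julia
using DataFrames

rows = [(a=1, b=2), (a=3, b=4), (a=1,)]

# Build one single-row DataFrame per named tuple.
parts = [DataFrame([row]) for row in rows]

# cols=:union keeps every column seen in any part and fills gaps with missing.
df = reduce(vcat, parts; cols=:union)
```

Creating a data frame per row adds overhead, so this is likely only reasonable for small inputs.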

nalimilan avatar Aug 16 '23 07:08 nalimilan

> Maybe we could support cols=:union like in vcat?

This is what I was also considering. The problem is that this cannot be solved at the level of DataFrames.jl alone. We delegate construction of columns to Tables.jl, so Tables.jl would need to have this kind of mechanism first (it is Tables.jl that throws the error, not DataFrames.jl).

bkamins avatar Aug 16 '23 11:08 bkamins