DataFrames.jl
Initializing DataFrame from vector of named tuples: missing values
When I initialize my dataframe using
df = DataFrame([(a=1,b=2), (a=3,b=4), (a=1,)])
I would expect to get
a | b |
---|---|
1 | 2 |
3 | 4 |
1 | missing |
Instead I get an error
ERROR: type NamedTuple has no field b
Stacktrace:
[1] getproperty
@ ./Base.jl:37 [inlined]
[2] getcolumn
@ ~/.julia/packages/Tables/AcRIE/src/Tables.jl:102 [inlined]
Is there a simple way to get the former behavior? Should it be what DataFrames does by default instead of throwing an error?
Is there a simple way to get the former behavior?
julia> DataFrame(Tables.dictrowtable([(a=1,b=2), (a=3,b=4), (a=1,)]))
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼────────────────
   1 │     1        2
   2 │     3        4
   3 │     1  missing
Should it be what DataFrames does by default instead of throwing an error?
It is strict on purpose. Tables.dictrowtable was designed to handle such cases.
Alternatively you could use:
julia> reduce((df, x) -> push!(df, x; cols=:union), [(a=1,b=2), (a=3,b=4), (a=1,)], init=DataFrame())
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼────────────────
   1 │     1        2
   2 │     3        4
   3 │     1  missing
which is faster but more verbose.
Having said that, it may indeed make sense to allow what you ask for in the constructor, so I will keep the issue open.
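For repeated use, the push! approach above can be wrapped in a small helper. This is only a minimal sketch: the name dataframe_union is hypothetical and not part of DataFrames.jl; the only library feature it relies on is push! with cols=:union, as shown in the reduce call above.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): build a DataFrame from rows
# whose sets of fields may differ, filling absent fields with `missing`.
function dataframe_union(rows)
    df = DataFrame()
    for row in rows
        # cols=:union adds columns that df does not have yet and
        # fills the gaps in either direction with `missing`.
        push!(df, row; cols=:union)
    end
    return df
end

rows = [(a=1, b=2), (a=3, b=4), (a=1,)]
df = dataframe_union(rows)
```

The explicit loop does the same work as the reduce one-liner; it just gives the pattern a name and a place for comments.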
Thanks!
@quinnj, @nalimilan - what do you think? Now that I think about it, maybe a lighter wrapper than Tables.dictrowtable could be introduced in Tables.jl that would produce missing when a row has no value for a column? The issue is that Tables.dictrowtable materializes Dict entries for each row, while maybe we could have a similar "lazy wrapper" that would not materialize anything, but would inject missing where needed only when getcolumn is called?
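To illustrate the idea of injecting missing without building a Dict per row, here is a rough sketch in plain Julia. It is not the lazy Tables.jl wrapper discussed above (no such wrapper is assumed to exist), and union_columns is a hypothetical name. It makes one pass to collect the column names and then builds each column directly, using get on the named tuples to fall back to missing.

```julia
using DataFrames

# Hypothetical helper, not an existing Tables.jl or DataFrames.jl function.
function union_columns(rows)
    # First pass: collect column names in order of first appearance.
    colnames = Symbol[]
    for row in rows, name in keys(row)
        name in colnames || push!(colnames, name)
    end
    # Second pass: build one vector per column; `get` returns `missing`
    # when a row (a NamedTuple) has no field with that name.
    cols = [[get(row, name, missing) for row in rows] for name in colnames]
    return DataFrame(cols, colnames)
end

df = union_columns([(a=1, b=2), (a=3, b=4), (a=1,)])
```

This still materializes the columns eagerly; a real lazy wrapper would defer the second pass until a column is requested.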
Tables.dictrowtable is indeed hard to discover. Maybe we could support cols=:union like in vcat?
Maybe we could support cols=:union like in vcat?
This is what I was also considering. The problem is that this is not an issue at the level of DataFrames.jl. We delegate construction of columns to Tables.jl, so Tables.jl would need to have this kind of mechanism first (it is Tables.jl that throws the error, not DataFrames.jl).
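For reference, this is roughly how cols=:union already behaves in vcat today. The sketch below is only an illustration of the existing vcat keyword, not of any constructor support; it wraps each row in its own one-row DataFrame, which is an assumption made here to have something to concatenate.

```julia
using DataFrames

rows = [(a=1, b=2), (a=3, b=4), (a=1,)]

# Wrap each row in a one-row DataFrame (a one-element vector of named tuples).
parts = [DataFrame([row]) for row in rows]

# cols=:union keeps every column seen in any part and fills gaps with `missing`.
df = reduce(vcat, parts; cols=:union)
```

Creating one DataFrame per row is wasteful for large inputs, so this is meant to show the semantics rather than to be a recommended pattern.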