transform(df, :symbolname => AsTable) will not work for JSON data
While working with JSON data (or similar) in Julia it is common to have Dicts which you want to flatten/unnest.
transform(df, :b => AsTable) could be a great solution, but in JSON it is common for keys to be missing, and transform(df, :b => AsTable) will not work for this case.
Example: a1=Dict("a"=>1, "b"=>Dict("c"=>1, "d"=>2)) a2=Dict("a"=>1, "b"=>Dict("c"=>1)) a=[a1;a2] df=DataFrame(a) bug=transform(df, :b => AsTable)
ERROR: ArgumentError: keys of the returned elements must be identical
Expected result: Missing value in row 2 for column d
I think this issue is related.
I found this from SO, but I doubt that's a desirable option.
I'd default to breaking up the loops and assigning them instead.
using DataFrames
a1 = Dict("a" => 1, "b" => Dict("c" => 1, "d" => 2))
a2 = Dict("a" => 1, "b" => Dict("c" => 1))
data = [a1; a2]
df = DataFrame(data)
# assignment
df_rows = 1:nrow(df)
df.b_c = [get(df[i,:b], "c", missing) for i in df_rows]
df.b_d = [get(df[i,:b], "d", missing) for i in df_rows]
df[:, Not(:b)]
2×3 DataFrame
Row │ a b_c b_d
│ Int64 Int64 Int64?
─────┼───────────────────────
1 │ 1 1 2
2 │ 1 1 missing
Thanks, while this bug is being worked on, TidierData.jl solves this perfectly using the unnest_wider macro. So far, the macro works on anything I have tested.
using DataFrames
using TidierData
a1 = Dict("a" => 1, "b" => Dict("c" => 1, "d" => 2))
a2 = Dict("a" => 1, "b" => Dict("c" => 1))
data = [a1; a2]
df = DataFrame(data)
@unnest_wider df b
2×3 DataFrame
Row │ a b_c b_d
│ Int64 Int64 Int64?
─────┼───────────────────────
1 │ 1 1 2
2 │ 1 1 missing
Proposed solution: Add a keyword argument strict, to the DataFrame constructor (AsTable uses the DataFrame constructor), which is by default false, where conversions are allowed. Allow anything. strict=true would give the current behavior.
The code you used transform(df, :b => AsTable) is "strict" (i.e. it does error checking). If you want code to be non-strict use:
julia> bug=transform(df, :b => Tables.dictrowtable => AsTable)
2×4 DataFrame
Row │ b a c d
│ Dict… Int64 Int64 Int64?
─────┼─────────────────────────────────────────────
1 │ Dict("c"=>1, "d"=>2) 1 1 2
2 │ Dict("c"=>1) 1 1 missing
The Tables.dictrowtable transformation is designed to perform column unioning as requested.
How would you handle this case?
a1=Dict("a"=>1, "b"=>Dict("a"=>3, "d"=>2))
a2=Dict("a"=>1, "b"=>Dict("a"=>5))
a=[a1;a2]
bug2=transform(bug, :b => Tables.dictrowtable => AsTable)
It just works for me as intended:
julia> df = DataFrame(a)
2×2 DataFrame
Row │ b a
│ Dict… Int64
─────┼─────────────────────────────
1 │ Dict("a"=>3, "d"=>2) 1
2 │ Dict("a"=>5) 1
julia> transform(df, :b => Tables.dictrowtable => AsTable)
2×3 DataFrame
Row │ b a d
│ Dict… Int64 Int64?
─────┼──────────────────────────────────────
1 │ Dict("a"=>3, "d"=>2) 3 2
2 │ Dict("a"=>5) 5 missing
I am not sure you realized the problem, so I made another case. I will stick to the cars example in another bug report:
using DataFrames
car1=Dict("id"=>"f3eb12eb-3135-4a0b-a2fd-4ccbb981974e", "name"=>"Fiat","engine"=>Dict("id"=>"0e2e1a0d-c7ea-446a-b4c0-6c1fc441bd73", "cylinders"=>4, "horsepower"=>125))
car2=Dict("id"=>"8ee4e187-5890-4953-98cf-b97bb819befb", "name"=>"Volkswagen", "engine"=>Dict("id"=>"da8fc4c5-ba57-40d9-84ba-097f23070e55", "cylinders"=>6, "horsepower"=>170))
cars=[car1;car2]
carsDF_temp=DataFrame(cars)
carsDF=transform(carsDF_temp, :engine => Tables.dictrowtable => AsTable)
carsInStorage=DataFrame(id=["f3eb12eb-3135-4a0b-a2fd-4ccbb981974e", "8ee4e187-5890-4953-98cf-b97bb819befb"], price=[100,200])
availableEngines = innerjoin(carsDF, carsInStorage, on=:id)
Huh, weird, no avaiable engines?
Simply put, the id in carsDF_temp are replaced by the respective engine-id after using transform() for carsDF.
the
idincarsDF_tempare replaced by the respective engine-idafter usingtransform() forcarsDF.
Yes, and this is expected.
If that was expected, why did you suggest transform(df, :b => Tables.dictrowtable => AsTable)? Name collisions are very common in the real world.
I do not understand your question. The transform(df, :b => Tables.dictrowtable => AsTable) requests to replace the existing column with a new column if this is needed. If the user wants to give new column names one should write transform(df, :b => Tables.dictrowtable => [here list names of the columns you want]).
The user may not know that a new column is required. This is especially true whe working with e.g., large nests of data.