DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

transform(df, :symbolname => AsTable) will not work for JSON data

Open stensmo opened this issue 1 year ago • 12 comments

While working with JSON data (or similar) in Julia it is common to have Dicts which you want to flatten/unnest.

transform(df, :b => AsTable) could be a great solution, but in JSON it is common for keys to be missing, and transform(df, :b => AsTable) will not work for this case.

Example: a1=Dict("a"=>1, "b"=>Dict("c"=>1, "d"=>2)) a2=Dict("a"=>1, "b"=>Dict("c"=>1)) a=[a1;a2] df=DataFrame(a) bug=transform(df, :b => AsTable)

ERROR: ArgumentError: keys of the returned elements must be identical

Expected result: Missing value in row 2 for column d

stensmo avatar Feb 18 '25 04:02 stensmo

I think this issue is related.

I found this from SO, but I doubt that's a desirable option.

I'd default to breaking up the loops and assigning them instead.

using DataFrames

a1 = Dict("a" => 1, "b" => Dict("c" => 1, "d" => 2))
a2 = Dict("a" => 1, "b" => Dict("c" => 1))

data = [a1; a2]

df = DataFrame(data)

# assignment
df_rows = 1:nrow(df)

df.b_c = [get(df[i,:b], "c", missing) for i in df_rows]
df.b_d = [get(df[i,:b], "d", missing) for i in df_rows]

df[:, Not(:b)]
2×3 DataFrame
 Row │ a      b_c    b_d
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     1      1        2
   2 │     1      1  missing

AuRobinson avatar Mar 21 '25 21:03 AuRobinson

Thanks, while this bug is being worked on, TidierData.jl solves this perfectly using the unnest_wider macro. So far, the macro works on anything I have tested.


using DataFrames
using TidierData

a1 = Dict("a" => 1, "b" => Dict("c" => 1, "d" => 2))
a2 = Dict("a" => 1, "b" => Dict("c" => 1))

data = [a1; a2]

df = DataFrame(data)

@unnest_wider df b


2×3 DataFrame
 Row │ a      b_c    b_d
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     1      1        2
   2 │     1      1  missing

stensmo avatar Mar 27 '25 16:03 stensmo

Proposed solution: Add a keyword argument strict, to the DataFrame constructor (AsTable uses the DataFrame constructor), which is by default false, where conversions are allowed. Allow anything. strict=true would give the current behavior.

stensmo avatar Mar 29 '25 07:03 stensmo

The code you used transform(df, :b => AsTable) is "strict" (i.e. it does error checking). If you want code to be non-strict use:

julia> bug=transform(df, :b => Tables.dictrowtable => AsTable)
2×4 DataFrame
 Row │ b                     a      c      d
     │ Dict…                 Int64  Int64  Int64?
─────┼─────────────────────────────────────────────
   1 │ Dict("c"=>1, "d"=>2)      1      1        2
   2 │ Dict("c"=>1)              1      1  missing

The Tables.dictrowtable transformation is designed to perform column unioning as requested.

bkamins avatar Mar 29 '25 12:03 bkamins

How would you handle this case?

a1=Dict("a"=>1, "b"=>Dict("a"=>3, "d"=>2))
a2=Dict("a"=>1, "b"=>Dict("a"=>5))
a=[a1;a2]
bug2=transform(bug, :b => Tables.dictrowtable => AsTable)

stensmo avatar Mar 29 '25 15:03 stensmo

It just works for me as intended:

julia> df = DataFrame(a)
2×2 DataFrame
 Row │ b                     a
     │ Dict…                 Int64
─────┼─────────────────────────────
   1 │ Dict("a"=>3, "d"=>2)      1
   2 │ Dict("a"=>5)              1

julia> transform(df, :b => Tables.dictrowtable => AsTable)
2×3 DataFrame
 Row │ b                     a      d
     │ Dict…                 Int64  Int64?
─────┼──────────────────────────────────────
   1 │ Dict("a"=>3, "d"=>2)      3        2
   2 │ Dict("a"=>5)              5  missing

bkamins avatar Mar 29 '25 16:03 bkamins

I am not sure you realized the problem, so I made another case. I will stick to the cars example in another bug report:

using DataFrames

car1=Dict("id"=>"f3eb12eb-3135-4a0b-a2fd-4ccbb981974e", "name"=>"Fiat","engine"=>Dict("id"=>"0e2e1a0d-c7ea-446a-b4c0-6c1fc441bd73", "cylinders"=>4, "horsepower"=>125))
car2=Dict("id"=>"8ee4e187-5890-4953-98cf-b97bb819befb", "name"=>"Volkswagen", "engine"=>Dict("id"=>"da8fc4c5-ba57-40d9-84ba-097f23070e55", "cylinders"=>6, "horsepower"=>170))


cars=[car1;car2]

carsDF_temp=DataFrame(cars)

carsDF=transform(carsDF_temp, :engine => Tables.dictrowtable => AsTable)

carsInStorage=DataFrame(id=["f3eb12eb-3135-4a0b-a2fd-4ccbb981974e", "8ee4e187-5890-4953-98cf-b97bb819befb"], price=[100,200])

availableEngines = innerjoin(carsDF, carsInStorage, on=:id)

Huh, weird, no avaiable engines?

stensmo avatar Mar 29 '25 16:03 stensmo

Simply put, the id in carsDF_temp are replaced by the respective engine-id after using transform() for carsDF.

AuRobinson avatar Mar 29 '25 17:03 AuRobinson

the id in carsDF_temp are replaced by the respective engine-id after using transform() for carsDF.

Yes, and this is expected.

bkamins avatar Mar 29 '25 21:03 bkamins

If that was expected, why did you suggest transform(df, :b => Tables.dictrowtable => AsTable)? Name collisions are very common in the real world.

stensmo avatar Mar 29 '25 23:03 stensmo

I do not understand your question. The transform(df, :b => Tables.dictrowtable => AsTable) requests to replace the existing column with a new column if this is needed. If the user wants to give new column names one should write transform(df, :b => Tables.dictrowtable => [here list names of the columns you want]).

bkamins avatar Mar 31 '25 08:03 bkamins

The user may not know that a new column is required. This is especially true whe working with e.g., large nests of data.

stensmo avatar Mar 31 '25 10:03 stensmo