DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Creating DataFrames from Dicts does not work as Expected

Open stensmo opened this issue 11 months ago • 8 comments

When having dicts like below, DataFrames creates (for the HorsePower column) a vector of type Vector{Union{Nothing, Int64}}. What is expected is Vector{Union{Missing, Int64}}, since all tools in DataFrames counts on using Missing. There are solutions to this, but the current situation is very hard to handle for a developer without experience. A fix might break workarounds, but nothing else.

Dict{String, Any} with 9 entries: "Miles_per_Gallon" => 25 "Cylinders" => 4 "Origin" => "USA" "Weight_in_lbs" => 2046 "Displacement" => 98 "Acceleration" => 19 "Name" => "ford pinto" "Year" => "1971-01-01" "Horsepower" => nothing

Full example, which was genereated by Gemini in Colab (which works great by the way)

# prompt: parse https://github.com/altair-viz/vega_datasets/blob/4f67bdaad10f45e3549984e17e1b3088c731503d/vega_datasets/_data/cars.json in Julia drop missing Horsepower Group by origin calculate mean of cylinders and median of cylinders and mean horsepower

using HTTP
using JSON
using DataFrames
using Statistics

# Download the JSON data
url = "https://raw.githubusercontent.com/altair-viz/vega_datasets/4f67bdaad10f45e3549984e17e1b3088c731503d/vega_datasets/_data/cars.json"
response = HTTP.get(url)

# Check for successful request
if response.status == 200
    # Parse the JSON data
    data = JSON.parse(String(response.body))

    # Convert JSON to DataFrame
    df = DataFrame(data)

    # Drop rows with missing "Horsepower"
    df = dropmissing(df, :Horsepower)

    # Group by origin
    grouped_df = groupby(df, :Origin)


    # Calculate mean and median of cylinders, and mean of horsepower for each group
    result = combine(grouped_df, :Cylinders => mean => :mean_cylinders,
                             :Cylinders => median => :median_cylinders,
                             :Horsepower => mean => :mean_horsepower)


    # Display the result
    println(result)
else
    println("Error downloading the file: ", response.status)
end```



Expected result: Working example

Acutal result: 

'''
MethodError: no method matching +(::Float64, ::Nothing)

Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...)
   @ Base operators.jl:587
  +(::Real, ::Complex{Bool})
   @ Base complex.jl:319
  +(::AbstractFloat, ::Bool)
   @ Base bool.jl:176
  ...

stensmo avatar Mar 28 '25 06:03 stensmo

Proposed solution: Add a keyword argument strict, to the DataFrame constructor, which is by default false, where conversions are allowed. Allow anything. strict=true would give the current behavior.

stensmo avatar Mar 28 '25 18:03 stensmo

This is unrelated to DataFrames.jl. You should report the issue to JSON.jl package and convince the maintainers to produce missing instead of nothing when parsing such files. Note that:

julia> data[39]
Dict{String, Any} with 9 entries:
  "Miles_per_Gallon" => 25
  "Cylinders"        => 4
  "Origin"           => "USA"
  "Weight_in_lbs"    => 2046
  "Displacement"     => 98
  "Acceleration"     => 19
  "Name"             => "ford pinto"
  "Year"             => "1971-01-01"
  "Horsepower"       => nothing

So nothing is in source data and DataFrames.jl just reuses it.

What you can do is easily transform all nothing into missing in df after its creation by writing:

julia> df = something.(df, missing)
406×9 DataFrame
 Row │ Miles_per_Gallon  Cylinders  Origin  Weight_in_lbs  Displacement  Acceleration  Name                               Year        Horsepower
     │ Float64?          Int64      String  Int64          Float64       Float64       String                             String      Int64?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │             18.0          8  USA              3504         307.0          12.0  chevrolet chevelle malibu          1970-01-01         130
   2 │             15.0          8  USA              3693         350.0          11.5  buick skylark 320                  1970-01-01         165
   3 │             18.0          8  USA              3436         318.0          11.0  plymouth satellite                 1970-01-01         150
   4 │             16.0          8  USA              3433         304.0          12.0  amc rebel sst                      1970-01-01         150
   5 │             17.0          8  USA              3449         302.0          10.5  ford torino                        1970-01-01         140
   6 │             15.0          8  USA              4341         429.0          10.0  ford galaxie 500                   1970-01-01         198
   7 │             14.0          8  USA              4354         454.0           9.0  chevrolet impala                   1970-01-01         220
   8 │             14.0          8  USA              4312         440.0           8.5  plymouth fury iii                  1970-01-01         215
   9 │             14.0          8  USA              4425         455.0          10.0  pontiac catalina                   1970-01-01         225
  10 │             15.0          8  USA              3850         390.0           8.5  amc ambassador dpl                 1970-01-01         190
  11 │        missing            4  Europe           3090         133.0          17.5  citroen ds-21 pallas               1970-01-01         115
  12 │        missing            8  USA              4142         350.0          11.5  chevrolet chevelle concours (sw)   1970-01-01         165
  13 │        missing            8  USA              4034         351.0          11.0  ford torino (sw)                   1970-01-01         153
  14 │        missing            8  USA              4166         383.0          10.5  plymouth satellite (sw)            1970-01-01         175
  15 │        missing            8  USA              3850         360.0          11.0  amc rebel sst (sw)                 1970-01-01         175
  16 │             15.0          8  USA              3563         383.0          10.0  dodge challenger se                1970-01-01         170
  17 │             14.0          8  USA              3609         340.0           8.0  plymouth 'cuda 340                 1970-01-01         160
  18 │        missing            8  USA              3353         302.0           8.0  ford mustang boss 302              1970-01-01         140
  19 │             15.0          8  USA              3761         400.0           9.5  chevrolet monte carlo              1970-01-01         150
  20 │             14.0          8  USA              3086         455.0          10.0  buick estate wagon (sw)            1970-01-01         225
  21 │             24.0          4  Japan            2372         113.0          15.0  toyota corona mark ii              1970-01-01          95
  22 │             22.0          6  USA              2833         198.0          15.5  plymouth duster                    1970-01-01          95
  23 │             18.0          6  USA              2774         199.0          15.5  amc hornet                         1970-01-01          97
  24 │             21.0          6  USA              2587         200.0          16.0  ford maverick                      1970-01-01          85
  25 │             27.0          4  Japan            2130          97.0          14.5  datsun pl510                       1970-01-01          88
  26 │             26.0          4  Europe           1835          97.0          20.5  volkswagen 1131 deluxe sedan       1970-01-01          46
  27 │             25.0          4  Europe           2672         110.0          17.5  peugeot 504                        1970-01-01          87
  28 │             24.0          4  Europe           2430         107.0          14.5  audi 100 ls                        1970-01-01          90
  29 │             25.0          4  Europe           2375         104.0          17.5  saab 99e                           1970-01-01          95
  30 │             26.0          4  Europe           2234         121.0          12.5  bmw 2002                           1970-01-01         113
  31 │             21.0          6  USA              2648         199.0          15.0  amc gremlin                        1970-01-01          90
  ⋮  │        ⋮              ⋮        ⋮           ⋮             ⋮             ⋮                        ⋮                      ⋮           ⋮
 377 │             27.0          4  USA              2640         112.0          18.6  chevrolet cavalier wagon           1982-01-01          88
 378 │             34.0          4  USA              2395         112.0          18.0  chevrolet cavalier 2-door          1982-01-01          88
 379 │             31.0          4  USA              2575         112.0          16.2  pontiac j2000 se hatchback         1982-01-01          85
 380 │             29.0          4  USA              2525         135.0          16.0  dodge aries se                     1982-01-01          84
 381 │             27.0          4  USA              2735         151.0          18.0  pontiac phoenix                    1982-01-01          90
 382 │             24.0          4  USA              2865         140.0          16.4  ford fairmont futura               1982-01-01          92
 383 │             23.0          4  USA              3035         151.0          20.5  amc concord dl                     1982-01-01     missing
 384 │             36.0          4  Europe           1980         105.0          15.3  volkswagen rabbit l                1982-01-01          74
 385 │             37.0          4  Japan            2025          91.0          18.2  mazda glc custom l                 1982-01-01          68
 386 │             31.0          4  Japan            1970          91.0          17.6  mazda glc custom                   1982-01-01          68

Would that work for you?

bkamins avatar Mar 29 '25 12:03 bkamins

I am a very experienced Julia user. I can solve all the quirks in Julia. Thats why I am a DataFrames user. This is a simple taks, just read JSON into a DataFrame and do some processing.

Please make DataFrames do this processing " df = something.(df, missing)". You can have a flag strict=true where you don't do this processing. There are too many packages to adapt to the DataFrame package. You have BSON, Mongoc (a great package by the way), other data packages, etc.

Simple tasks are way too difficult.

I can solve the issues, but others can't and will not consider Julia.

stensmo avatar Mar 29 '25 12:03 stensmo

using DataFrames



  car1=Dict( "id"=>"f3eb12eb-3135-4a0b-a2fd-4ccbb981974e", "name"=>"Fiat","engine"=>Dict("id"=>"0e2e1a0d-c7ea-446a-b4c0-6c1fc441bd73", "cylinders"=>4, "horsepower"=>100))
    car2=Dict("doors"=>4, "id"=>"8ee4e187-5890-4953-98cf-b97bb819befb", "name"=>"Volswagen", "engine"=>Dict("id"=>"da8fc4c5-ba57-40d9-84ba-097f23070e55", "cylinders"=>6, "horsepower"=>170))
    car3=Dict("doors"=>4, "id"=>"f3eb12eb-3135-4a0b-a2fd-4ccbb981974e", "name"=>"Volvo","engine"=>Dict("id"=>"0e2e1a0d-c7ea-446a-b4c0-6c1fc441bd73", "cylinders"=>4, "horsepower"=>125))

    cars=[car1;car2;car3]

    carsDF_temp=DataFrame(cars)

    sum(skipmissing(carsDF_temp.doors))

For data in Dicts, the Dicts may or may not follow a schema. So this is a valid case. In JSON it is quite common to have missing fields.

stensmo avatar Mar 31 '25 10:03 stensmo

DataFrames works very well for CSV data, but for other data sources, it's difficult to use. If users would get the CSV experience from e.g. JSON, it would be great.

stensmo avatar Mar 31 '25 10:03 stensmo

I prepared a final example. This JSON represents a dataset in the real world. The task is to calculate (using DataFrames) mean of doors, tire size, cylinders, and horsepower.


json_string=raw"""
[{"tires":{"front":"14\\"", "back":"14\\""}, "name":"Fiat","id":"f3eb12eb-3135-4a0b-a2fd-4ccbb981974e","engine":{"horsepower":null,"id":"0e2e1a0d-c7ea-446a-b4c0-6c1fc441bd73","cylinders":4}},
{"tires":{"front":"15\\"", "back":"15\\""}, "name":"Volswagen","id":"8ee4e187-5890-4953-98cf-b97bb819befb","doors":4,"engine":{ "horsepower":170,"id":"da8fc4c5-ba57-40d9-84ba-097f23070e55","cylinders":6}},
{"tires":{"front":"16\\"", "back":"17\\""}, "name":"Volvo","id":"f3eb12eb-3135-4a0b-a2fd-4ccbb981974e","doors":4,"engine":{ "horsepower":125,"id":"0e2e1a0d-c7ea-446a-b4c0-6c1fc441bd73","cylinders":4}}]
"""


imput_json=JSON.parse(json_string)

If that is not enough, test this file

`Real world JSON file``

stensmo avatar Mar 31 '25 11:03 stensmo

Here is a dataset which demonstrates the current problems well:

https://raw.githubusercontent.com/altair-viz/vega_datasets/master/vega_datasets/_data/wheat.json

stensmo avatar Apr 02 '25 12:04 stensmo

@stensmo You can just add the null=missing parameter to JSON.parse :

data = JSON.parse(String(response.body);null=missing)

jvigneron avatar May 22 '25 14:05 jvigneron