Creating DataFrames from Dicts does not work as Expected
When having dicts like below, DataFrames creates (for the HorsePower column) a vector of type Vector{Union{Nothing, Int64}}. What is expected is Vector{Union{Missing, Int64}}, since all tools in DataFrames counts on using Missing. There are solutions to this, but the current situation is very hard to handle for a developer without experience. A fix might break workarounds, but nothing else.
Dict{String, Any} with 9 entries: "Miles_per_Gallon" => 25 "Cylinders" => 4 "Origin" => "USA" "Weight_in_lbs" => 2046 "Displacement" => 98 "Acceleration" => 19 "Name" => "ford pinto" "Year" => "1971-01-01" "Horsepower" => nothing
Full example, which was genereated by Gemini in Colab (which works great by the way)
# prompt: parse https://github.com/altair-viz/vega_datasets/blob/4f67bdaad10f45e3549984e17e1b3088c731503d/vega_datasets/_data/cars.json in Julia drop missing Horsepower Group by origin calculate mean of cylinders and median of cylinders and mean horsepower
using HTTP
using JSON
using DataFrames
using Statistics
# Download the JSON data
url = "https://raw.githubusercontent.com/altair-viz/vega_datasets/4f67bdaad10f45e3549984e17e1b3088c731503d/vega_datasets/_data/cars.json"
response = HTTP.get(url)
# Check for successful request
if response.status == 200
# Parse the JSON data
data = JSON.parse(String(response.body))
# Convert JSON to DataFrame
df = DataFrame(data)
# Drop rows with missing "Horsepower"
df = dropmissing(df, :Horsepower)
# Group by origin
grouped_df = groupby(df, :Origin)
# Calculate mean and median of cylinders, and mean of horsepower for each group
result = combine(grouped_df, :Cylinders => mean => :mean_cylinders,
:Cylinders => median => :median_cylinders,
:Horsepower => mean => :mean_horsepower)
# Display the result
println(result)
else
println("Error downloading the file: ", response.status)
end```
Expected result: Working example
Acutal result:
'''
MethodError: no method matching +(::Float64, ::Nothing)
Closest candidates are:
+(::Any, ::Any, ::Any, ::Any...)
@ Base operators.jl:587
+(::Real, ::Complex{Bool})
@ Base complex.jl:319
+(::AbstractFloat, ::Bool)
@ Base bool.jl:176
...
Proposed solution: Add a keyword argument strict, to the DataFrame constructor, which is by default false, where conversions are allowed. Allow anything. strict=true would give the current behavior.
This is unrelated to DataFrames.jl. You should report the issue to JSON.jl package and convince the maintainers to produce missing instead of nothing when parsing such files. Note that:
julia> data[39]
Dict{String, Any} with 9 entries:
"Miles_per_Gallon" => 25
"Cylinders" => 4
"Origin" => "USA"
"Weight_in_lbs" => 2046
"Displacement" => 98
"Acceleration" => 19
"Name" => "ford pinto"
"Year" => "1971-01-01"
"Horsepower" => nothing
So nothing is in source data and DataFrames.jl just reuses it.
What you can do is easily transform all nothing into missing in df after its creation by writing:
julia> df = something.(df, missing)
406×9 DataFrame
Row │ Miles_per_Gallon Cylinders Origin Weight_in_lbs Displacement Acceleration Name Year Horsepower
│ Float64? Int64 String Int64 Float64 Float64 String String Int64?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 18.0 8 USA 3504 307.0 12.0 chevrolet chevelle malibu 1970-01-01 130
2 │ 15.0 8 USA 3693 350.0 11.5 buick skylark 320 1970-01-01 165
3 │ 18.0 8 USA 3436 318.0 11.0 plymouth satellite 1970-01-01 150
4 │ 16.0 8 USA 3433 304.0 12.0 amc rebel sst 1970-01-01 150
5 │ 17.0 8 USA 3449 302.0 10.5 ford torino 1970-01-01 140
6 │ 15.0 8 USA 4341 429.0 10.0 ford galaxie 500 1970-01-01 198
7 │ 14.0 8 USA 4354 454.0 9.0 chevrolet impala 1970-01-01 220
8 │ 14.0 8 USA 4312 440.0 8.5 plymouth fury iii 1970-01-01 215
9 │ 14.0 8 USA 4425 455.0 10.0 pontiac catalina 1970-01-01 225
10 │ 15.0 8 USA 3850 390.0 8.5 amc ambassador dpl 1970-01-01 190
11 │ missing 4 Europe 3090 133.0 17.5 citroen ds-21 pallas 1970-01-01 115
12 │ missing 8 USA 4142 350.0 11.5 chevrolet chevelle concours (sw) 1970-01-01 165
13 │ missing 8 USA 4034 351.0 11.0 ford torino (sw) 1970-01-01 153
14 │ missing 8 USA 4166 383.0 10.5 plymouth satellite (sw) 1970-01-01 175
15 │ missing 8 USA 3850 360.0 11.0 amc rebel sst (sw) 1970-01-01 175
16 │ 15.0 8 USA 3563 383.0 10.0 dodge challenger se 1970-01-01 170
17 │ 14.0 8 USA 3609 340.0 8.0 plymouth 'cuda 340 1970-01-01 160
18 │ missing 8 USA 3353 302.0 8.0 ford mustang boss 302 1970-01-01 140
19 │ 15.0 8 USA 3761 400.0 9.5 chevrolet monte carlo 1970-01-01 150
20 │ 14.0 8 USA 3086 455.0 10.0 buick estate wagon (sw) 1970-01-01 225
21 │ 24.0 4 Japan 2372 113.0 15.0 toyota corona mark ii 1970-01-01 95
22 │ 22.0 6 USA 2833 198.0 15.5 plymouth duster 1970-01-01 95
23 │ 18.0 6 USA 2774 199.0 15.5 amc hornet 1970-01-01 97
24 │ 21.0 6 USA 2587 200.0 16.0 ford maverick 1970-01-01 85
25 │ 27.0 4 Japan 2130 97.0 14.5 datsun pl510 1970-01-01 88
26 │ 26.0 4 Europe 1835 97.0 20.5 volkswagen 1131 deluxe sedan 1970-01-01 46
27 │ 25.0 4 Europe 2672 110.0 17.5 peugeot 504 1970-01-01 87
28 │ 24.0 4 Europe 2430 107.0 14.5 audi 100 ls 1970-01-01 90
29 │ 25.0 4 Europe 2375 104.0 17.5 saab 99e 1970-01-01 95
30 │ 26.0 4 Europe 2234 121.0 12.5 bmw 2002 1970-01-01 113
31 │ 21.0 6 USA 2648 199.0 15.0 amc gremlin 1970-01-01 90
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
377 │ 27.0 4 USA 2640 112.0 18.6 chevrolet cavalier wagon 1982-01-01 88
378 │ 34.0 4 USA 2395 112.0 18.0 chevrolet cavalier 2-door 1982-01-01 88
379 │ 31.0 4 USA 2575 112.0 16.2 pontiac j2000 se hatchback 1982-01-01 85
380 │ 29.0 4 USA 2525 135.0 16.0 dodge aries se 1982-01-01 84
381 │ 27.0 4 USA 2735 151.0 18.0 pontiac phoenix 1982-01-01 90
382 │ 24.0 4 USA 2865 140.0 16.4 ford fairmont futura 1982-01-01 92
383 │ 23.0 4 USA 3035 151.0 20.5 amc concord dl 1982-01-01 missing
384 │ 36.0 4 Europe 1980 105.0 15.3 volkswagen rabbit l 1982-01-01 74
385 │ 37.0 4 Japan 2025 91.0 18.2 mazda glc custom l 1982-01-01 68
386 │ 31.0 4 Japan 1970 91.0 17.6 mazda glc custom 1982-01-01 68
Would that work for you?
I am a very experienced Julia user. I can solve all the quirks in Julia. Thats why I am a DataFrames user. This is a simple taks, just read JSON into a DataFrame and do some processing.
Please make DataFrames do this processing " df = something.(df, missing)". You can have a flag strict=true where you don't do this processing. There are too many packages to adapt to the DataFrame package. You have BSON, Mongoc (a great package by the way), other data packages, etc.
Simple tasks are way too difficult.
I can solve the issues, but others can't and will not consider Julia.
using DataFrames
car1=Dict( "id"=>"f3eb12eb-3135-4a0b-a2fd-4ccbb981974e", "name"=>"Fiat","engine"=>Dict("id"=>"0e2e1a0d-c7ea-446a-b4c0-6c1fc441bd73", "cylinders"=>4, "horsepower"=>100))
car2=Dict("doors"=>4, "id"=>"8ee4e187-5890-4953-98cf-b97bb819befb", "name"=>"Volswagen", "engine"=>Dict("id"=>"da8fc4c5-ba57-40d9-84ba-097f23070e55", "cylinders"=>6, "horsepower"=>170))
car3=Dict("doors"=>4, "id"=>"f3eb12eb-3135-4a0b-a2fd-4ccbb981974e", "name"=>"Volvo","engine"=>Dict("id"=>"0e2e1a0d-c7ea-446a-b4c0-6c1fc441bd73", "cylinders"=>4, "horsepower"=>125))
cars=[car1;car2;car3]
carsDF_temp=DataFrame(cars)
sum(skipmissing(carsDF_temp.doors))
For data in Dicts, the Dicts may or may not follow a schema. So this is a valid case. In JSON it is quite common to have missing fields.
DataFrames works very well for CSV data, but for other data sources, it's difficult to use. If users would get the CSV experience from e.g. JSON, it would be great.
I prepared a final example. This JSON represents a dataset in the real world. The task is to calculate (using DataFrames) mean of doors, tire size, cylinders, and horsepower.
json_string=raw"""
[{"tires":{"front":"14\\"", "back":"14\\""}, "name":"Fiat","id":"f3eb12eb-3135-4a0b-a2fd-4ccbb981974e","engine":{"horsepower":null,"id":"0e2e1a0d-c7ea-446a-b4c0-6c1fc441bd73","cylinders":4}},
{"tires":{"front":"15\\"", "back":"15\\""}, "name":"Volswagen","id":"8ee4e187-5890-4953-98cf-b97bb819befb","doors":4,"engine":{ "horsepower":170,"id":"da8fc4c5-ba57-40d9-84ba-097f23070e55","cylinders":6}},
{"tires":{"front":"16\\"", "back":"17\\""}, "name":"Volvo","id":"f3eb12eb-3135-4a0b-a2fd-4ccbb981974e","doors":4,"engine":{ "horsepower":125,"id":"0e2e1a0d-c7ea-446a-b4c0-6c1fc441bd73","cylinders":4}}]
"""
imput_json=JSON.parse(json_string)
If that is not enough, test this file
Here is a dataset which demonstrates the current problems well:
https://raw.githubusercontent.com/altair-viz/vega_datasets/master/vega_datasets/_data/wheat.json
@stensmo You can just add the null=missing parameter to JSON.parse :
data = JSON.parse(String(response.body);null=missing)