AlgebraOfGraphics.jl
Missing data support
In particular, remove the need to drop missing entries in the penguin tutorial.
As I said in the corresponding issue on Makie, I really want to handle missingness explicitly myself. I don't want tools to ignore missings without my intentionally requesting it. In my view, support for missing data is helping me handle the missing data -- not ignoring them.
So I hope this proposal does not get implemented, or at least not in the naive way that most graphics tools do it. I just want an error that tells me where the missings are. If you feel compelled to offer an "ignore missings" option, I hope it's explicit, like `skipmissingdata(df)` or `data(df, missings=:skip)`. Plots made with `data(df)` that use columns containing missings could raise `Error: missing data found in columns [a, b, c]. Use skipmissingdata(df) instead.`
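A minimal sketch of that explicit behaviour in plain Julia. The names `missingcolumns`, `checkmissings` and `skipmissingdata`, and the use of a NamedTuple of vectors as a stand-in table, are assumptions for illustration only; none of this is existing AlgebraOfGraphics API.

```julia
# Hypothetical helpers, not part of AlgebraOfGraphics.
# A "table" here is just a NamedTuple of equal-length vectors.

# Names of the columns that contain at least one `missing`.
missingcolumns(table) = [name for (name, col) in pairs(table) if any(ismissing, col)]

# Throw an informative error unless the table is free of missing values.
function checkmissings(table)
    cols = missingcolumns(table)
    isempty(cols) ||
        throw(ArgumentError("missing data found in columns $(cols). Use skipmissingdata(table) instead."))
    return table
end

# Explicit opt-in: keep only the rows in which no column is missing.
function skipmissingdata(table)
    nrows = length(first(table))
    keep = [all(col -> !ismissing(col[i]), values(table)) for i in 1:nrows]
    # `collect(skipmissing(...))` also narrows the element type, e.g. to Int.
    return map(col -> collect(skipmissing(col[keep])), table)
end

table = (a = [1, missing, 3], b = [4.0, 5.0, 6.0])
clean = skipmissingdata(table)   # rows 1 and 3 survive
checkmissings(clean)             # passes and returns `clean`
```

Calling `checkmissings(table)` on the unfiltered table throws an `ArgumentError` that names column `a`, so the missings are never dropped unless the user asks for it.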
Reading the tutorial, I didn't find the `dropmissing` call inelegant or annoying, but rather clear and explicit.
I would also like to special case missing data as little as possible.
A possible plan of action could be the following.
- For categorical data, don't do anything. By default, `missing` will show up as one category.
- For numerical data, also don't do anything. The column with missing values will be handed as is to the plotting function or to the analysis.
Optional:
- Create a transformation (say `filternumeric`, or `filterfinite`) whose sole purpose is to remove missing data from numerical columns in the mapping. This way, one could do `mapping(x, y) * filternumeric() * linear()`. Users are free to create their own transformations to do missing data imputation. Transformations are composed, so this would work with the framework.
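To make the optional transformation concrete, here is a rough sketch using ordinary functions on plain vectors. It assumes only numerical columns are passed in. `filternumeric` here is a hypothetical stand-in and `imputemean` is an invented example of a user-defined alternative; neither reflects how AlgebraOfGraphics actually implements or composes transformations.

```julia
# Hypothetical stand-ins for the proposed transformations; not AlgebraOfGraphics API.

# A value is usable if it is neither `missing` nor non-finite (NaN, Inf).
isusable(v) = !ismissing(v) && isfinite(v)

# Drop every row in which some numerical column has an unusable value.
function filternumeric(cols...)
    keep = [all(isusable, row) for row in zip(cols...)]
    return map(col -> collect(skipmissing(col[keep])), cols)
end

# A user-defined alternative: replace each missing by its column mean.
function imputemean(cols...)
    return map(cols) do col
        present = collect(skipmissing(col))
        m = sum(present) / length(present)
        [ismissing(v) ? m : float(v) for v in col]
    end
end

x = [1, 2, 3, 4]
y = [11, 25, 30, missing]

filtered = filternumeric(x, y)   # ([1, 2, 3], [11, 25, 30])
imputed = imputemean(x, y)       # ([1.0, 2.0, 3.0, 4.0], [11.0, 25.0, 30.0, 22.0])

# Transformations compose like ordinary functions: filter first, then analyse.
ymean = (cols -> sum(cols[2]) / length(cols[2])) ∘ filternumeric
ymean(x, y)                      # 22.0
```

Because filtering and imputation share the same shape (columns in, columns out), either can be swapped into the same position of a pipeline, which is the point of keeping missing handling as an explicit step.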
So, other than the "optional" filtering transformation, the only thing to do is to make sure that columns with only numbers and missing values are treated as continuous rather than categorical (AoG should basically just copy what StatsModels is doing).
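One way to make that continuous-versus-categorical decision is to strip `Missing` from the column's element type before testing it. The helper name `iscontinuous` is an assumption for illustration, and this sketch does not claim to reproduce the exact rule StatsModels applies.

```julia
# Hypothetical helper; not AlgebraOfGraphics or StatsModels API.
# A column counts as continuous if its element type, ignoring `Missing`,
# is a subtype of Number. An all-missing column (element type `Missing`)
# is excluded explicitly, because `Union{} <: Number` would otherwise be true.
function iscontinuous(col)
    T = nonmissingtype(eltype(col))
    return T !== Union{} && T <: Number
end

iscontinuous([1.0, 2.0, missing])   # true: numbers plus missing
iscontinuous([1, 2, 3])             # true: plain numbers
iscontinuous(["a", missing, "b"])   # false: strings are categorical
iscontinuous([missing, missing])    # false: nothing to infer a scale from
```

Note that `Bool <: Number` holds in Julia, so a real implementation would need to decide separately how Boolean columns should be treated.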
> AoG should basically just copy what StatsModels is doing
julia> lm(@formula(y ~ x), DataFrame(x=[1,2,3,4], y=[11, 25, 30, missing]))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}
y ~ 1 + x
Coefficients:
────────────────────────────────────────────────────────────────────
             Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────
(Intercept)    3.0     5.61249  0.53    0.6875   -68.3134    74.3134
x              9.5     2.59808  3.66    0.1699   -23.5117    42.5117
────────────────────────────────────────────────────────────────────
Currently, StatsModels/GLM ignores missing values. This is exactly the behavior I want to avoid. I really want this to raise an error.
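The coefficients printed above are what an ordinary least-squares fit gives on the first three rows alone, which can be checked by hand without GLM. A small sketch in plain Julia (the variable names are just for illustration):

```julia
x = [1, 2, 3, 4]
y = [11, 25, 30, missing]

# Drop the row whose response is missing, as the fitted model appears to have done.
keep = .!ismissing.(y)
xs, ys = x[keep], y[keep]

# Closed-form simple linear regression on the remaining three rows.
mx, my = sum(xs) / length(xs), sum(ys) / length(ys)
slope = sum((xs .- mx) .* (ys .- my)) / sum((xs .- mx) .^ 2)
intercept = my - slope * mx

(intercept, slope)   # (3.0, 9.5), matching the Coef. column above
```

So the fourth observation is discarded without any warning, and nothing in the printed summary shows that only three of the four rows were used.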
> For numerical data, also don't do anything. The column with missing values will be handed as is to the plotting function or to the analysis.
I think that makes sense. Currently Makie ignores missings, which I want to change, but that's Makie's problem, not AoG's.