AlgebraOfGraphics.jl

Missing data support

Open piever opened this issue 4 years ago • 4 comments

In particular, remove the need to drop missing entries in the penguin tutorial.

piever, May 09 '21 11:05

As I said in the corresponding issue on Makie, I really want to handle missingness explicitly myself. I don't want tools to ignore missings unless I intentionally request it. In my view, support for missing data means helping me handle the missing data, not ignoring it.

So I hope this proposal does not get implemented, or at least not in the naive way that most graphics tools do it. I just want an error that tells me where the missings are. If you feel compelled to offer an "ignore missings" option, I hope it's explicit, like skipmissingdata(df) or data(df, missings=:skip). A plot made with data(df) that uses columns containing missing values could then raise Error: missing data found in columns [a, b, c]. Use skipmissingdata(df) instead.
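To make the idea concrete, here is a minimal sketch of the two behaviours, written against plain Base Julia rather than AlgebraOfGraphics. Everything in it is hypothetical: strictdata, skipmissingdata and columns_with_missing are illustrative names, not existing API, and a "table" is assumed to be a NamedTuple of equal-length vectors.

```julia
# Hypothetical sketch: none of these names exist in AlgebraOfGraphics.
# A "table" here is a NamedTuple of equal-length column vectors.

# Names of the columns that contain at least one `missing`.
columns_with_missing(table::NamedTuple) =
    [name for (name, col) in pairs(table) if any(ismissing, col)]

# Strict entry point: refuse to proceed when any column has missing values.
function strictdata(table::NamedTuple)
    bad = columns_with_missing(table)
    isempty(bad) ||
        error("missing data found in columns $(bad). Use skipmissingdata(table) instead.")
    return table
end

# Explicit opt-in: drop every row in which any column is missing.
function skipmissingdata(table::NamedTuple)
    n = length(first(table))
    keep = [all(col -> !ismissing(col[i]), values(table)) for i in 1:n]
    # `identity.` narrows the element type once the missings are gone.
    return map(col -> identity.(col[keep]), table)
end

t = (a = [1, missing, 3], b = [1.0, 2.0, 3.0], c = ["x", "y", missing])
clean = skipmissingdata(t)   # only the first row survives
```

Calling strictdata(t) on the table above throws and names columns a and c, while strictdata(clean) passes the data through unchanged.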

jtrakk, May 21 '21 18:05

Reading the tutorial, I didn't find the dropmissing inelegant or annoying but rather clear and explicit.

knuesel, May 22 '21 12:05

I would also like to special case missing data as little as possible.

A possible plan of action could be the following.

  • For categorical data, don't do anything. By default, missing will show up as one category.
  • For numerical data, also don't do anything. The column with missing values will be handed as is to the plotting function or to the analysis.
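As a rough illustration of what "don't do anything" means in plain Julia (the column values below are made up for the example, and no AlgebraOfGraphics code is involved): missing already behaves like one more level in a categorical column, and it propagates through numerical reductions unless it is skipped explicitly.

```julia
# Categorical column: `unique` compares with `isequal`, so all missings
# collapse into a single level alongside the real categories.
species = ["Adelie", missing, "Gentoo", "Adelie", missing]
levels = unique(species)
counts = Dict(l => count(isequal(l), species) for l in levels)

# Numerical column: handed over as is, `missing` propagates through the
# reduction instead of being dropped silently.
mass = [3.7, missing, 5.1]
total = sum(mass)                        # missing
total_skipped = sum(skipmissing(mass))   # explicit opt-in to skipping
```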

Optional:

  • Create a transformation (say filternumeric, or filterfinite) whose sole purpose is to remove missing data from numerical columns in the mapping. This way, one could do mapping(x, y) * filternumeric() * linear(). Users are free to create their own transformations to do missing data imputation. Transformations are composed, so this would work with the framework.
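A minimal sketch of what such a filtering step could do to the mapped columns. This is a plain function on vectors, not an AlgebraOfGraphics transformation; the name filterfinite is borrowed from the suggestion above, and also dropping NaN and Inf is an assumption based on that name.

```julia
# Hypothetical stand-in for the proposed filtering step; it works on plain
# vectors and does not plug into AlgebraOfGraphics.

# A value is usable when it is neither missing nor non-finite (NaN, Inf).
isusable(v) = !ismissing(v) && isfinite(v)

# Keep only the rows where every given column holds a usable value.
function filterfinite(columns::AbstractVector...)
    rows = eachindex(first(columns))
    keep = [all(isusable(col[i]) for col in columns) for i in rows]
    # `identity.` narrows the element type once the missings are gone.
    return map(col -> identity.(col[keep]), columns)
end

x = [1.0, missing, 3.0, 4.0]
y = [2.0, 5.0, NaN, 8.0]
xs, ys = filterfinite(x, y)   # rows 2 and 3 are dropped
```

An imputation step would have the same shape (columns in, columns out), which is what would let such steps compose.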

So, other than the "optional" filtering transformation, the only thing to do is to make sure that columns with only numbers and missing values are treated as continuous rather than categorical (AoG should basically just copy what StatsModels is doing).
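The continuous-versus-categorical check could be sketched as below. The function name iscontinuous is hypothetical, and this does not reproduce StatsModels's actual rules; it only shows that stripping Missing from the element type is enough to classify a column of numbers and missing values as continuous.

```julia
# Hypothetical helper: a column counts as continuous when its element type,
# with `Missing` removed, is a subtype of `Number`.
function iscontinuous(col::AbstractVector)
    T = nonmissingtype(eltype(col))
    # An all-missing column leaves `Union{}`, which is a subtype of
    # everything, so exclude it explicitly.
    return T !== Union{} && T <: Number
end
```

With this check, [1, missing, 3] is continuous, while ["a", missing] and a column of only missing values are not.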

piever, May 24 '21 17:05

> AoG should basically just copy what StatsModels is doing

julia> lm(@formula(y ~ x), DataFrame(x=[1,2,3,4], y=[11, 25, 30, missing]))

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x

Coefficients:
────────────────────────────────────────────────────────────────────
             Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────
(Intercept)    3.0     5.61249  0.53    0.6875   -68.3134    74.3134
x              9.5     2.59808  3.66    0.1699   -23.5117    42.5117
────────────────────────────────────────────────────────────────────

Currently, StatsModels/GLM ignores missing values. This is exactly the behavior I want to avoid. I really want this to raise an error.
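One way to check that the row with the missing y was dropped rather than reported: fitting ordinary least squares by hand on the three complete rows gives the same coefficients as the table above. This sketch uses only the LinearAlgebra standard library, not GLM.

```julia
using LinearAlgebra  # standard library; least squares via `\`

x = [1, 2, 3, 4]
y = [11, 25, 30, missing]

# Keep only the rows where y is observed (what the fit above did implicitly).
keep = .!ismissing.(y)

# Design matrix with an intercept column, then solve the least-squares problem.
X = [ones(count(keep)) x[keep]]
β = X \ Float64.(y[keep])   # intercept and slope
```

The result is approximately [3.0, 9.5], matching the (Intercept) and x rows, so the fit used three observations and the fourth was discarded without any warning.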

> For numerical data, also don't do anything. The column with missing values will be handed as is to the plotting function or to the analysis.

I think that makes sense. Currently Makie ignores missings, which I want to change, but that's Makie's problem, not AoG's.

jtrakk, May 24 '21 18:05