TableTransforms.jl icon indicating copy to clipboard operation
TableTransforms.jl copied to clipboard

Dealing with missing values

Open ceferisbarov opened this issue 3 years ago • 12 comments

I am not sure whether this is intentional, but when the input table has missing values, assert_continuous function throws AssertionError: columns must hold continuous variables, even though all other values are continuous.

Considering that there is still no transform for imputation, I think it would make sense to add an option to skip the missing values.

ceferisbarov avatar Mar 18 '22 19:03 ceferisbarov

Yes, I guess we could generalize assert_continuous to work in the presence of missing values.

juliohm avatar Mar 29 '22 22:03 juliohm

julia> scitype([1.0, 2.3, missing, 3.5])
AbstractVector{Union{Missing, Continuous}} (alias for AbstractArray{Union{Missing, Continuous}, 1})

ablaom avatar Mar 30 '22 22:03 ablaom

I generalized assert_continuous to work in the presence of missing values, but this results in errors during the transformation process. For example:

MethodError: Cannot `convert` an object of type Missing to an object of type Float64

I can think of several alternatives:

  • Automatically deal with missing values before the transformation to avoid these errors
  • Simply generalize assert_continuous and allow rest of the transform to throw these errors
  • Create a helpful warning message in assert_continuous function
  • Do nothing and let assert_continuous to assert that the table is not continuous.

Should I implement one of these? Or maybe something else?

ceferisbarov avatar Apr 03 '22 19:04 ceferisbarov

That is a great suggestion @ablaom , if I understood correctly you are suggesting that we use scitype on the column and then check if the result is <: AbstractVector{Union{Missing,Continuous}} ?

juliohm avatar Apr 03 '22 19:04 juliohm

@ceferisbarov try to use scitype as @ablaom suggested, happy to review a PR.

juliohm avatar Apr 03 '22 19:04 juliohm

@juliohm Great, I am working on it. The problem is that AbstractVector{Continuous} <: AbstractVector{Union{Missing,Continuous}} returns false, but Continuous <: Union{Missing, Continuous} returns true. Is there a reason why I shouldn't use the latter? The following line:

@assert all(T <: Continuous for T in types) "columns must hold continuous variables"

would be replaced with:

@assert all(T <: Union{Missing, Continuous} for T in types) "columns must hold continuous variables"

ceferisbarov avatar Apr 03 '22 19:04 ceferisbarov

I think we need two distinct assertion functions. The one we have is more strict in the sense that it asserts that we don't have missing values. Maybe we should add a new function assert_continuous_or_missing if a transform supports missing values. Can you please remind me why we started discussing this generalization? Do we really need to allow missing values in our currently implemented transforms? As far as I remember none of our statistical transforms, which require continuous values, support missing values.

juliohm avatar Apr 03 '22 22:04 juliohm

@juliohm The original request was to skip missing values. This can be done at least for Center transform. I believe we can create assert_continuous_or_missing as you said, and use it for transforms where we can skip missing values. And then we would have to add skip missing values option to the said transform.

ceferisbarov avatar Apr 04 '22 08:04 ceferisbarov

An example:

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [missing, 2.0, 3.0, 4.0, 5.0]
x3 = [5.0, 5.0, 5.0, 5.0, 5.0]
t = TypedTables.Table(;x1, x2, x3)

t |> Center(skipmissing=true)

Output:

Table with 3 columns and 5 rows:
     x1    x2       x3
   ┌───────────────────
 1 │ -2.0  missing  0.0
 2 │ -1.0  -1.5     0.0
 3 │ 0.0   -0.5     0.0
 4 │ 1.0   0.5      0.0
 5 │ 2.0   1.5      0.0

ceferisbarov avatar Apr 04 '22 08:04 ceferisbarov

The problem is that AbstractVector{Continuous} <: AbstractVector{Union{Missing,Continuous}} returns false

Maybe this is what you're after:

julia> AbstractVector{Continuous} <: AbstractVector{<:Union{Missing,Continuous}}
true

ablaom avatar Apr 04 '22 20:04 ablaom

@ablaom This works, thanks! Do you mind looking look at the PR?

ceferisbarov avatar Apr 04 '22 21:04 ceferisbarov

Let's postpone this discussion to after v1.0 is out. We are working on the missing transforms in #13 and will soon release this first major version. The API is quite stable and we can't predict any major change in the future. Missing values can potentially change API and will be considered in a future major release.

juliohm avatar Apr 04 '22 23:04 juliohm

This is no longer an issue with the migration to DataScienceTraits.jl.

juliohm avatar Oct 25 '23 20:10 juliohm