Query.jl icon indicating copy to clipboard operation
Query.jl copied to clipboard

@groupby failed on CategoricalString with missing value

Open AnselmJeong opened this issue 5 years ago • 8 comments

Let's say, we try to @groupby on a CategoricalString column

df = DataFrame(
    fruit    =  ["Apple", "Banana", "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})

The result is exactly as expected.

3×2 DataFrame
│ Row │ key          │ m       │
│     │ Categorical… │ Float64 │
├─────┼──────────────┼─────────┤
│ 1   │ Apple        │ 1.2     │
│ 2   │ Banana       │ 2.0     │
│ 3   │ Cherry       │ 0.4     │

Next, we make the same column contain some missing values.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})

Then, if failed with an error message stating somewhat like this.

ERROR: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous

However, this behavior is weird since String column with similar missing values does not issue any complaints.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})
key      │ m
─────────┼────
"Apple"  │ 1.2
#NA      │ 2.0
"Cherry" │ 0.4
"Banana" │ 2.0

This limitation may have to be fixed, since it is natural that String columns are recoded as Categorical columns before beginning any analysis.

Thanks.

AnselmJeong avatar Aug 18 '19 05:08 AnselmJeong

Similar behavior found on @filter missing values.

df = DataFrame(
    fruit    =  ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
    price    =  [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])

categorical!(df, :fruit)
df |> @filter(!isna(_.fruit)) |> DataFrame

It fails with the same ERROR message:

LoadError: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous

This example works without any error if String colums were not converted into Categorical columns or if there were not any missing values.

The same kind of error persists with simpler types of filtering.

df |> @filter(_.fruit == "Banana") |> DataFrame

Also fails with the same ERROR message.

Conclusively, @filter or @groupby seems not work with Categorical variables with missing value.

Thanks.

AnselmJeong avatar Aug 18 '19 05:08 AnselmJeong

Same for

df |> @select(:price)

or

df |> @filter(true)

(with the above DataFrame).

So it seems that the problem occurs when transforming the DataFrame to a Query.jl iterator.

greimel avatar Aug 20 '19 12:08 greimel

This is actually unrelated to Query.jl, one can get the same error with:

using IteratorInterfaceExtensions

it = getiterator(df)

first(it)

The implementation for that lives in Tables.jl.

The root problem though seems to be that both DataValues.jl and CategoricalArrays.jl add methods to convert that don't play well together.

I'm a bit lost what to do here, maybe @quinnj has an idea? Also CCing @nalimilan, the resident categorical expert. Is this a case that needs special casing in Tables.jl?

davidanthoff avatar Aug 25 '19 04:08 davidanthoff

I haven't been able to reproduce locally (even after updating all packages). Is there anything special I need to do? I have DataValues 0.4.12.

nalimilan avatar Aug 25 '19 15:08 nalimilan

Here is what I'm using:

(repo1) pkg> st
    Status `C:\Users\david\.julia\environments\repo1\Project.toml`
  [a93c6f00] DataFrames v0.19.2
  [e7dc6d0d] DataValues v0.4.12
  [82899510] IteratorInterfaceExtensions v1.0.0

And the code:

julia> using DataFrames, IteratorInterfaceExtensions, DataValues

julia> df = DataFrame(fruit= ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"], price= [1.2, 2.0, 0.4, 1.2, 2.0, 0.4]);

julia> categorical!(df, :fruit);

julia> it = getiterator(df);

julia> first(it)
ERROR: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous. Candidates:
  convert(::Type{S}, x::T) where {S, T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)} in CategoricalArrays at C:\Users\david\.julia\packages\CategoricalArrays\xjesC\src\value.jl:91
  convert(::Type{DataValue{T}}, x::T) where T in DataValues at C:\Users\david\.julia\packages\DataValues\XQWvG\src\scalar\core.jl:31
Possible fix, define
  convert(::Type{DataValue{T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)}}, ::T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R))
Stacktrace:
 [1] macro expansion at C:\Users\david\.julia\packages\Tables\FXXeK\src\tofromdatavalues.jl:102 [inlined]
 [2] iterate(::Tables.DataValueRowIterator{NamedTuple{(:fruit, :price),Tuple{DataValue{CategoricalString{UInt32}},Float64}},Tables.RowIterator{NamedTuple{(:fruit, :price),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing},Array{Float64,1}}}}}, ::Tuple{}) at C:\Users\david\.julia\packages\Tables\FXXeK\src\tofromdatavalues.jl:96 (repeats 2 times)
 [3] first(::Tables.DataValueRowIterator{NamedTuple{(:fruit, :price),Tuple{DataValue{CategoricalString{UInt32}},Float64}},Tables.RowIterator{NamedTuple{(:fruit, :price),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing},Array{Float64,1}}}}}) at .\abstractarray.jl:342
 [4] top-level scope at REPL[11]:1 

Make sure you load DataValues, otherwise Tables.jl will use a different code path.

davidanthoff avatar Aug 25 '19 18:08 davidanthoff

Still no error with these versions:

(v1.2) pkg> st
    Status `~/.julia/environments/v1.2/Project.toml`
  [6e4b80f9] BenchmarkTools v0.4.2
  [336ed68f] CSV v0.5.0 [`~/.julia/dev/CSV`]
  [324d7699] CategoricalArrays v0.5.5 [`~/.julia/dev/CategoricalArrays`]
  [a93c6f00] DataFrames v0.19.2
  [e7dc6d0d] DataValues v0.4.12
  [82899510] IteratorInterfaceExtensions v1.0.0
  [2dfb63ee] PooledArrays v0.5.2+ [`~/.julia/dev/PooledArrays`]
  [1a8c2f83] Query v0.12.1
  [295af30f] Revise v1.0.2
  [bd369af6] Tables v0.2.6

nalimilan avatar Aug 26 '19 09:08 nalimilan

@nalimilan: here is a reproducer repo that you can run with this: Binder

davidanthoff avatar Aug 26 '19 19:08 davidanthoff

Thanks. So this is indeed a legitimate ambiguity which cannot be fixed separately in either package. I think @quinnj's https://github.com/JuliaLang/julia/pull/32613 should help fixing this, but that won't work for current Julia versions. Maybe we should define that unwrap method in Compat or in DataAPI. But first we should bump that PR so that at least it's merged in master.

Do you think this is a new problem? I still don't understand why I didn't observe it in my original tests.

nalimilan avatar Sep 01 '19 16:09 nalimilan