OneHotArrays.jl

Support Categorical Values directly

Open schlichtanders opened this issue 1 year ago • 3 comments

Motivation and description

In data science, CategoricalArrays.CategoricalValue, CategoricalArrays.CategoricalVector, and related types appear often (RDatasets loads DataFrames with columns of that type by default).

It would be great if onehotbatch could be applied to these types directly.

I just came to this package and am still figuring out how to transform such a categorical value or vector into a one-hot vector or matrix. It is quite possible that I missed something obvious.

Possible Implementation

No response

schlichtanders, Feb 28 '24 10:02

Attempting to construct the minimal object:

julia> using CategoricalArrays, OneHotArrays

julia> cv = CategoricalArrays.CategoricalValue('b', CategoricalArray('a':'z'))
CategoricalValue{Char, UInt32} 'b'

julia> dump(cv)
CategoricalValue{Char, UInt32}
  pool: CategoricalPool{Char, UInt32, CategoricalValue{Char, UInt32}}
    levels: Array{Char}((26,))
      1: Char 'a'
      2: Char 'b'
      3: Char 'c'
      4: Char 'd'
      5: Char 'e'
      ...
      22: Char 'v'
      23: Char 'w'
      24: Char 'x'
      25: Char 'y'
      26: Char 'z'
    invindex: Dict{Char, UInt32}
      slots: Memory{UInt8}
        length: Int64 64
        ptr: Ptr{Nothing} @0x0000000160607020
    ...

julia> cv.pool.levels
26-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
...

julia> Int(cv.ref), length(cv.pool.levels)
(2, 26)

julia> OneHotArrays.onehot(cv::CategoricalValue) = OneHotVector(cv.ref, length(cv.pool.levels))

julia> onehot(cv)
26-element OneHotVector(::UInt32) with eltype Bool:
 ⋅
 1
 ⋅
 ⋅
 ⋅
 ⋅
...

julia> dump(onehot(cv))
OneHotVector{UInt32}
  indices: UInt32 0x00000002
  nlabels: Int64 26

Are these two integers all that's required, or are there more complicated examples?

mcabbott, Feb 28 '24 14:02

I think this is all, but I am not an expert on CategoricalArrays.

schlichtanders, Feb 29 '24 07:02

See #54 for a start. We probably need someone to come up with a list of CategoricalArrays examples worth testing.

mcabbott, May 02 '25 14:05
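
On the question of whether more complicated examples exist: one candidate is a whole CategoricalVector, possibly with unused levels or missing entries. The following is only a minimal sketch and is not taken from #54. The name onehot_categorical is hypothetical. It assumes the accessors levelcode and levels from CategoricalArrays (in place of the internal ref and pool fields used in the REPL session earlier in the thread) and the OneHotMatrix(indices, nlabels) constructor from OneHotArrays. Rejecting missing entries is one possible policy among several, chosen here only to keep the sketch short.

```julia
using CategoricalArrays, OneHotArrays

# Hypothetical helper: not part of OneHotArrays.jl or CategoricalArrays.jl.
function onehot_categorical(v::CategoricalVector)
    # Assumption: missing entries are rejected rather than encoded.
    any(ismissing, v) && throw(ArgumentError("missing values are not supported in this sketch"))
    # levelcode(x) is the position of x within levels(v).
    codes = Int[levelcode(x) for x in v]
    # One column per element, one row per level (including unused levels).
    return OneHotMatrix(codes, length(levels(v)))
end
```

Because the number of rows comes from levels(v) rather than from the values that actually occur, a vector whose pool contains unused levels still yields one row per level. Whether that is the desired behaviour for onehotbatch is one of the cases a test list would need to settle.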