OneHotArrays.jl

Consider shrinking list of dependencies

Open juliohm opened this issue 8 months ago • 5 comments

This package can be useful in other contexts, without GPUs or neural nets. Could you consider moving some of the dependencies to extensions? Below is the current list of hard dependencies:

[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
Compat = "34da2185-b29b-5c13-b0c7-acf172513d20"
GPUArraysCore = "46192b85-c4d5-4398-a991-12ede77f4527"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
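
For reference, here is a rough sketch of what the request could look like in Project.toml, using the package-extension mechanism available since Julia 1.9. Which dependencies to move is exactly what is under discussion, so the choice of GPUArraysCore and NNlib below is only an assumption, and the extension module names are made up for illustration.

```toml
# Hypothetical layout, not the package's actual Project.toml.
[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
Compat = "34da2185-b29b-5c13-b0c7-acf172513d20"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[weakdeps]
GPUArraysCore = "46192b85-c4d5-4398-a991-12ede77f4527"
NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"

[extensions]
# Extension module names are invented for this sketch.
OneHotArraysGPUArraysCoreExt = "GPUArraysCore"
OneHotArraysNNlibExt = "NNlib"
```

With this layout, the code in each extension module would only be loaded once the user also loads the corresponding weak dependency.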

juliohm avatar Apr 16 '25 18:04 juliohm

The problem is that the principal thing a OneHotArray is good for is that multiplying from the left with a matrix can be done efficiently... and that calls NNlib.scatter:

https://github.com/FluxML/OneHotArrays.jl/blob/0b49d1b39ad5d1ab8ed212da3ff2e180985e2591/src/linalg.jl#L7-L11

If this were moved to an extension, what should happen when I call rand(1:99, 4, 3) * onehotbatch([1,1,1], 1:3) without loading NNlib?
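
To make the efficiency point concrete: for a true one-hot matrix, the product above just picks columns of the left-hand matrix. The sketch below uses only Base, with a dense Bool matrix standing in for onehotbatch([1,1,1], 1:3); the variable names are illustrative and this is not the code in src/linalg.jl.

```julia
# Dense stand-in for onehotbatch([1, 1, 1], 1:3); no packages required.
indices = [1, 1, 1]
nlabels = 3
dense_onehot = [i == indices[j] for i in 1:nlabels, j in eachindex(indices)]  # 3×3 Matrix{Bool}

A = collect(reshape(1:12, 4, 3))   # 4×3 Matrix{Int}

via_matmul = A * dense_onehot      # generic matrix product over all entries
via_gather = A[:, indices]         # column lookup, no multiplications at all

@assert via_matmul == via_gather
```

So a fallback that works without NNlib does not have to be a dense matrix product; plain column indexing gives the same result for this case.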

mcabbott avatar Apr 16 '25 22:04 mcabbott

Isn't that a very specific use case? I'd expect a slow fallback or error message since these arrays aren't usually multiplied in other algorithms.

juliohm avatar Apr 16 '25 22:04 juliohm

In particular, I need a simple OneHotMatrix implementation to store a huge dataset of categorical values. Say I have 10^6 values with 5 levels. A naive Matrix consumes 5 x 10^6 bools. The ideal OneHotMatrix would wrap the original vector of categorical values and would provide getindex and setindex! in terms of the one-hot encoding.

Example:

using CategoricalArrays

categs = categorical([:a, :b, :b, :c, :d, :e])
onehot = onehotmatrix(categs)

onehot[1,1] # true
onehot[2,1] # false
onehot[3,1] # false
...

onehot[1,2] # false
onehot[2,2] # true
onehot[3,2] # false
...

onehot[2,1] = true # equivalent to setting categs[1] = :b

Is this usage in the scope of OneHotArrays.jl? Should these onehotmatrix (and onecoldmatrix) functions be implemented in CategoricalArrays.jl instead?

In my current understanding, I think it would be better to move all functionality to CategoricalArrays.jl, then add NNlib and GPU-related packages as extensions for the fast matrix multiplication. Is there a feature in OneHotArrays.jl that justifies the existence of both packages?
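
To illustrate the kind of wrapper described above, here is a minimal sketch in plain Julia. LazyOneHot is a hypothetical type invented for this example; it is not part of OneHotArrays.jl or CategoricalArrays.jl, and it assumes the categorical values are already available as a vector of integer codes.

```julia
# Hypothetical type for illustration only.
struct LazyOneHot{V<:AbstractVector{<:Integer}} <: AbstractMatrix{Bool}
    refs::V       # one integer code per observation (wrapped, not copied)
    nlevels::Int  # number of rows in the one-hot view
end

Base.size(x::LazyOneHot) = (x.nlevels, length(x.refs))

function Base.getindex(x::LazyOneHot, i::Int, j::Int)
    checkbounds(x, i, j)
    return x.refs[j] == i
end

function Base.setindex!(x::LazyOneHot, v, i::Int, j::Int)
    checkbounds(x, i, j)
    # Only writing `true` is unambiguous: column j then encodes level i.
    v == true || throw(ArgumentError("only `true` can be written to a one-hot column"))
    x.refs[j] = i
    return x
end

codes = UInt32[1, 2, 2, 3, 4, 5]
oh = LazyOneHot(codes, 5)

before = (oh[1, 1], oh[2, 1])   # (true, false)
oh[2, 1] = true                 # rewrites codes[1] to 2
```

Storage is one code per observation, so the saving over a dense Bool matrix depends on the width of the code type and the number of levels.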

juliohm avatar Apr 17 '25 12:04 juliohm

I have never used CategoricalArrays, but it seems to be backed by a Vector{UInt32} which is exactly what OneHotArray likes too:

julia> using CategoricalArrays, OneHotArrays

julia> categs = categorical(string.([:a, :b, :b, :c, :d, :e]));  # error says Symbol no longer supported

julia> categs.refs
6-element Vector{UInt32}:
 0x00000001
 0x00000002
 0x00000002
 0x00000003
 0x00000004
 0x00000005

julia> oh = OneHotMatrix(categs.refs, Int(maximum(categs.refs)))
5×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 1  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  1  1  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  1  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  1  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  1

julia> oh[1,1], oh[2,1]
(true, false)

julia> oh[2,1] = true  # with PR 51 here
true

julia> categs  # has been updated too
6-element CategoricalArray{String,1,UInt32}:
 "b"
 "b"
 "b"
 "c"
 "d"
 "e"

For setindex! via #51 to mutate the CategoricalArray too, it matters that the OneHotMatrix constructor wraps the same Vector{UInt32}. That won't happen with oh2 = onehotbatch(categs, unique(categs)) below...

...and in general this package hasn't thought carefully about when it does and doesn't preserve such references, as everything was immutable anyway. Maybe it's always true that uppercase OneHotMatrix just wraps without copying?

julia> oh2 = onehotbatch(categs, unique(categs));

julia> oh2[3,1] = true;

julia> oh2
4×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 ⋅  1  1  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  1  ⋅  ⋅
 1  ⋅  ⋅  ⋅  1  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  1

julia> categs  # not changed
6-element CategoricalArray{String,1,UInt32}:
 "b"
 "b"
 "b"
 "c"
 "d"
 "e"

That said, I don't see a reason to regard this as out-of-scope. A method OneHotArray(::CategoricalArray) could certainly live in an extension here. It will forget categs.pool; I hope that seems OK. Edit: xref #45

mcabbott avatar Apr 21 '25 21:04 mcabbott

See #54 for a start on CategoricalArrays support... and having package extensions.

mcabbott avatar May 02 '25 14:05 mcabbott