Consider shrinking list of dependencies
This package can be useful in other contexts without GPUs nor neural nets. Could you consider moving some of the dependencies to extensions? Below is the current list of hard dependencies:
[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
Compat = "34da2185-b29b-5c13-b0c7-acf172513d20"
GPUArraysCore = "46192b85-c4d5-4398-a991-12ede77f4527"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
The problem is that the principle thing a OneHotArray is good for is that multiplying from the left with a matrix can be done efficiently... and that calls NNlib.scatter:
https://github.com/FluxML/OneHotArrays.jl/blob/0b49d1b39ad5d1ab8ed212da3ff2e180985e2591/src/linalg.jl#L7-L11
It this were moved to an extension, what should happen when I call rand(1:99, 4, 3) * onehotbatch([1,1,1], 1:3) without loading NNlib?
Isn't that a very specific use case? I'd expect a slow fallback or error message since these arrays aren't usually multiplied in other algorithms.
Em qua., 16 de abr. de 2025, 19:24, Michael Abbott @.***> escreveu:
The problem is that the principle thing a OneHotArray is good for is that multiplying from the left with a matrix can be done efficiently... and that calls NNlib.scatter:
https://github.com/FluxML/OneHotArrays.jl/blob/0b49d1b39ad5d1ab8ed212da3ff2e180985e2591/src/linalg.jl#L7-L11
It this were moved to an extension, what should happen when I call rand(1:99, 4, 3) * onehotbatch([1,1,1], 1:3) without loading NNlib?
— Reply to this email directly, view it on GitHub https://github.com/FluxML/OneHotArrays.jl/issues/50#issuecomment-2810950734, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZQW3IOGAFZVBOLDTNDQTD2Z3KDNAVCNFSM6AAAAAB3JBNDWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDQMJQHE2TANZTGQ . You are receiving this because you authored the thread.Message ID: @.***> mcabbott left a comment (FluxML/OneHotArrays.jl#50) https://github.com/FluxML/OneHotArrays.jl/issues/50#issuecomment-2810950734
The problem is that the principle thing a OneHotArray is good for is that multiplying from the left with a matrix can be done efficiently... and that calls NNlib.scatter:
https://github.com/FluxML/OneHotArrays.jl/blob/0b49d1b39ad5d1ab8ed212da3ff2e180985e2591/src/linalg.jl#L7-L11
It this were moved to an extension, what should happen when I call rand(1:99, 4, 3) * onehotbatch([1,1,1], 1:3) without loading NNlib?
— Reply to this email directly, view it on GitHub https://github.com/FluxML/OneHotArrays.jl/issues/50#issuecomment-2810950734, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZQW3IOGAFZVBOLDTNDQTD2Z3KDNAVCNFSM6AAAAAB3JBNDWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDQMJQHE2TANZTGQ . You are receiving this because you authored the thread.Message ID: @.***>
In particular, I need a simple OneHotMatrix implementation to store a huge dataset of categorical values. Say I have 10^6 values with 5 levels. A naive Matrix consumes 5 x 10^6 bools. The ideal OneHotMatrix would wrap the original vector of categorical values and would provide getindex and setindex! in terms of the one-hot encoding.
Example:
using CategoricalArrays
categs = categorical([:a, :b, :b, :c, :d, :e])
onehot = onehotmatrix(categs)
onehot[1,1] # true
onehot[2,1] # false
onehot[3,1] # false
...
onehot[1,2] # false
onehot[2,2] # true
onehot[3,2] # false
...
onehot[2,1] = true # equivalent to setting categs[1] = :a
Is this usage in the scope of OneHotArrays.jl? Should this onehotmatrix (and onecoldmatrix) function be implemented in CategoricalArrays.jl instead?
In my current understanding, I think it would be better to move all functionality to CategoricalArrays.jl, then add NNlib and GPU-related packages as extensions for the fast matrix multiplication. Is there a feature in OneHotArrays.jl that justifies the existence of both packages?
I have never used CategoricalArrays, but it seems to be backed by a Vector{UInt32} which is exactly what OneHotArray likes too:
julia> using CategoricalArrays, OneHotArrays
julia> categs = categorical(string.([:a, :b, :b, :c, :d, :e])); # error says Symbol no longer supported
julia> categs.refs
6-element Vector{UInt32}:
0x00000001
0x00000002
0x00000002
0x00000003
0x00000004
0x00000005
julia> oh = OneHotMatrix(categs.refs, Int(maximum(categs.refs)))
5×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
1 ⋅ ⋅ ⋅ ⋅ ⋅
⋅ 1 1 ⋅ ⋅ ⋅
⋅ ⋅ ⋅ 1 ⋅ ⋅
⋅ ⋅ ⋅ ⋅ 1 ⋅
⋅ ⋅ ⋅ ⋅ ⋅ 1
julia> oh[1,1], oh[2,1]
(true, false)
julia> oh[2,1] = true # with PR 51 here
true
julia> categs # has been updated too
6-element CategoricalArray{String,1,UInt32}:
"b"
"b"
"b"
"c"
"d"
"e"
For setindex! via #51 to mutate the CategoricalArray too, it matters that the OneHotMatrix constructor wraps the same Vector{UInt32}. That won't happen with oh2 = onehotbatch(categs, unique(categs)) below...
...and in general this package hasn't thought carefully about when it does and doesn't preserve such references, as everything was immutable anyway. Maybe it's always true that uppercase OneHotMatrix just wraps without copying?
julia> oh2 = onehotbatch(categs, unique(categs));
julia> oh2[3,1] = true;
julia> oh2
4×6 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
⋅ 1 1 ⋅ ⋅ ⋅
⋅ ⋅ ⋅ 1 ⋅ ⋅
1 ⋅ ⋅ ⋅ 1 ⋅
⋅ ⋅ ⋅ ⋅ ⋅ 1
julia> categs # not changed
6-element CategoricalArray{String,1,UInt32}:
"b"
"b"
"b"
"c"
"d"
"e"
That said I don't see a reason to regard this as out-of-scope. A method OneHotArray(:: CategoricalArray) could certainly live in an extension here. It will forget categs.pool, I hope that seems OK. Edit, xref #45
See #54 for a start on CategoricalArrays support... and having package extensions.