DataArrays.jl

Category to integer mapping: bikeshedding session

Open johnmyleswhite opened this issue 11 years ago • 9 comments

After merging #52, we should provide a tool that constructs a mapping from the levels of a PooledDataArray to the integers. This function should make clear that the mapping is ad hoc and not related to the underlying representation of the data.

Should we call it levelsmap?

johnmyleswhite, Jan 07 '14 15:01

I believe this is a very basic device that is useful outside of this package. For example, it may also be useful in constructing contingency tables (see https://github.com/JuliaStats/Stats.jl/issues/32).

What if we implement this in Stats.jl, so that other packages that want it can use it too?

lindahua, Jan 07 '14 15:01

That works for me. What interface will we be using? Something like `levelsmap(["a", "b", "A", "A"]) -> ["a" => 1, "b" => 2, "A" => 3]`?
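To make the proposed interface concrete, here is a minimal sketch in current Julia syntax. The name `levelsmap_sketch` is hypothetical, and numbering levels in order of first appearance is an assumption taken from the example above, not a settled design.

```julia
# Hypothetical sketch: map each distinct value to an integer in 1..n,
# numbered in order of first appearance (an assumption, not a decided rule).
function levelsmap_sketch(xs::AbstractVector{T}) where {T}
    mapping = Dict{T,Int}()
    for x in xs
        # get! only stores the next unused integer when x has not been seen yet
        get!(mapping, x, length(mapping) + 1)
    end
    return mapping
end

m = levelsmap_sketch(["a", "b", "A", "A"])
```

The result is a plain `Dict`, so the mapping stays independent of how a pooled array stores its reference codes.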

johnmyleswhite, Jan 07 '14 15:01

FWIW, there's already `indexmap` in Stats.jl (see https://github.com/JuliaStats/Stats.jl#miscelleneous-functions).

lindahua, Jan 07 '14 15:01

We don't want to use that as our canonical numbering, right? That could produce numbers that are spaced very unevenly.
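A small illustration of the spacing concern, assuming `indexmap` maps each value to the position of its first occurrence. The helper `first_index_map` below is a hand-written, hypothetical stand-in rather than the Stats.jl function itself.

```julia
# Hypothetical stand-in: record the position at which each value first appears.
function first_index_map(xs::AbstractVector{T}) where {T}
    d = Dict{T,Int}()
    for (i, x) in enumerate(xs)
        haskey(d, x) || (d[x] = i)
    end
    return d
end

xs = ["a", "a", "a", "b", "a", "a", "c"]

# First-occurrence positions depend on where values happen to sit in the data.
positions = first_index_map(xs)

# A compact numbering uses consecutive integers regardless of position.
compact = Dict(lv => i for (i, lv) in enumerate(unique(xs)))
```

Here `positions` assigns 1, 4 and 7 to the three levels, while `compact` assigns 1, 2 and 3.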

johnmyleswhite, Jan 07 '14 15:01

Oh, yes ... we can then add a levelmap method for this?

lindahua, Jan 07 '14 15:01

That seems right to me. Should I submit a PR to discuss implementation details?

johnmyleswhite, Jan 07 '14 15:01

Sure.

lindahua, Jan 07 '14 15:01

Thinking about it more, we probably want a data structure that maintains a cross-reference between levels and indexes.

Something along these lines?

```julia
immutable LevelMap{T}
    levels::Vector{T}      # index -> level
    indmap::Dict{T, Int}   # level -> index
end
```

with some functions to make doing the translation convenient.
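As a rough sketch of what those convenience functions might look like, here is the same idea in current Julia, where `immutable` is spelled `struct`. The constructor and the names `encode` and `decode` are hypothetical and not part of any package. Ordering levels by first appearance is also an assumption.

```julia
# Sketch only: a two-way mapping between levels and integer indexes.
struct LevelMap{T}
    levels::Vector{T}      # index -> level
    indmap::Dict{T,Int}    # level -> index
end

# Hypothetical constructor: levels are numbered in order of first appearance.
function LevelMap(xs::AbstractVector{T}) where {T}
    levels = unique(xs)
    indmap = Dict{T,Int}(lv => i for (i, lv) in enumerate(levels))
    return LevelMap{T}(levels, indmap)
end

# Hypothetical helpers for translating in both directions.
encode(lm::LevelMap, xs) = [lm.indmap[x] for x in xs]   # levels -> indexes
decode(lm::LevelMap, codes) = lm.levels[codes]          # indexes -> levels

lm = LevelMap(["a", "b", "A", "A"])
codes = encode(lm, ["a", "b", "A", "A"])
back = decode(lm, [3, 1])
```

Keeping both the vector and the dictionary costs extra memory, but lookups in either direction avoid a linear scan.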

lindahua, Jan 07 '14 15:01

Not sure. For involved indexing operations, it seems like you'd want to maintain all of the indices for each level since that would make it much easier to repeat the levels calculation on subsets of the data.
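One way to read this suggestion is to keep, for every level, the positions at which it occurs. The sketch below assumes that reading. The names `level_indices` and `levels_in_subset` are hypothetical.

```julia
# Hypothetical sketch: store every position at which each level occurs.
function level_indices(xs::AbstractVector{T}) where {T}
    idx = Dict{T,Vector{Int}}()
    for (i, x) in enumerate(xs)
        # get! creates an empty position list the first time a level is seen
        push!(get!(() -> Int[], idx, x), i)
    end
    return idx
end

# Levels present in a subset of rows, found without rescanning the data.
function levels_in_subset(idx::Dict{T,Vector{Int}}, rows) where {T}
    keep = Set(rows)
    return [lv for (lv, positions) in idx if any(in(keep), positions)]
end

xs = ["a", "b", "a", "c", "b"]
idx = level_indices(xs)
present = sort(levels_in_subset(idx, 1:3))
```

The result is sorted only because `Dict` iteration order is unspecified. The trade-off is memory proportional to the length of the data rather than to the number of levels.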

johnmyleswhite, Jan 07 '14 16:01