CategoricalArrays.jl icon indicating copy to clipboard operation
CategoricalArrays.jl copied to clipboard

Move CategoricalValue and CategoricalPool into separate package

Open davidanthoff opened this issue 7 years ago • 10 comments

I'm starting to think about davidanthoff/IterableTables.jl#2, and the whole design would be a lot easier if packages could take a dependency on CategoricalValue without taking a dependency on the whole CategoricalArrays package. Maybe a package called CategoricalValues.jl would work?

davidanthoff avatar Apr 17 '17 21:04 davidanthoff

I'm really not a fan of this idea. The two parts are really tightly coupled, it would make further development more cumbersome to keep them synchronized.

What's the problem with depending on CategoricalArrays? It's a small pure-Julia dependency. I don't think you'd notice the size difference if we split the package in two parts.

nalimilan avatar Apr 17 '17 21:04 nalimilan

My eventual plan for IterableTables is to move the integrations of various sources and sinks into the respective integrated packages. At that point packages like https://github.com/JuliaComputing/PooledArrays.jl and https://github.com/JuliaStats/DataArrays.jl would have to take a dependency on CategoricalArrays, which seems not ideal.

In some sense this is analog to the Nullable and NullableArrays situation: it is really good that one can use Nullable without having to take a dependency on NullableArrays.

davidanthoff avatar Apr 17 '17 21:04 davidanthoff

The difference with Nullable is that the type is defined in Base and it's very simple. OTOH CategoricalValue relies on CategoricalPool, so at that point you may as well include CategoricalArray in the package.

I don't really understand why PooledArrays or DataArrays would have to depend on CategoricalValue. Could you develop?

nalimilan avatar Apr 17 '17 21:04 nalimilan

at that point you may as well include CategoricalArray in the package.

And that pulls in NullableArrays, and it would just be a lot nicer and cleaner if that didn't happen.

I don't really understand why PooledArrays or DataArrays would have to depend on CategoricalValue. Could you develop?

Actually, it would be DataFrames and IndexedTables that need to take on a dependency on CategoricalValue, not the array packages, sorry.

If the sink for DataFrames gets an iterator of NamedTuples where one of the fields is of type CategoricalValue, it should use PooledDataArray for that column. Equally, if the sink for IndexedTables gets an iterator of NamedTuples where one of the fields is of type CategoricalValue, it should use PooledArray for that column. So both DataFrames and IndexedTables would need to know CategoricalValue, but not CategoricalArray.

davidanthoff avatar Apr 17 '17 21:04 davidanthoff

And that pulls in NullableArrays, and it would just be a lot nicer and cleaner if that didn't happen.

If that's the problem, then it can easily be fixed. Actually, we've discussed getting rid of this dependency in the past, as it's only used for the return type of unique(::NullableCategoricalArray). We could return an Array{Nullable} there since the number of unique values isn't supposed to be very large.

If the sink for DataFrames gets an iterator of NamedTuples where one of the fields is of type CategoricalValue, it should use PooledDataArray for that column. Equally, if the sink for IndexedTables gets an iterator of NamedTuples where one of the fields is of type CategoricalValue, it should use PooledArray for that column. So both DataFrames and IndexedTables would need to know CategoricalValue, but not CategoricalArray.

I think we need a more general abstraction for CategoricalArray/PooledDataArray/PooledArray. CategoricalValue isn't the best way of streaming data from a CategoricalVector to a PDA, since it carries a lot of useless information for that purpose and is a very inefficient way to extract the integer code for a given entry. I think we would better define a way for columns to come with meta-data giving a mapping from integer code to levels, and a way to iterate over the integer codes. That would not only be useful to convert between these types, but also to read data from files in formats like Stata (which associates labels to codes). We also need a generic way to get the levels in order to create model matrices in StatsModels.

nalimilan avatar Apr 18 '17 07:04 nalimilan

since it carries a lot of useless information for that purpose and is a very inefficient way to extract the integer code for a given entry.

Are you referring to the pointer to the pool?

I'll have to mull this whole area a bit more, it is not really clear to me what we want in this area... Especially when it comes to queries, it seems there is another lifting-like issue there, i.e. what to return if one applies some function onto a categorical value etc.

davidanthoff avatar Apr 18 '17 19:04 davidanthoff

Are you referring to the pointer to the pool?

Yes, but more generally it will be much more efficient to work directly with integer codes than creating a CategoricalValue (whatever its contents). Unless of course the compiler is able to optimize out the unnecessary operations, which I don't think occurs yet.

I'll have to mull this whole area a bit more, it is not really clear to me what we want in this area... Especially when it comes to queries, it seems there is another lifting-like issue there, i.e. what to return if one applies some function onto a categorical value etc.

Yeah, it's not very easy to decide what to do with operations on CategoricalValue. In general it seems automatic unwrapping should happen, except where a specific method is needed. In particular, it would make sense to have CategoricalValue{String} <: AbstractString if this kind of inheritance pattern was possible. Since it's not, I'm tempted to define CategoricalValue <: AbstractString, which will make it nicer to work with the most common case of CategoricalValue{String}. Then implementing the AbstractString interface with unwrapping methods should be all we need for a useful behavior.

nalimilan avatar Apr 19 '17 12:04 nalimilan

Yes, but more generally it will be much more efficient to work directly with integer codes than creating a CategoricalValue (whatever its contents). Unless of course the compiler is able to optimize out the unnecessary operations, which I don't think occurs yet.

Hm, I would have thought that the compiler could optimize that away, but you probably have better insight into that than me. For the whole IterableTables and Query world we won't be able to stream just Ints, it will have to be some immutable in any case.

Here is a crazy idea: could the levels be encoded as a type parameter? Something like this:

immutable CategoricalValue{LEVELS}
    index::Int
end

getvalue{LEVELS}(a::CategoricalValue{LEVELS}) = string(LEVELS.parameters[a.index].parameters[1])

a = CategoricalValue{Tuple{Val{:level1},Val{:level2}}}(2)

println(getvalue(a))

It does feel like a gross abuse of the type system and I could easily imagine that this is a really terrible idea, but maybe worth investigating a bit?

I'm tempted to define CategoricalValue <: AbstractString, which will make it nicer to work with the most common case of CategoricalValue{String}. Then implementing the AbstractString interface with unwrapping methods should be all we need for a useful behavior.

I like that a lot.

davidanthoff avatar Apr 19 '17 17:04 davidanthoff

I think the way forward here is to start moving to Union{CategoricalValue, Null} instead of Nullable{CategoricalValue}, which will allow us to get rid of the dependency on NullableArrays. See https://github.com/JuliaData/DataTables.jl/issues/62.

It does feel like a gross abuse of the type system and I could easily imagine that this is a really terrible idea, but maybe worth investigating a bit?

This would be essentially equivalent to creating an enum for each variable and storing an array of enum values. This can be efficient for some very specific use cases, but in general I don't think this would be a good idea as recompiling functions for each set of levels would be wasteful. This has been discussed before at https://github.com/JuliaStats/DataArrays.jl/issues/50 and https://github.com/JuliaStats/DataArrays.jl/issues/73.

nalimilan avatar May 14 '17 13:05 nalimilan

This seems a bit moot since the small unions era.

Nosferican avatar Feb 26 '19 19:02 Nosferican