CategoricalArrays.jl
CategoricalArrays.jl copied to clipboard
Move CategoricalValue and CategoricalPool into separate package
I'm starting to think about davidanthoff/IterableTables.jl#2, and the whole design would be a lot easier if packages could take a dependency on CategoricalValue
without taking a dependency on the whole CategoricalArrays
package. Maybe a package called CategoricalValues.jl
would work?
I'm really not a fan of this idea. The two parts are really tightly coupled, it would make further development more cumbersome to keep them synchronized.
What's the problem with depending on CategoricalArrays? It's a small pure-Julia dependency. I don't think you'd notice the size difference if we split the package in two parts.
My eventual plan for IterableTables
is to move the integrations of various sources and sinks into the respective integrated packages. At that point packages like https://github.com/JuliaComputing/PooledArrays.jl and https://github.com/JuliaStats/DataArrays.jl would have to take a dependency on CategoricalArrays, which seems not ideal.
In some sense this is analog to the Nullable
and NullableArrays
situation: it is really good that one can use Nullable
without having to take a dependency on NullableArrays
.
The difference with Nullable
is that the type is defined in Base and it's very simple. OTOH CategoricalValue
relies on CategoricalPool
, so at that point you may as well include CategoricalArray
in the package.
I don't really understand why PooledArrays or DataArrays would have to depend on CategoricalValue
. Could you develop?
at that point you may as well include CategoricalArray in the package.
And that pulls in NullableArrays
, and it would just be a lot nicer and cleaner if that didn't happen.
I don't really understand why PooledArrays or DataArrays would have to depend on CategoricalValue. Could you develop?
Actually, it would be DataFrames
and IndexedTables
that need to take on a dependency on CategoricalValue
, not the array packages, sorry.
If the sink for DataFrames
gets an iterator of NamedTuples
where one of the fields is of type CategoricalValue
, it should use PooledDataArray
for that column. Equally, if the sink for IndexedTables
gets an iterator of NamedTuples
where one of the fields is of type CategoricalValue
, it should use PooledArray
for that column. So both DataFrames
and IndexedTables
would need to know CategoricalValue
, but not CategoricalArray
.
And that pulls in NullableArrays, and it would just be a lot nicer and cleaner if that didn't happen.
If that's the problem, then it can easily be fixed. Actually, we've discussed getting rid of this dependency in the past, as it's only used for the return type of unique(::NullableCategoricalArray)
. We could return an Array{Nullable}
there since the number of unique values isn't supposed to be very large.
If the sink for DataFrames gets an iterator of NamedTuples where one of the fields is of type CategoricalValue, it should use PooledDataArray for that column. Equally, if the sink for IndexedTables gets an iterator of NamedTuples where one of the fields is of type CategoricalValue, it should use PooledArray for that column. So both DataFrames and IndexedTables would need to know CategoricalValue, but not CategoricalArray.
I think we need a more general abstraction for CategoricalArray
/PooledDataArray
/PooledArray
. CategoricalValue
isn't the best way of streaming data from a CategoricalVector
to a PDA, since it carries a lot of useless information for that purpose and is a very inefficient way to extract the integer code for a given entry. I think we would better define a way for columns to come with meta-data giving a mapping from integer code to levels, and a way to iterate over the integer codes. That would not only be useful to convert between these types, but also to read data from files in formats like Stata (which associates labels to codes). We also need a generic way to get the levels in order to create model matrices in StatsModels.
since it carries a lot of useless information for that purpose and is a very inefficient way to extract the integer code for a given entry.
Are you referring to the pointer to the pool?
I'll have to mull this whole area a bit more, it is not really clear to me what we want in this area... Especially when it comes to queries, it seems there is another lifting-like issue there, i.e. what to return if one applies some function onto a categorical value etc.
Are you referring to the pointer to the pool?
Yes, but more generally it will be much more efficient to work directly with integer codes than creating a CategoricalValue
(whatever its contents). Unless of course the compiler is able to optimize out the unnecessary operations, which I don't think occurs yet.
I'll have to mull this whole area a bit more, it is not really clear to me what we want in this area... Especially when it comes to queries, it seems there is another lifting-like issue there, i.e. what to return if one applies some function onto a categorical value etc.
Yeah, it's not very easy to decide what to do with operations on CategoricalValue
. In general it seems automatic unwrapping should happen, except where a specific method is needed. In particular, it would make sense to have CategoricalValue{String} <: AbstractString
if this kind of inheritance pattern was possible. Since it's not, I'm tempted to define CategoricalValue <: AbstractString
, which will make it nicer to work with the most common case of CategoricalValue{String}
. Then implementing the AbstractString
interface with unwrapping methods should be all we need for a useful behavior.
Yes, but more generally it will be much more efficient to work directly with integer codes than creating a CategoricalValue (whatever its contents). Unless of course the compiler is able to optimize out the unnecessary operations, which I don't think occurs yet.
Hm, I would have thought that the compiler could optimize that away, but you probably have better insight into that than me. For the whole IterableTables
and Query
world we won't be able to stream just Int
s, it will have to be some immutable
in any case.
Here is a crazy idea: could the levels be encoded as a type parameter? Something like this:
immutable CategoricalValue{LEVELS}
index::Int
end
getvalue{LEVELS}(a::CategoricalValue{LEVELS}) = string(LEVELS.parameters[a.index].parameters[1])
a = CategoricalValue{Tuple{Val{:level1},Val{:level2}}}(2)
println(getvalue(a))
It does feel like a gross abuse of the type system and I could easily imagine that this is a really terrible idea, but maybe worth investigating a bit?
I'm tempted to define CategoricalValue <: AbstractString, which will make it nicer to work with the most common case of CategoricalValue{String}. Then implementing the AbstractString interface with unwrapping methods should be all we need for a useful behavior.
I like that a lot.
I think the way forward here is to start moving to Union{CategoricalValue, Null}
instead of Nullable{CategoricalValue}
, which will allow us to get rid of the dependency on NullableArrays. See https://github.com/JuliaData/DataTables.jl/issues/62.
It does feel like a gross abuse of the type system and I could easily imagine that this is a really terrible idea, but maybe worth investigating a bit?
This would be essentially equivalent to creating an enum for each variable and storing an array of enum values. This can be efficient for some very specific use cases, but in general I don't think this would be a good idea as recompiling functions for each set of levels would be wasteful. This has been discussed before at https://github.com/JuliaStats/DataArrays.jl/issues/50 and https://github.com/JuliaStats/DataArrays.jl/issues/73.
This seems a bit moot since the small unions era.