DataAPI.jl icon indicating copy to clipboard operation
DataAPI.jl copied to clipboard

Add isordered

Open bkamins opened this issue 4 years ago • 12 comments

Fixes https://github.com/JuliaData/DataAPI.jl/issues/23

bkamins avatar Oct 21 '20 19:10 bkamins

As I have commented to @nalimilan on slack. I think it is the time to rethink the API that DataAPI.jl for ref values should expose in general, as we are getting more container types supporting it. For example another function that CategoricalArrays.jl exposes is is droplevels! (or have droplevels that we do not have now that would create a new object - in case we do not require mutability which would make sense) - should this be supported in general (it would be useful to have it but I am not sure if it is feasible). Another question is if we think that DataAPI.refarray an be allowed to hold anything, or we want to be more restrictive and require that it holds positive integers + require that DataAPI.refpool should support AbstractVector interface then (such changes would make the API more useful in practice as then you could do many more common operations not caring about the concrete type that implements the API).

bkamins avatar Oct 27 '20 16:10 bkamins

The use case I was thinking of was in Econometrics.jl, I need to assert whether the outcome variable is ordinal or nominal for choosing the appropriate estimator. When people use CategoricalArray I can query that through isordered. The issue that has been happening is when datasets use the PooledArray which doesn't allow me to distinguish between the two cases.

Nosferican avatar Dec 14 '20 18:12 Nosferican

The issue that has been happening is when datasets use the PooledArray which doesn't allow me to distinguish between the two cases.

Yes but I think that's on purpose: we have split the old PooledDataArray into PooledArray and CategoricalArray to avoid the confusion between "efficiently store an array with few unique values" and "represent categorical data". People who have ordinal variables should use CategoricalArrays, not PooledArrays. What would be the point of having two packages for the same thing?

nalimilan avatar Dec 14 '20 18:12 nalimilan

What would be the point of having two packages for the same thing?

My perspective: PooledArrays.jl is lightweight and CategoricalArrays.jl is (unfortunately, but unavoidably) heavy because of this distinction.

bkamins avatar Dec 14 '20 19:12 bkamins

My perspective: PooledArrays.jl is lightweight and CategoricalArrays.jl is (unfortunately, but unavoidably) heavy because of this distinction.

I'm not sure that's true anymore now that we got rid of most invalidations.

nalimilan avatar Dec 14 '20 20:12 nalimilan

I'm not sure that's true anymore now that we got rid of most invalidations.

By "heavy" I did not mean problematic for Julia compiler, but rather that it has much more complex logic (e.g. the code for merging levels between two CategoricalArrays, having a custom CategoricalValue type etc.).

bkamins avatar Dec 14 '20 21:12 bkamins

I am not opposed to having different representations (e.g., efficient storage of few values vs categorical). What would be useful is to have any representation that does not support or uses an ordinal property to be queryable about it. Basically, provide an API that allows to get the information: does this representation offer this information / property and if it does how to query it.

Nosferican avatar Dec 14 '20 23:12 Nosferican

Maybe you should just assume that your users will be using scientific types from https://alan-turing-institute.github.io/MLJScientificTypes.jl/dev/?

bkamins avatar Dec 14 '20 23:12 bkamins

Likely not. Currently, the docs is to just use CategoricalArrays but every so often a user encounters an error due to using PooledArray. I am aiming to make the code more general and friendly rather than restricted / with more assumptions. However, having the API here might help clean up the MLJScientificTypes code.

Nosferican avatar Dec 14 '20 23:12 Nosferican

OK, then I think we should go with what I proposed above at https://github.com/JuliaData/DataAPI.jl/pull/26#discussion_r512873604.

nalimilan avatar Dec 28 '20 13:12 nalimilan

Notice that OrderedCollections.jl also defines isordered for ordered collections like OrderedDict.

Nosferican avatar Jan 26 '21 15:01 Nosferican

Good catch. Luckily it's not exported (https://github.com/JuliaCollections/OrderedCollections.jl/issues/45).

nalimilan avatar Jan 26 '21 16:01 nalimilan