CategoricalArrays.jl icon indicating copy to clipboard operation
CategoricalArrays.jl copied to clipboard

== for CategoricalPool

Open bkamins opened this issue 7 years ago • 17 comments

@nalimilan Currently:

x = categorical([1,2,3])
y = deepcopy(x)
x.pool == y.pool

produces false because for CategoricalPool the operation == is not defined and falls back to ===.

Is this intentional or should this be fixed?

bkamins avatar Dec 10 '17 21:12 bkamins

Additionally I am not sure given

a = categorical([1,2,3], ordered=true)
b = categorical([1,2,3], ordered=true)
levels!(b, [3,2,1])

we should have a==b return true.

The similar situation is when we compare unordered and ordered categorical.

I can accept both approaches (current and strict - checking ordering of levels), but I would welcome if it were documented.

Given current approach what is the recommended way to check if two CategoricalArrays have the same ordering?

bkamins avatar Dec 11 '17 08:12 bkamins

For arrays, I think the current definition is the best one since it's consistent with the AbstractArray fallback for ==. Any other definition would break that interface. You'll have to check equality of levels manually if you want a stricter test. In practice I don't think it will surprise anybody, and the inverse approach could give quite unexpected results.

Regarding pool equality, it's just that nobody cared enough to implement it. At this point CategoricalPool is an internal type, so only methods needed by CategoricalArray have been implemented (apart from some things which already existed). It would make sense to define == by calling == recursively on all fields (except the redundant ones).

nalimilan avatar Dec 11 '17 09:12 nalimilan

At this point CategoricalPool is an internal type

This is the reason I have asked. This type is exported so actually it is external. Actually I would find it cleaner if it were not exported.

bkamins avatar Dec 11 '17 12:12 bkamins

Funny, I didn't remember we exported it. I guess we could unexport it since it's not clear what API we want to support for people to work directly with CategoricalPool.

nalimilan avatar Dec 11 '17 12:12 nalimilan

I would unexport it - I do not see a use case for users to work with CategoricalPool directly. Should I make a PR?

bkamins avatar Dec 11 '17 16:12 bkamins

Unless somebody objects, yes.

nalimilan avatar Dec 11 '17 16:12 nalimilan

+1 for unexporting it

ararslan avatar Dec 11 '17 19:12 ararslan

I would recommend not to export it. The reason is that it is designed for internal use and other applications might be misleading, e.g.:

julia> x = categorical(["b", "c", "b"])
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "b"
 "c"
 "b"

julia> x[1] = "a"
"a"

julia> levels(x)
3-element Array{String,1}:
 "b"
 "c"
 "a"

julia> y = categorical(["a", "c", "b"])
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "c"
 "b"

julia> levels!(y, ["b", "c", "a"])
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "c"
 "b"

julia> levels(y)
3-element Array{String,1}:
 "b"
 "c"
 "a"

And it is clear that x.pool and y.pool are tightly related to x and y themselves (they are different because x and y hava different refs field).

Looking at usage in Feather and RData (BTW I have proposed one fix there https://github.com/JuliaData/RData.jl/pull/39) what I think would be better is to expose a constructor like:

CategoricalArray(refs, levels, ordered)

bkamins avatar Dec 26 '17 20:12 bkamins

I'm not sure what you mean with your examples. Can you explain the problem? It doesn't sound compeltely illegitimate in theory for users to create two CategoricalArray objects with the same pool. This could allow for efficient equality comparisons, e.g. for joins. I'm not sure we support it currently though.

Providing a higher-level APIs for common cases doesn't sound a bad idea anyway. Can you develop your proposal in a separate issue?

nalimilan avatar Dec 26 '17 21:12 nalimilan

I can make a PR for CategoricalArray(refs, levels, ordered) constructor. But whether it is needed depends on this discussion.

The problem with exposing CategoricalPool is that it has "two parts":

  • externally visible fields: levels and order
  • internal fields: all other

Focusing on this issue (i.e. ==). Should then == for two CategoricalPool be over all fields, or only externally visible fields: levels and order.

My initial intention was to make == compare on "visible" fields. because I thought that a normal user will not care or understand internal implementation. I guess they would say that two CategoricalPools are the same if they encode the same levels and have the same ordering property, so something like (I write it here working on CategoricalArrays and exported functions as this is currently an exposed API):

function equalpool(x::CategoricalArray, y::CategoricalArray)
    isordered(x) == isordered(y) || return false
    lx, ly = levels(x), levels(y)
    length(lx) == length(ly) || return false
    isordered(x) && return lx == ly # if ordered order must be identical
    isempty(setdiff(lx, ly)) # if unordered the set of levels must be identical
end

And for all this CategoricalPool does not have to be exported.

I understand what you say that:

  • == for two CategoricalPool should return true only if all fields are exactly equal;
  • we use it not for saying that two pools are externally indistinguishable but that actually we can directly compare values in refs fields to test for equality, which will improve the speed of comparison

I think this is useful (and in order for CategoricalArray match the speed of factor in R in certain cases necessary), but then we should probably make the fact that CategoricalArray has pool and refs fields a part of exposed API of CategoricalArray.

So in short - I am not against exporting CategoricalPool in general, but I feel that if we export it means that the whole design of CategoricalArray should be a public API of the package because only then CategoricalPool can be used meaningfully.

Observe that current use of CategoricalPool in RData and Feather actually makes use of unexported API. Why? Because:

  • in R factor/ordered makes sure that integer representation of levels is such that 1 is the first level, 2 is the second level etc. as returned by levels in R;
  • CategoricalPool when passed only levels by default also assigns them numbers the same way as R (but this is not a part of the exposed API - it is an implementation detail that might change in general - unless we make it a part of API).

bkamins avatar Dec 26 '17 23:12 bkamins

Hi

Any updates on this?

I think things like this are unexpected:

julia> using CategoricalArrays
julia> arr = categorical(["a", "b"]);
julia> isless(arr[1], arr[2])
true
julia> isless(arr[1], deepcopy(arr[2]))
ERROR: ArgumentError: CategoricalValue objects with different pools cannot be tested for order

ValdarT avatar Feb 04 '18 09:02 ValdarT

I do not know what @nalimilan but I would say that this is expected.

isless is currently designed so that it makes sense to be called only when comparing elements of the same CategoricalArray (actually to make sort work on it I guess).

For comparisons across two different CategoricalArrays you can use get:

julia> arr = categorical(["a", "b"]);

julia> arr2 = deepcopy(arr)
2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"

julia> isless(arr[1], arr2[2])
ERROR: ArgumentError: CategoricalValue objects with different pools cannot be tested for order
Stacktrace:
 [1] isless(::CategoricalArrays.CategoricalString{UInt32}, ::CategoricalArrays.CategoricalString{UInt32}) at D:\Software\JULIA_PKG\v0.6\CategoricalArrays\src\value.jl:127

julia> isless(get(arr[1]), get(arr2[2]))
true

bkamins avatar Feb 04 '18 10:02 bkamins

It is unexpected to me because when one tries to use < or > on an unordered CategoricalVector{CategoricalString} then the error message says: ...Use isless instead, or call the ordered! function.... This leaves the impression that isless is for more than internal use (as you seem to suggest) and I would expect it to work on the CategoricalValue as long the pools match. Perhaps it is then best to remove the mention of isless and just say that this comparison does not make sense for unordered CategoricalArray and to use ordered! to change that?

On a related note, for consistency I would expect this to also error:

categorical(["b"])[1] < "c"

Not that all this is awfully important.

ValdarT avatar Feb 04 '18 11:02 ValdarT

I wanted to say that in general even if pool would match they could encode different concepts (probably a rare event of such a coincidence). So that is why I understand @nalimilan implemented it this way. So isless is not internal, but meant to use within the single CategoricalArray (the question is if we want to document it). But I would be OK to relax it the way you suggest - up to @nalimilan to decide.

The second comparison categorical(["b"])[1] < "c" should work in my opinion as is, because here we explicitly want to unwrap categorical value to string.

@nalimilan - probably it is best if you comment on this because I am only guessing your intentions 😄.

bkamins avatar Feb 04 '18 12:02 bkamins

I think it would be fine to allow < and isless when pools have exactly the same levels (in the same order). In practice it's going to be very slow, though. Do you want to make a PR? Shouldn't be hard to implement.

nalimilan avatar Feb 04 '18 14:02 nalimilan

Happy to, but it might take me some time before I get to it. (I’m a bit overloaded currently.) But just to confirm – it should error if isordered is false not use the “general” method (e.g., lexicographic ordering for CategoricalStrings)?

ValdarT avatar Feb 04 '18 17:02 ValdarT

Sure, there's no hurry. < should error when isordered is false, but isless should succeed (as is the case currently) so that sorting works.

nalimilan avatar Feb 04 '18 17:02 nalimilan