CategoricalArrays.jl
CategoricalArrays.jl copied to clipboard
== for CategoricalPool
@nalimilan Currently:
x = categorical([1,2,3])
y = deepcopy(x)
x.pool == y.pool
produces false because for CategoricalPool the operation == is not defined and falls back to ===.
Is this intentional or should this be fixed?
Additionally I am not sure given
a = categorical([1,2,3], ordered=true)
b = categorical([1,2,3], ordered=true)
levels!(b, [3,2,1])
we should have a==b return true.
The similar situation is when we compare unordered and ordered categorical.
I can accept both approaches (current and strict - checking ordering of levels), but I would welcome if it were documented.
Given current approach what is the recommended way to check if two CategoricalArrays have the same ordering?
For arrays, I think the current definition is the best one since it's consistent with the AbstractArray fallback for ==. Any other definition would break that interface. You'll have to check equality of levels manually if you want a stricter test. In practice I don't think it will surprise anybody, and the inverse approach could give quite unexpected results.
Regarding pool equality, it's just that nobody cared enough to implement it. At this point CategoricalPool is an internal type, so only methods needed by CategoricalArray have been implemented (apart from some things which already existed). It would make sense to define == by calling == recursively on all fields (except the redundant ones).
At this point
CategoricalPoolis an internal type
This is the reason I have asked. This type is exported so actually it is external. Actually I would find it cleaner if it were not exported.
Funny, I didn't remember we exported it. I guess we could unexport it since it's not clear what API we want to support for people to work directly with CategoricalPool.
I would unexport it - I do not see a use case for users to work with CategoricalPool directly.
Should I make a PR?
Unless somebody objects, yes.
+1 for unexporting it
I would recommend not to export it. The reason is that it is designed for internal use and other applications might be misleading, e.g.:
julia> x = categorical(["b", "c", "b"])
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"b"
"c"
"b"
julia> x[1] = "a"
"a"
julia> levels(x)
3-element Array{String,1}:
"b"
"c"
"a"
julia> y = categorical(["a", "c", "b"])
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"c"
"b"
julia> levels!(y, ["b", "c", "a"])
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"c"
"b"
julia> levels(y)
3-element Array{String,1}:
"b"
"c"
"a"
And it is clear that x.pool and y.pool are tightly related to x and y themselves (they are different because x and y hava different refs field).
Looking at usage in Feather and RData (BTW I have proposed one fix there https://github.com/JuliaData/RData.jl/pull/39) what I think would be better is to expose a constructor like:
CategoricalArray(refs, levels, ordered)
I'm not sure what you mean with your examples. Can you explain the problem? It doesn't sound compeltely illegitimate in theory for users to create two CategoricalArray objects with the same pool. This could allow for efficient equality comparisons, e.g. for joins. I'm not sure we support it currently though.
Providing a higher-level APIs for common cases doesn't sound a bad idea anyway. Can you develop your proposal in a separate issue?
I can make a PR for CategoricalArray(refs, levels, ordered) constructor. But whether it is needed depends on this discussion.
The problem with exposing CategoricalPool is that it has "two parts":
- externally visible fields:
levelsandorder - internal fields: all other
Focusing on this issue (i.e. ==). Should then == for two CategoricalPool be over all fields, or only externally visible fields: levels and order.
My initial intention was to make == compare on "visible" fields. because I thought that a normal user will not care or understand internal implementation. I guess they would say that two CategoricalPools are the same if they encode the same levels and have the same ordering property, so something like (I write it here working on CategoricalArrays and exported functions as this is currently an exposed API):
function equalpool(x::CategoricalArray, y::CategoricalArray)
isordered(x) == isordered(y) || return false
lx, ly = levels(x), levels(y)
length(lx) == length(ly) || return false
isordered(x) && return lx == ly # if ordered order must be identical
isempty(setdiff(lx, ly)) # if unordered the set of levels must be identical
end
And for all this CategoricalPool does not have to be exported.
I understand what you say that:
==for twoCategoricalPoolshould return true only if all fields are exactly equal;- we use it not for saying that two pools are externally indistinguishable but that actually we can directly compare values in
refsfields to test for equality, which will improve the speed of comparison
I think this is useful (and in order for CategoricalArray match the speed of factor in R in certain cases necessary), but then we should probably make the fact that CategoricalArray has pool and refs fields a part of exposed API of CategoricalArray.
So in short - I am not against exporting CategoricalPool in general, but I feel that if we export it means that the whole design of CategoricalArray should be a public API of the package because only then CategoricalPool can be used meaningfully.
Observe that current use of CategoricalPool in RData and Feather actually makes use of unexported API. Why? Because:
- in R
factor/orderedmakes sure that integer representation of levels is such that 1 is the first level, 2 is the second level etc. as returned bylevelsin R; CategoricalPoolwhen passed onlylevelsby default also assigns them numbers the same way as R (but this is not a part of the exposed API - it is an implementation detail that might change in general - unless we make it a part of API).
Hi
Any updates on this?
I think things like this are unexpected:
julia> using CategoricalArrays
julia> arr = categorical(["a", "b"]);
julia> isless(arr[1], arr[2])
true
julia> isless(arr[1], deepcopy(arr[2]))
ERROR: ArgumentError: CategoricalValue objects with different pools cannot be tested for order
I do not know what @nalimilan but I would say that this is expected.
isless is currently designed so that it makes sense to be called only when comparing elements of the same CategoricalArray (actually to make sort work on it I guess).
For comparisons across two different CategoricalArrays you can use get:
julia> arr = categorical(["a", "b"]);
julia> arr2 = deepcopy(arr)
2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"b"
julia> isless(arr[1], arr2[2])
ERROR: ArgumentError: CategoricalValue objects with different pools cannot be tested for order
Stacktrace:
[1] isless(::CategoricalArrays.CategoricalString{UInt32}, ::CategoricalArrays.CategoricalString{UInt32}) at D:\Software\JULIA_PKG\v0.6\CategoricalArrays\src\value.jl:127
julia> isless(get(arr[1]), get(arr2[2]))
true
It is unexpected to me because when one tries to use < or > on an unordered CategoricalVector{CategoricalString} then the error message says: ...Use isless instead, or call the ordered! function.... This leaves the impression that isless is for more than internal use (as you seem to suggest) and I would expect it to work on the CategoricalValue as long the pools match.
Perhaps it is then best to remove the mention of isless and just say that this comparison does not make sense for unordered CategoricalArray and to use ordered! to change that?
On a related note, for consistency I would expect this to also error:
categorical(["b"])[1] < "c"
Not that all this is awfully important.
I wanted to say that in general even if pool would match they could encode different concepts (probably a rare event of such a coincidence). So that is why I understand @nalimilan implemented it this way. So isless is not internal, but meant to use within the single CategoricalArray (the question is if we want to document it).
But I would be OK to relax it the way you suggest - up to @nalimilan to decide.
The second comparison categorical(["b"])[1] < "c" should work in my opinion as is, because here we explicitly want to unwrap categorical value to string.
@nalimilan - probably it is best if you comment on this because I am only guessing your intentions 😄.
I think it would be fine to allow < and isless when pools have exactly the same levels (in the same order). In practice it's going to be very slow, though. Do you want to make a PR? Shouldn't be hard to implement.
Happy to, but it might take me some time before I get to it. (I’m a bit overloaded currently.)
But just to confirm – it should error if isordered is false not use the “general” method (e.g., lexicographic ordering for CategoricalStrings)?
Sure, there's no hurry. < should error when isordered is false, but isless should succeed (as is the case currently) so that sorting works.