AxisKeys.jl

sortkeys() changes key container type

Open takbal opened this issue 4 years ago • 5 comments

I think it is reasonable to expect that sorting keys should not change the key container type. This is not currently the case:

using AxisKeys, UniqueVectors

a = wrapdims(rand(2), UniqueVector, x=1:2)

println(typeof(axiskeys(a,:x)))

a = sortkeys(a)

println(typeof(axiskeys(a,:x)))

Produces:

UniqueVector{Int64}
Array{Int64,1}

takbal, Apr 16 '21 10:04

That would be nice to have, but it seems tricky to ensure. sortkeys does roughly this:

a = wrapdims(rand(2), UniqueVector, x=1:2)
perm = sortperm(a.x)
wrapdims(parent(a)[perm], x=a.x[perm])

and a.x[perm] doesn't know that this is a permutation -- getindex must equally expect e.g. a.x[[1,1,2]].

It might be possible to call sort(a.x) again, and trust that this will produce the same order? UniqueVectors can (and probably should) overload this to preserve the type.
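To make the getindex point concrete, here is a minimal sketch using a plain Vector as a stand-in for a.x (an assumption, so it runs without UniqueVectors). The same indexing method serves both calls, but only the first one keeps the keys unique:

```julia
ks = [30, 10, 20]            # stand-in for a.x, as a plain Vector

perm = sortperm(ks)          # a permutation of 1:3
sorted_keys = ks[perm]       # keys reordered, still unique

dup = ks[[1, 1, 2]]          # same getindex method, but the result repeats a key

# Base can tell the two index vectors apart; getindex itself does not check
perm_ok = isperm(perm)       # true
dup_ok = isperm([1, 1, 2])   # false
```

So a type-preserving getindex would have to either check isperm on every call or be told by the caller that the indices form a permutation.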

mcabbott, Apr 16 '21 13:04

UniqueVectors also produce an Array after sort(), but the in-place sort!() works. So this should do it for this container:

perm = sortperm(a.x)
newkeys = copy(a.x)
sort!(newkeys)
wrapdims(parent(a)[perm], x=newkeys)

Not sure how general this is for other containers, or whether sort! and sortperm are guaranteed to produce the same order.
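On the ordering question: when all keys are distinct under isless there is only one sorted order, so any correct sort has to agree with indexing by sortperm. A small check of that, again with a plain Vector standing in for the key container (an assumption; this says nothing about whether sort! works on a given container type):

```julia
ks = [41, 46, 19, 47, 21, 27, 16, 25, 45]   # distinct keys

perm = sortperm(ks)
newkeys = sort!(copy(ks))                   # in-place sort of a copy, as in the snippet above

# With distinct keys both routes give the same sequence
same_order = newkeys == ks[perm]
```

With repeated keys the two could in principle order equal elements differently, but equal keys are interchangeable as values, so the resulting key vector would still compare equal.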

takbal, Apr 16 '21 14:04

I get an error when I try sort!:

u = UniqueVector([41, 46, 19, 47, 21, 27, 16, 25, 45])
findfirst(isequal(u[5]), u) # fast method
su = sort!(u)  # ArgumentError: cannot set an element that exists elsewhere in UniqueVector
@which sort!(u) # generic one, from Base.Sort

But also, I don't think this can work in general, e.g. when the keys are a range, which cannot be sorted in place:

b = wrapdims(rand(3), y = 'c':-1:'a')
sortkeys(b, dims=:y)

mcabbott, Apr 16 '21 14:04

Indeed. UniqueVector seems to have even more fundamental problems, as views or indexing also change the key container type.

A possible solution is to define a function that enforces the key containers:

convert_kc(K::KeyedArray, container_type::Type=UniqueVector)::KeyedArray =
    KeyedArray( NamedDimsArray( parent(parent(K)), dimnames(K)), tuple( [ container_type(x) for x in axiskeys(K) ]...) )

and then call it after each operation that may need accelerated access in its output:

K = convert_kc(sortkeys(K))
K = convert_kc([K1 ; K2])
...

This is not pretty. Maybe KeyedArray could have a static Bool type parameter that, if true, calls this key container conversion at the end of each AxisKeys function that generates a KeyedArray result?

takbal, Apr 17 '21 15:04

Simple views behave well, e.g. @which findfirst(isequal(47), view(u, 2:7)). But for u[2:7] the cost of re-generating the lookup dictionary was felt to be too much, in https://github.com/garrison/UniqueVectors.jl/pull/9. Maybe you could do better: instead of re-hashing, just update the indices in the dictionary? Although u[inds] isn't always unique...

Base has a function permute! but no permute, which would be the perfect thing to overload here.
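A rough sketch of what such an overloadable function could look like. The names permutekeys and TaggedKeys are made up for illustration; neither exists in Base, AxisKeys or UniqueVectors, and TaggedKeys is only a toy stand-in for a container with an invariant worth preserving:

```julia
# Hypothetical generic fallback: plain indexing, which may change the container type.
permutekeys(ks::AbstractVector, perm) = ks[perm]

# Toy wrapper type (an assumption for illustration, not UniqueVector itself).
struct TaggedKeys{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(k::TaggedKeys) = size(k.data)
Base.IndexStyle(::Type{<:TaggedKeys}) = IndexLinear()
Base.getindex(k::TaggedKeys, i::Int) = k.data[i]

# Type-preserving method: the caller promises that `perm` is a permutation,
# so whatever invariant the container maintains is known to survive.
permutekeys(k::TaggedKeys, perm) = TaggedKeys(k.data[perm])

k = TaggedKeys([30, 10, 20])
perm = sortperm(k)

plain = k[perm]              # generic getindex falls back to a Vector
kept = permutekeys(k, perm)  # stays a TaggedKeys
```

sortkeys could then call permutekeys(a.x, perm) instead of a.x[perm], and each key container would opt in by adding its own method.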

A difficulty with a boolean flag is that the function needed to reconstruct a given type isn't obvious from the type. It could carry around this function, though. That still adds a fair bit of complication.

mcabbott, Apr 17 '21 15:04