DimensionalData.jl
DimensionalData.jl copied to clipboard
Allow Between on `Unordered`
Hi,
I'm created a DimArray
with multiple dimensions where the dimensions have categorical data. Sometimes I get Unordered
or ForwardOrdered
dimensions. The categorical data is always an eltype(dim) == String
.
I'm assuming it does an alphabetical sort and checks if that matches the original data and if it does it creates a ForwardOrdered
and otherwise it creates Unordered
.
Is there a way to force it to always create ForwardOrdered
dimensions? We have strings that are not ordered alphabetically but are always going to be in the same order and we would like to use Between("String1", "String2")
in some cases.
Thanks!
You can fully construct a Categorical
or Sampled
lookup yourself if you know the values. Rasters.jl does this.
Autodection is for end users - strings and symbols are always categorical, and sorting is checked in the constructor.
But the order
actually has meaning in dispatch - it tells us if we can do a binary search with searchsortedfirst
and which direction to do it in. Unordered
needs to use the slower findfirst
.
If you use the wrong type you will get wrong answers in At and Between.
I guess we could make Between work on Unordered
?
It just needs a between
function for that using findfirst
. Its in the selectors.jl file in the LookupArrays module.
(This wasnt implemented because Between is mostly synonymous and now replaced by ..
and selecting intervals over unsorted lists is kinda weird)
You can fully construct a Categorical or Sampled lookup yourself if you know the values.
Do you have an example of this? I would like to construct @dim Enduse
or @dim Tech
as well to build types of these dimensions but I'm only seeing examples of Categorical
data used with X
or Y
I was expecting something like this to work but I'm getting errors:
julia> @dim Tech
julia> DimArray(zeros(2,2), (Enduse = ["hello", "world"], Tech = Tech(DD.Categorical(["hello", "world"], order=DimensionalData.ForwardOrdered()))))
ERROR: MethodError: no method matching format(::Tech{DimensionalData.Dimensions.LookupArrays.Categorical{String, Vector{String}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, ::Type{Tech}, ::Base.OneTo{Int64})
Some of our data has the following form (or at least in debugging it'll be useful to be able to generate this following form):
julia> Enduse_short_name = ["SF", "MF"];
julia> Enduse_long_name = ["Single Family", "Small Family"];
julia> @assert length(Enduse_short_name) == length(Enduse_long_name)
julia> @assert sort(Enduse_short_name) != Enduse_short_name
julia> @assert sort(Enduse_long_name) == Enduse_long_name
variable1 = DimArray(..., [Enduse => Enduse_short_name, ...])
variable2 = DimArray(..., [Enduse => Enduse_long_name, ...])
variable1[Enduse = Between("SF", "MF")] == variable2[Enduse = Between("Small Family", "Single Family")]
The type that gets generated for variable1
is not the same as the type that gets generated for variable2
. So right now, Between
works only for the sorted dimension.
I would basically like to force ForwardOrdered
for all the categorical data that I have.
I constructed them as so:
dimY = Y(Categorical([:al,:bet,:gim,:dal]; order = ForwardOrdered()))
dimX = X(Categorical([:a,:b,:g,:d]; order = ForwardOrdered()))
Then:
DimArray(zeros(4,4), (dimX,dimY))
... which results in (for me):
4×4 DimArray{Float64,2} with dimensions:
X Categorical{Symbol} Symbol[:a, :b, :g, :d] ForwardOrdered,
Y Categorical{Symbol} Symbol[:al, :bet, :gim, :dal] ForwardOrdered
:al :bet :gim :dal
:a 0.0 0.0 0.0 0.0
:b 0.0 0.0 0.0 0.0
:g 0.0 0.0 0.0 0.0
:d 0.0 0.0 0.0 0.0
@kdheepak there is just a mistake in your code - you have Tech
twice:
DimArray(zeros(2,2), (Enduse = ["hello", "world"], Tech = Tech(DD.Categorical(["hello", "world"], order=DimensionalData.ForwardOrdered()))))
You need to either use the NamedTuple syntax where the names are the dimension type and the value is the Array or LookupArray:
DimArray(zeros(2,2), (Enduse = ["hello", "world"], Tech = DD.Categorical(["hello", "world"], order=DimensionalData.ForwardOrdered())))
or use a Tuple syntax where the tuple contains Dimensions
DimArray(zeros(2,2), (Enduse(["hello", "world"]), Tech(DD.Categorical(["hello", "world"], order=DimensionalData.ForwardOrdered()))))
We probably need a better error message for when you mix them, I've never had time to make the error messages better.
@pmodv do notice that your Y dimension is broken - it isn't ForwardOrdered
. You will get the wrong results with At
because it will try to use searchsortedfirst
. If the lookup is not sorted it needs to be Unordered
.
Instead of forcing ForwardOrdered
the real question is from above:
I guess we could make Between work on Unordered ?
Thanks for the quick replies! Removing the Tech
constructor works. That resolves this issue I think and it can be closed.
I'm curious. If I execute @dim Tech
after constructing a DimArray
, the old array has Dim{:Tech}
but any new arrays constructed has Tech
:
julia> using DimensionalData: @dim
julia> using DimensionalData
julia> import DimensionalData as DD
julia> arr = DimArray(zeros(2,2), (Enduse = ["hello", "world"], Tech = DD.Categorical(["zzz", "world"], order=DimensionalData.ForwardOrdered())))
2×2 DimArray{Float64,2} with dimensions:
Dim{:Enduse} Categorical{String} String["hello", "world"] ForwardOrdered,
Dim{:Tech} Categorical{String} String["zzz", "world"] ForwardOrdered
"zzz" "world"
"hello" 0.0 0.0
"world" 0.0 0.0
julia> @dim Tech
julia> arr
2×2 DimArray{Float64,2} with dimensions:
Dim{:Enduse} Categorical{String} String["hello", "world"] ForwardOrdered,
Dim{:Tech} Categorical{String} String["zzz", "world"] ForwardOrdered
"zzz" "world"
"hello" 0.0 0.0
"world" 0.0 0.0
julia> arr = DimArray(zeros(2,2), (Enduse = ["hello", "world"], Tech = DD.Categorical(["zzz", "world"], order=DimensionalData.ForwardOrdered())))
2×2 DimArray{Float64,2} with dimensions:
Dim{:Enduse} Categorical{String} String["hello", "world"] ForwardOrdered,
Tech Categorical{String} String["zzz", "world"] ForwardOrdered
"zzz" "world"
"hello" 0.0 0.0
"world" 0.0 0.0
It makes sense that the type of the old array doesn't change but how does the new array know about the Tech
dimension type? I guess I don't quite understand how DimensionalData
is able to track which types are created as Dimensions using @dim
? Is it looking for any subtypes of a dimension with the same name?
I guess we could make Between work on Unordered ?
That would work but are you suggesting implementing functions on our end or making updates to DimensionalData.jl
?
Yeah - the Dim{:Tech}
thing is kinda confusing. Thats the default object when there is no available type, and you cant make it inherit from anything. (You can do @dim XDim Tech
to make it plot on x axes).
It knows afterwards because the macro adds methods for the internal key2dim
that dispatch on Val{:Tech}
.
Yes, kinda too magical really but I was into that stuff three years ago...
Cool! That makes sense! Feel free to close the issue as resolved.
It looks like I what I really want is to force everything to be Unordered
so that At
works always, even if it is slower. I think to use this we'd want to Between
working on Unordered
.
I'm happy to submit a PR if you can give some direction on how to go about doing this.
I guess we could make Between work on
Unordered
?It just needs a
between
function for that usingfindfirst
. Its in the selectors.jl file in the LookupArrays module.(This wasnt implemented because Between is mostly synonymous and now replaced by
..
and selecting intervals over unsorted lists is kinda weird)
I'm assuming something has to be implemented in DimensionalData.jl
for this to work, right?
julia> arr = DimArray(zeros(2,2), (Enduse = DD.Categorical(["hello", "world"], order=DD.Unordered()), Tech = DD.Categorical(["zzz", "world"], order=DD.Unordered())))
julia> arr[Tech=At("zzz"):At("world")]
ERROR: MethodError: no method matching isless(::At{String, Nothing, Nothing}, ::At{String, Nothing, Nothing})
Closest candidates are:
isless(::Any, ::Missing)
@ Base missing.jl:88
isless(::Missing, ::Any)
@ Base missing.jl:87
julia> arr[Tech=At("zzz")..At("world")]
ERROR: ArgumentError: Cannot use an interval or `Between` with Unordered
Stacktrace:
Why force Unordered? At
works fine ether way as long as the type matches the state of the lookup. (Or youre saying you dont know if its ordered?)
Between
or ..
working on Unordered is doable... Its still ordered in the sense of the lookup.
But confusion is possoble if people expect to still get indices alphabetically between their values.
Yourv At example is just ..
but not implemented. It wont need the At parts.
You can pit a vector inside At if you want that, but you have to specify everything
Why force Unordered?
Mainly because it is easier to know the types ahead of time. I'm playing around with using DimStack
which I believe would avoid the need for this, but at the moment we have a lot of code to port and test to use DimensionalData.jl and I want to experiment more before committing to using the package. Our current code uses findfirst(predicate, categorical_data)
and findall(predicate, categorical_data)
and it would be useful to force Unordered
to see if there's any performance difference. I'm expecting there would be little to none if type inference doesn't fail.
Or youre saying you dont know if its ordered?
Our users / modelers may add categorical data to existing data and it may change from ForwardOrdered
to Unordered
, so I want to be able to allow them to write code that works on either of them exposing them to ForwardOrdered
/ Unordered
/ other implementation details.
But confusion is possoble if people expect to still get indices alphabetically between their values.
I'm not exactly following what confusion would occur but I'm sure there's something you are referring to that I'm missing. Can you expand on what you think is the issue with Between
working on categorical data, where the implementation basically does findfirst(a):findlast(b)
for Between(a, b)
?
Unordered is much slower for large arrays, because binary search will beat linear search.
But for small arrays Unordered may be faster
The confusion is that with ForwardOrdered Between
or ..
have two meanings that both hold.
-
All indices with lookup values between specific values, and
-
all indices between the two outer indices with specified lookup values.
You are arguing for the secong meaning to be canonical, but someone could equally argue for the first - then Between would work like Where over the interval.
Can't we solve this by indexing with @enum
?
E.g:
@enum Idx idxfirst = 1 idxsecond = 2
A = [1,2,3,4]
Base.to_index(i::Idx) = Int(i)
... and thus...
julia> A[idxfirst]
1
You are arguing for the second meaning to be canonical, but someone could equally argue for the first - then Between would work like Where over the interval.
For categorical data, is there a difference? In categorical data, findfirst(A, idx) == findlast(A, idx)
, right?
Ordered categories are a thing, so its really no different to Sampled.
We can make this change, Im just sure I will get a bug report one day where someone expected the opposite. So we should be sure that its needed.