ComplexityMeasures.jl
ComplexityMeasures.jl copied to clipboard
Add generic histogram method for `Vector{<:Any}`, so we can use `counts(::UniqueElements, x)` on any data
Currently, we can do
julia> x = rand([1, 4, 6, 7, 8], 1000)
julia> counts(x)
Counts{Int64,1} over 5 outcomes
1 199
4 215
6 197
7 180
8 209
But this only works if x
is sortable, because fasthist!
uses sorting internally. We should have a generic (probably much slower) fallback that enables things like this to work too:
julia> y = ['a', (1, 2, 3,), 25];
julia> counts(y)
ERROR: MethodError: no method matching isless(::Tuple{Int64, Int64, Int64}, ::Char)
Closest candidates are:
isless(::Char, ::Char)
@ Base char.jl:214
isless(::AbstractChar, ::AbstractChar)
@ Base char.jl:221
isless(::Tuple, ::Tuple{})
@ Base tuple.jl:532
...
Stacktrace:
[1] lt(o::Base.Order.ForwardOrdering, a::Tuple{Int64, Int64, Int64}, b::Char)
@ Base.Order ./ordering.jl:117
[2] _sort!(v::Vector{Any}, #unused#::Base.Sort.InsertionSortAlg, o::Base.Order.ForwardOrdering, kw::NamedTuple{(:scratch, :lo, :hi), Tuple{Nothing, Int64, Int64}})
@ Base.Sort ./sort.jl:748
[3] _sort!
@ ./sort.jl:713 [inlined]
[4] _sort!
@ ./sort.jl:660 [inlined]
[5] _sort!
@ ./sort.jl:596 [inlined]
[6] #sort!#28
@ ./sort.jl:1374 [inlined]
[7] sort!
@ ./sort.jl:1367 [inlined]
[8] fasthist!(x::Vector{Any})
@ ComplexityMeasures ~/Documents/Repos/ComplexityMeasures.jl/src/encoding_implementations/fasthist.jl:21
[9] counts_and_outcomes
@ ~/Documents/Repos/ComplexityMeasures.jl/src/core/counts.jl:99 [inlined]
[10] counts(x::Vector{Any})
@ ComplexityMeasures ~/Documents/Repos/ComplexityMeasures.jl/src/core/counts.jl:121
[11] top-level scope
@ REPL[42]:1
Hm. Not entirely sure about this. Should we instead require the input to be sortable (i.e. the user has to map their data to some sortable format)? That would make maintenance a bit easier.
I think in the future there should be a version implemented for non-sortable collections. But this is extremely low priority because a user can always map their vector to the integers corresponding to unique elements and then use count on the integer. For now we can focus on v3 and hope that some other contributor will add this method in the future.
I think in the future there should be a version implemented for non-sortable collections. But this is extremely low priority because a user can always map their vector to the integers corresponding to unique elements and then use count on the integer.
Agreed. Then I keep this issue open.
There is a difficulty I see in implementing this: there is no julia function, or type trait, that checks for sortability. So it is not clear to me how we would check whether we can call fasthist or not in the first place...
We could check something like
julia> hasmethod(isless, (T, T))
false
with T
the element type of the vector.
We could check something like
julia> hasmethod(isless, (T, T)) false
with
T
the element type of the vector.
Yes, this should work. The fallback would typically just occur when the user has different types of elements in their input vector. Then T
becomes Any
and
julia> T = Any
Any
julia> hasmethod(isless, (T, T))
false
In the current API, the docstring of outcome_space
is
"""
outcome_space(o::OutcomeSpace, x) → Ω
Return a sorted container containing all _possible_ outcomes of `o` for input `x`.
For some estimators the concrete outcome space is known without knowledge of input `x`,
in which case the function dispatches to `outcome_space(o)`.
In general it is recommended to use the 2-argument version irrespectively of estimator.
"""
function outcome_space(o::OutcomeSpace)
For a non-sortable vector, we can't sort the elements. Therefore, we either need to relax this requirement, or disallow using non-sortable data (which I think is too restrictive atm). We could just say that "if your input data isn't sortable, then the outcome space is given in the order of first appearance of the outcomes" or something like that
Yes order of first appearance is fine, but at the moment non sortable data simply domm't work with the pacakge ayways
I haven't done any sort of testing for speed etc, but an obvious (and perhaps the easiest) solution to this issue is precisely what @Datseris says above - convert to the integers (and then convert back again). We already have the infrastructure to do this conversion, in the form of UniqueElementsEncoding
.
- Make a method
is_sortable(x::AbstractVector{T}) where T = hasmethod(isless, (T, T))
- In all instances where
fasthist!(x, args...)
is called, instead call and intermediate functionfasthist_checksorted!(x, args...)
. It does the following: a. Checks if the inputx
is sortable. b. Ifx
is sortable, then usefasthist!
directly. c. If the input is not sortable, then create a new vectorx_ints
that map the unique elements ofx
to unique integers. To do so, useUniqueElementsEncoding
onx
and useencode
to map the elements ofx
onto the integers. d. Then passx_ints
tofasthist!
and obtain the histogram. e. Usedecode
to replace each element ofx_ints
with the original input. f. Optionally optimizecounts_and_outcomes(::UniqueElements, x)
by not replacing every element inx_ints
. Becausefasthist!
gives a histogram over the unique elements ofx_int
, we can get away with only decoding the unique elements inx_ints
.
In any case, anyone working on this should wait until #347 is merged.