ComplexityMeasures.jl
ComplexityMeasures.jl copied to clipboard
Rank-based outcome space
Describe the feature you'd like to have
While implementing the Chatterjee coefficient of correlation in Associations.jl, I needed to compute ranks of an input vector, according to some method. That made me think: why not have an encoding and an outcome space in ComplexityMeasures.jl that does the same? I imagine something like this:
The encoding takes as an input the input data, which allows a mapping encoding the raw values xi
onto an index idx_i
, and decode that index onto xi
. If length(x) == length(rank(x))
, then there are as many outcomes as there are values in x
(so the outcome space is a bit useless for complexity quantification). However, in datasets with repetitions, we can have length(rank(x)) << length(x)
. Then each rank index is an outcome, and we meaningfully estimate counts/probabilities by counting how many input data points map to each rank index.
Cite scientific papers related to the feature/algorithm
I have no references atm. This may or may not have been done by someone before.
If possible, sketch out an implementation strategy
Relatively straight-forward: follow the dev docs on implementing the encoding and outcome space.
A few notes:
- The encoding must store the raw data, so that we can both and encode/decode.
- There are many possible rank methods. The user can control which method to use by applying keyword argument to the encoding/outcome space (probably most elegant with dispatch on a
T <: RankType
). StatsBase.jl implements a few such ranking methods we can use for inspiration. I also implemented a randomization-tie-breaking variant for the Chatterjee coefficient that we would use.