ComplexityMeasures.jl icon indicating copy to clipboard operation
ComplexityMeasures.jl copied to clipboard

Rank-based outcome space

Open kahaaga opened this issue 6 months ago • 4 comments

Describe the feature you'd like to have

While implementing the Chatterjee coefficient of correlation in Associations.jl, I needed to compute ranks of an input vector, according to some method. That made me think: why not have an encoding and an outcome space in ComplexityMeasures.jl that does the same? I imagine something like this:

The encoding takes as an input the input data, which allows a mapping encoding the raw values xi onto an index idx_i, and decode that index onto xi. If length(x) == length(rank(x)), then there are as many outcomes as there are values in x (so the outcome space is a bit useless for complexity quantification). However, in datasets with repetitions, we can have length(rank(x)) << length(x). Then each rank index is an outcome, and we meaningfully estimate counts/probabilities by counting how many input data points map to each rank index.

Cite scientific papers related to the feature/algorithm

I have no references atm. This may or may not have been done by someone before.

If possible, sketch out an implementation strategy

Relatively straight-forward: follow the dev docs on implementing the encoding and outcome space.

A few notes:

  • The encoding must store the raw data, so that we can both and encode/decode.
  • There are many possible rank methods. The user can control which method to use by applying keyword argument to the encoding/outcome space (probably most elegant with dispatch on a T <: RankType). StatsBase.jl implements a few such ranking methods we can use for inspiration. I also implemented a randomization-tie-breaking variant for the Chatterjee coefficient that we would use.

kahaaga avatar Jul 31 '24 06:07 kahaaga