
☂️ Describe breaks on `Number` column (and other statistics inconsistencies)

Open Jolanrensen opened this issue 1 year ago • 3 comments

This happens because the Iterable<Number>.std() function accepts Number values but doesn't convert them to Double first (like mean() does).
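To illustrate the kind of fix meant here, a minimal sketch that converts every element to Double before doing any arithmetic. `stdAsDouble` is a hypothetical name, not the library's function, and `ddof` is assumed to mean "delta degrees of freedom" (divisor `n - ddof`, as in NumPy):

```kotlin
import kotlin.math.sqrt

// Hypothetical helper for illustration only, not the library's implementation.
// Assumption: ddof means "delta degrees of freedom", so the divisor is n - ddof.
fun Iterable<Number>.stdAsDouble(ddof: Int = 1): Double {
    // Convert up front so Byte, Short, Int, Long, Float and plain Number
    // all go through the same Double code path.
    val values = map { it.toDouble() }
    val n = values.size
    if (n <= ddof) return Double.NaN // too few values for the chosen ddof

    val mean = values.sum() / n
    val sumOfSquares = values.sumOf { (it - mean) * (it - mean) }
    return sqrt(sumOfSquares / (n - ddof))
}
```

With this, `listOf<Number>(1.toByte(), 2.toShort(), 3L).stdAsDouble()` gives 1.0 regardless of the mixed element types.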

There are a couple more missing actually:

  • cumSum
    • Misses Byte, Short
    • Has DataColumn overloads but not Iterable/Sequence
  • mean
    • Has Sequence<Double | Float> but not for other Number types
  • median
    • Misses Float, Byte, Short, Number (it only works on Comparable)
    • Needs to handle other types consistently
    • No Sequence overloads
    • Cannot skipNA (if applicable)
  • min and max
    • internal Iterable<T>.min and max are not used and can be removed. Stdlib functions for Comparable sequences and iterables are used instead.
    • Misses Number (it only works on Comparable)
  • std
    • Breaks if type is Number
    • Short and Byte are cast to Int which works but is a bit iffy
    • Iterable overloads missing for Number, Short, Byte
    • Sequence overloads missing
    • Nullable overloads missing for Iterable (and sequence)
  • varianceAndMean
    • Also provides a std(ddof: Int) function, as well as count, with no docs on what ddof even means. Could have a better name. It can also produce nulls?? This screams for documentation.
    • variance functions are missing on DataColumns entirely (had to be added separately for Kandy)
    • Misses Short, Byte, Number, and nullable overloads
    • Misses Sequence overloads
  • sum
    • Has TODOs where types are amiss
    • Misses Float(!), Short, Byte, Number in various Iterable overloads.

All are also missing BigInteger as we're supporting BigDecimal too.
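As a rough sketch of what a Sequence overload with nullable support could look like, assuming nulls are simply skipped and everything (including BigInteger and BigDecimal) is converted with `toDouble()`. `meanOrNaN` is a made-up name, and the conversion loses precision for very large values, which is exactly why the big types may need separate handling:

```kotlin
// Hypothetical helper for illustration only, not the library's implementation.
// Assumptions: nulls are skipped, and an empty input yields NaN.
fun Sequence<Number?>.meanOrNaN(): Double {
    var sum = 0.0
    var count = 0
    for (value in this) {
        if (value == null) continue
        // Number.toDouble() exists for Byte, Short, BigInteger, BigDecimal, etc.,
        // but it can lose precision for large BigInteger / BigDecimal values.
        sum += value.toDouble()
        count++
    }
    return if (count == 0) Double.NaN else sum / count
}
```

For example, `sequenceOf<Number?>(1.toByte(), 2.toShort(), null, java.math.BigInteger.valueOf(3), java.math.BigDecimal("4.0")).meanOrNaN()` gives 2.5.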

Jolanrensen avatar Jan 12 '24 16:01 Jolanrensen

https://github.com/Kotlin/dataframe/issues/352 is probably the same problem

koperagen avatar Jan 15 '24 18:01 koperagen

As mentioned in https://github.com/Kotlin/dataframe/issues/543, some functions like median(ints) might return an unexpectedly rounded Int. It might be better to let all functions return Double and handle BigInteger / BigDecimal separately for now, as they're Java-specific.
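A small sketch of the rounding issue, assuming the median of an even-sized input is the average of the two middle values. `medianAsDouble` is a hypothetical name, not an existing function:

```kotlin
// Hypothetical helper for illustration only, not the library's implementation.
// Assumption: for an even number of elements, the median is the average
// of the two middle values; an empty input yields NaN.
fun Iterable<Number>.medianAsDouble(): Double {
    val sorted = map { it.toDouble() }.sorted()
    if (sorted.isEmpty()) return Double.NaN
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) {
        sorted[mid]
    } else {
        // Averaging as Double avoids the Int truncation: (2 + 3) / 2 == 2 in Int math.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    }
}
```

Here `listOf<Number>(1, 2, 3, 4).medianAsDouble()` gives 2.5, whereas averaging the two middle values as Int would give 2.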

Jolanrensen avatar Jan 18 '24 11:01 Jolanrensen

It looks like an umbrella ticket and should be split into smaller tasks

zaleslaw avatar Apr 23 '24 13:04 zaleslaw