jOOL icon indicating copy to clipboard operation
jOOL copied to clipboard

Add support for PERCENTILE_CONT semantics

Open rwperrott opened this issue 4 years ago • 6 comments

Expected behavior and actual behavior:

Re: Median definition. See code examples below.

The Agg.percentileBy method incorrectly assumes that index rounding is enough, but isn't when the the (size * percentile) still has an effectively non-zero factional part, and the floor indexed value and the next value are not equal.

I appreciate that not all Comparable types can be, or easily, added and divided, so I suggest that a modified percentile method, and median method, be added, accepting a type specific function, with parameters for the floor indexed value, the next indexed value and the fractional part of the index, to allow returning a more correct value. The function would only need to be called when the index has an effectively non-zero fractional part and the relevant indexed values are different. Any such new functionally should also exposed anywhere else the existing median and percentile* methods are also exposed e.g. Window.

Steps to reproduce the problem:

e.g. for median

Seq.of(1d,2d,3d,4d,5d,6d,7d).median().ifPresent(System.out::println);

prints 4.0, which is correct, because the exact middle value is 4.0,

Seq.of(1d,2d,3d,4d,5d,6d,7d,8d).median().ifPresent(System.out::println);

prints 4.0 which is apparently incorrect, because the middle values are 4 and 5, so should print (4 + 5) / 2 = 4.5

Versions:

  • jOOλ: 0.9.14
  • Java: 15

rwperrott avatar Jan 02 '21 03:01 rwperrott

I'm trying out my suggested solution. I think that a function like this could be passed into a modified copy of Agg.percentileBy, to cover ArrayList<Tuple2<T,U>> value storage, and later ArrayList<T> optimised value storage for detected identity function:

import java.util.function.Function;

@FunctionalInterface
public interface PercentileFunction<T,U> {
    T apply(T t0, // value at index before percentile position
            T t1, // next indexed value
            double indexFraction,
            Function<? super T, ? extends U> function); // to avoid need for two functions, for Tuple2<T,U> and T values
}

rwperrott avatar Jan 02 '21 14:01 rwperrott

I note that the https://infogalactic.com/info/Percentile does mention that linear interpolation is an option, just-like https://infogalactic.com/info/Median, for the median value of sorted, even length, numeric values.

rwperrott avatar Jan 03 '21 10:01 rwperrott

This seems to work, with the above function:

// imports
public class Agg2 {
    public static <T, U> Collector<T, ?, Optional<T>>
    percentileBy(double percentile,
                 Function<? super T, ? extends U> mapper,
                 Comparator<? super U> comparator,
                 PercentileFunction<T, U> percentileFunction) {
        if (percentile < 0.0 || percentile > 1.0)
            throw new IllegalArgumentException("Percentile must be between 0.0 and 1.0");

        return Collector.of(
                (Supplier<ArrayList<Tuple2<T, U>>>) ArrayList::new,
                (l, v) -> l.add(tuple(v, mapper.apply(v))),
                (l1, l2) -> {
                    l1.addAll(l2);
                    return l1;
                },
                l -> {
                    final int size = l.size();

                    if (size == 0)
                        return Optional.empty();
                    else if (size == 1)
                        return Optional.of(l.get(0).v1);

                    l.sort(Comparator.comparing(t -> t.v2, comparator)); // Compare U

                    if (percentile == 0d)
                        return Optional.of(l.get(0).v1);
                    if (percentile == 1d)
                        return Optional.of(l.get(size - 1).v1);

                    // Limit fraction size, to stop common errors for double percentile values e.g. 2E-16.
                    // 0.5 is added because actual percentile value can be between values.
                    final double dIndex = ((double) Math.round(size * percentile * 1.0E6d) * 1.0E-6d) - 0.5d;
                    int index = (int) dIndex; // floor, for before or exact index
                    if (index >= size)
                        return Optional.of(l.get(size - 1).v1);

                    final Tuple2<T, U> t0 = l.get(index); // before or exact value
                    final double indexFraction = dIndex - index;
                    // If end or exact index, return t0 value.
                    if (++index == size || indexFraction == 0d)
                        return Optional.of(t0.v1);

                    final Tuple2<T, U> t1 = l.get(index); // after value
                    // Only call percentile function if t*.v1 values are different.
                    return (t0.v1.equals(t1.v1))
                           ? Optional.of(t0.v1)
                           : Optional.of(percentileFunction.apply(t0.v1, t1.v1, indexFraction, mapper));
                });
}

Test code

class Test {
    public void main(String[] args) {
            final Double[] values = {0d, 10d, 20d, 30d, 40d, 50d, 60d, 70d, 80d, 90d};

            final PercentileFunction<Double, Double> floor = (t0, t1, f, m) -> {
                double v0 = m.apply(t0);
                double v1 = m.apply(t1);
                return v0;
            };

            final PercentileFunction<Double, Double> ceil = (t0, t1, f, m) -> {
                double v0 = m.apply(t0);
                double v1 = m.apply(t1);
                return v1;
            };

            final PercentileFunction<Double, Double> halfUp = (t0, t1, f, m) -> {
                double v0 = m.apply(t0);
                double v1 = m.apply(t1);
                return f < 0.5d ? v0 : v1;
            };

            final PercentileFunction<Double, Double> interpolate = (t0, t1, f, m) -> {
                double v0 = m.apply(t0);
                double v1 = m.apply(t1);
                return v0 - (v0 * f) + (v1 * f); // Linear interpolation
            };

            final String header = "percentile -> |  Agg | floor | halfUp | interpolate | ceil";
            System.out.println(header);
            for (double p = 0d; p <= 1.00d; p += 0.05d) {
                System.out.printf("   %5.3f   -> | %4.1f |  %4.1f |   %4.1f |    %4.1f     | %4.1f%n",
                                  p,
                                 Seq.of(values).percentile(p).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), floor)).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), halfUp)).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), interpolate)).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), ceil)).orElseThrow());
            }
            System.out.println(header);
    }
}

Results:

percentile -> |  Agg | floor | halfUp | interpolate | ceil
   0.000   -> |  0.0 |   0.0 |    0.0 |     0.0     |  0.0
   0.050   -> |  0.0 |   0.0 |    0.0 |     0.0     |  0.0
   0.100   -> |  0.0 |   0.0 |   10.0 |     5.0     | 10.0
   0.150   -> | 10.0 |  10.0 |   10.0 |    10.0     | 10.0
   0.200   -> | 10.0 |  10.0 |   20.0 |    15.0     | 20.0
   0.250   -> | 20.0 |  20.0 |   20.0 |    20.0     | 20.0
   0.300   -> | 20.0 |  20.0 |   30.0 |    25.0     | 30.0
   0.350   -> | 30.0 |  30.0 |   30.0 |    30.0     | 30.0
   0.400   -> | 30.0 |  30.0 |   40.0 |    35.0     | 40.0
   0.450   -> | 40.0 |  40.0 |   40.0 |    40.0     | 40.0
   0.500   -> | 40.0 |  40.0 |   50.0 |    45.0     | 50.0
   0.550   -> | 50.0 |  50.0 |   50.0 |    50.0     | 50.0
   0.600   -> | 50.0 |  50.0 |   60.0 |    55.0     | 60.0
   0.650   -> | 60.0 |  60.0 |   60.0 |    60.0     | 60.0
   0.700   -> | 70.0 |  60.0 |   70.0 |    65.0     | 70.0
   0.750   -> | 70.0 |  70.0 |   70.0 |    70.0     | 70.0
   0.800   -> | 80.0 |  70.0 |   80.0 |    75.0     | 80.0
   0.850   -> | 80.0 |  80.0 |   80.0 |    80.0     | 80.0
   0.900   -> | 90.0 |  80.0 |   90.0 |    85.0     | 90.0
   0.950   -> | 90.0 |  90.0 |   90.0 |    90.0     | 90.0
percentile -> |  Agg | floor | halfUp | interpolate | ceil

rwperrott avatar Jan 06 '21 15:01 rwperrott

Thanks a lot for your analysis, @rwperrott. I'll look into this next week.

lukaseder avatar Jan 08 '21 16:01 lukaseder

The current implementations implement the equivalent of SQL's PERCENTILE_DISC inverse distribution function, not PERCENTILE_CONT. This is probably not the correct default for MEDIAN, indeed, but for the other ones, it's correct by design and Javadoc.

Of course, we should offer PERCENTILE_CONT as well, as that is probably used more often in real world applications.

lukaseder avatar Jan 12 '21 08:01 lukaseder

The function parameter in PercentileFunction, thus U, now look redundant because function is unlikely to be useful for percentileBy operations, so can be simplified to this:

public interface PercentileFunction<T> {
    T apply(T t0,
            T t1,
            double indexFraction); // mapper proved to be redundant!

    static <T> PercentileFunction<T> floor() {
        return (t0, t1, f) -> t0;
    }

    static <T> PercentileFunction<T> ceil() {
        return (t0, t1, f) -> t1;
    }

    static <T> PercentileFunction<T> halfUp() {
        return (t0, t1, f) -> f < 0.5d ? t0 : t1;
    }

    static PercentileFunction<Double> interpolateDouble() {
        return (t0, t1, f) -> t0 - (t0 * f) + (t1 * f);
    }
}

It is pointless collecting Tuple2<T,U> just to get a U result, so a list collection only needs to collect U, and be wrapped in collector which maps T to U.

And example classes (not tested):

// imports
public class Agg2 {
    public static <T, U> Collector<T, ?, Optional<T>>
    percentileBy(double percentile,
                 Function<? super T, ? extends U> mapper,
                 Comparator<? super U> comparator,
                 PercentileFunction<T> percentileFunction) {
        if (percentile < 0.0 || percentile > 1.0)
            throw new IllegalArgumentException("Percentile must be between 0.0 and 1.0");

        return Collector.of(
                (Supplier<ArrayList<Tuple2<T, U>>>) ArrayList::new,
                (l, v) -> l.add(tuple(v, mapper.apply(v))),
                (l1, l2) -> {
                    l1.addAll(l2);
                    return l1;
                },
                l -> {
                    final int size = l.size();

                    if (size == 0)
                        return Optional.empty();
                    else if (size == 1)
                        return Optional.of(l.get(0).v1);

                    l.sort(Comparator.comparing(t -> t.v2, comparator)); // Compare U

                    if (percentile == 0d)
                        return Optional.of(l.get(0).v1);
                    if (percentile == 1d)
                        return Optional.of(l.get(size - 1).v1);

                    // Limit fraction size, to stop common errors for double percentile values e.g. 2E-16.
                    // 0.5 is added because actual percentile value can be between values.
                    final double dIndex = ((double) Math.round(size * percentile * 1.0E6d) * 1.0E-6d) - 0.5d;
                    int index = (int) dIndex; // floor, for before or exact index
                    if (index >= size)
                        return Optional.of(l.get(size - 1).v1);

                    final Tuple2<T, U> t0 = l.get(index); // before or exact value
                    final double indexFraction = dIndex - index;
                    // If end or exact index, return t0 value.
                    if (++index == size || indexFraction == 0d)
                        return Optional.of(t0.v1);

                    final Tuple2<T, U> t1 = l.get(index); // after value
                    // Only call percentile function if t*.v1 values are different.
                    return (t0.v1.equals(t1.v1))
                           ? Optional.of(t0.v1)
                           : Optional.of(percentileFunction.apply(t0.v1, t1.v1, indexFraction));
                });
}
import static PercentileFunction.*; // for static PercentileFunction provider methods.

class Test {
    public void main(String[] args) {
            final Double[] values = {0d, 10d, 20d, 30d, 40d, 50d, 60d, 70d, 80d, 90d};

            final String header = "percentile -> |  Agg | floor | halfUp | interpolate | ceil";
            System.out.println(header);
            for (double p = 0d; p <= 1.00d; p += 0.05d) {
                System.out.printf("   %5.3f   -> | %4.1f |  %4.1f |   %4.1f |    %4.1f     | %4.1f%n",
                                  p,
                                 Seq.of(values).percentile(p).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), floor())).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), halfUp())).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), interpolateDouble())).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), ceil())).orElseThrow());
            }
            System.out.println(header);
    }
}

rwperrott avatar Jan 28 '21 12:01 rwperrott