rfcs icon indicating copy to clipboard operation
rfcs copied to clipboard

Add RFC for SortBy

Open damon-kwok opened this issue 4 years ago • 25 comments

This PR adds my RFC for SortBy primitive.

damon-kwok avatar Jul 16 '20 01:07 damon-kwok

@damon-kwok can you expand the motivation to explain why, given the existence of Sort that this would be valuable?

Additionally as part the detailed design, you should note there that is an existing implementation at X location. That information is currently hidden in the "How we test".

SeanTAllen avatar Jul 16 '20 01:07 SeanTAllen

@SeanTAllen Updated.

damon-kwok avatar Jul 16 '20 02:07 damon-kwok

@damon-kwok "is not beginner friendly" is a statement. You need as part of your motivation to demonstrate how it isn't beginner friendly and how SortBy is.

SeanTAllen avatar Jul 16 '20 02:07 SeanTAllen

@damon-kwok I think the second argument is really the strong point here. SortBy would let you sort objects that don't implement Comparable. Additionally, it lets you define arbitrary sort orderings so that you aren't constrained by the one chosen by Comparable.

"Beginner friendliness" is kind of arbitrary and may not be a point that you really want to try to argue.

aturley avatar Jul 16 '20 03:07 aturley

@aturley Yes, i think you are correct.

damon-kwok avatar Jul 16 '20 03:07 damon-kwok

Copying my comments from the original Zulip thread:

My one qualm is the comparator signature being {(B): USize} whereas you'd usually have {(B, B): ISize} in other languages, and return a numeric value that represents the order of the two elements (-1, 0, or 1)

The former forces you to think of how to assign a numeric value for every element, which doesn't make sense for example if you're sorting strings alphabetically

Another possibility would be {(B, B): Bool} representing a gt implementation. This could also use the interface Comparable methods as a default.

And one example that more closely matches the collections library (ComparableSort would be to Sort what HashMap is to Map, for example):

https://playground.ponylang.io/?gist=4fe7e1b7403332e25a997ad29e482ea7

EpicEric avatar Jul 16 '20 04:07 EpicEric

Copying zulip conversation context:

04:55 DamonKwok I think this may be complicated. My idea comes from Hashable, it only needs a fun hash (): USize. so, a lambda is work. Because a number already has all the interfaces you need.

04:58 Epic Eric Yeah I had to copy all methods but you could probably do with a single comparator like lt

damon-kwok avatar Jul 16 '20 04:07 damon-kwok

I agree with the statements above from @EpicEric - the lambda should be given the elements to compare. Coming up with an object -> integer mapping is not always feasible for the things I may want to sort.

Apart from the specific concerns that @EpicEric mentioned above, I want to add more concerns into the mix:

Imagine I have an object that I want to sort based on comparing multiple fields. In order to convert them to a USize, I'd have to first convert each one to a numeric representation, then do some bitshifting and bitmasking magic to make all of those numeric values fit into the USize in the correct order. This is more complicated and in many cases probably less performant because I'd have to eagerly consider all fields whereas a compare lambda would let me return early as soon as one of the earlier fields was found to be greater/lesser, and only consider later fields in the rarer case that all of the earlier fields are equal.

Further, this approach of converting all fields of the object into a numeric representation and squeezing them into a USize will not be physically possible for some objects, depending on the sizes (and relevant information density) of those fields.

To use a real world example from code I have written, consider a Time class that is implemented with two fields:

  • total_seconds: U64 - The total number of seconds since 0001-01-01 00:00:00.0 UTC
  • nanosecond: U32 - The number of nanoseconds past the designated second

There are 96 bits of information to consider for sorting purposes, and I'll never be able to fit all of those bits into a USize, which is 64 bits on some platforms and 32 bits on others. And that's an example with only two fields.

jemc avatar Jul 16 '20 13:07 jemc

@damon-kwok what "real world" code can you point to that is using this SortBy? I think knowing that could help pinpoint additional tests, usage patterns that should be experimented with before this would be considered for the standard library.

SeanTAllen avatar Jul 16 '20 13:07 SeanTAllen

@jemc If you have an extreme case that Usize cannot represent, that's not what SortBy is dealing with. SortBy focuses on conditional evaluations rather than solving all problems. See:

#Drawbacks The user must ensure that the lambda evaluation function is valid.

damon-kwok avatar Jul 16 '20 17:07 damon-kwok

@SeanTAllen Many real world cases:

Leveldb manager:

  • sort by key length
  • sort by memory used
  • sort by recent update

A blog system,

  • sort by create date
  • sort by last update time.

A set of JSON data representing the product information

sort by storage date sort by inventory number

damon-kwok avatar Jul 16 '20 17:07 damon-kwok

Focusing on existing types without compareable implementations. The comparison problem of complex types is reduced to a number comparison problem. This is what Sortby is going to do.

damon-kwok avatar Jul 16 '20 17:07 damon-kwok

I agree with @EpicEric and @jemc; I would prefer a lambda that takes two objects of the given type.

chalcolith avatar Jul 16 '20 18:07 chalcolith

But that's not what I want to do. Maybe someone should add a new RFC for it.

damon-kwok avatar Jul 16 '20 18:07 damon-kwok

I named it SortBy rather than any other name, hoping that it would be as simple and elegant as possible, and that no new types should be added for sorting. Otherwise it should be called something else.

damon-kwok avatar Jul 16 '20 18:07 damon-kwok

With a generic lambda approach like the one I showed above, adding a SortBy structure can be done in 10 lines:

primitive SortBy[A: Seq[B] ref, B: Any #read]
  fun apply(seq: A, f: {(B): USize} val): A^ =>
    Sort[A, B, SortFunction[B]](
      seq,
      object val is SortFunction[B]
        fun lt(first: B, second: B): Bool => f(first) < f(second)
        fun le(first: B, second: B): Bool => f(first) <= f(second)
        fun gt(first: B, second: B): Bool => f(first) > f(second)
        fun ge(first: B, second: B): Bool => f(first) >= f(second)
      end)

actor Main
  new create(env:Env) =>
    let array = [ "aa"; "aaa"; "a" ]
    SortBy[Array[String], String](array, {(x: String): USize => x.size() })
    for e in array.values() do
      env.out.print(e) // prints "a \n aa \n aaa"
    end

https://playground.ponylang.io/?gist=144512d4a8417b60452e3084b4c3f32d


As a side note: due to the need for a concrete type, you can't avoid adding the Array[String] type parameter in this case, matching the current Sort type. So either way, the RFC must be fixed to take that into account.

EpicEric avatar Jul 16 '20 19:07 EpicEric

@EpicEric I have a few questions:

  1. How does the extra 10 lines of code solve the problems raised by @jemc ?
  2. Your SortBy seems to do the same thing as my SortBy, but it requires the overhead of 2 Object Literals?
  3. I don't quite know your intention. My SortBy is a new primitive, it will not interfere with the existing Sort, but you seem to want to modify Sort's implementation. Is that right?

damon-kwok avatar Jul 17 '20 01:07 damon-kwok

@EpicEric Can you create a temp git repository and submit a sort*.pony file and a _test.pony file that can be used by others? This can help others more clearly compare the difference between the two.

damon-kwok avatar Jul 17 '20 01:07 damon-kwok

  1. How does the extra 10 lines of code solve the problems raised by @jemc ?

It doesn't. You'd create a sort function using the Sort implementation directly instead, without using SortBy.

primitive TimeSortFunction is SortFunction[Time]
  fun lt(first: Time, second: Time): Bool =>
    (first.total_seconds < second.total_seconds) || (
      (first.total_seconds == second.total_seconds)
        && (first.nanosecond < second.nanosecond))

  fun gt(first: Time, second: Time): Bool =>
    (first.total_seconds > second.total_seconds) || (
      (first.total_seconds == second.total_seconds)
        && first.nanosecond > second.nanosecond))

  fun le(first: Time, second: Time): Bool =>
    (first.total_seconds < second.total_seconds) || (
      (first.total_seconds == second.total_seconds)
        && first.nanosecond <= second.nanosecond))

  fun ge(first: Time, second: Time): Bool =>
    (first.total_seconds > second.total_seconds) || (
      (first.total_seconds == second.total_seconds)
        && (first.nanosecond >= second.nanosecond))

primitive TimeSort[A: Seq[Time] ref]
  fun apply(a: A): A^ =>
    Sort[A, Time, TimeSortFunction](a, TimeSort)
  1. Your SortBy seems to do the same thing as my SortBy, but it requires the overhead of 2 Object Literals?

This RFC doesn't have any code implementation, so I'm assuming you refer to this code. In this case, yes, there is one extra anonymous object allocation.

  1. I don't quite know your intention. My SortBy is a new primitive, it will not interfere with the existing Sort, but you seem to want to modify Sort's implementation. Is that right?

This is only if API names have to follow the same convention as the rest of the collections package. Compatibility could be maintained if Sort and ComparableSort are renamed to, say, RawSort and Sort respectively.

EpicEric avatar Jul 17 '20 13:07 EpicEric

@EpicEric Thanks! I see. I know the process of adding RFC is long and we can listen to other people's ideas.

damon-kwok avatar Jul 17 '20 14:07 damon-kwok

Do you still want to add this RFC?

While this may be a reasonable addition, I feel that it violates least surprise as it exists. Specifically:

  • Most of the comments here seemed to have expected from the name that it would be a comparison function
  • The comparable interface is based on comparison functions, rather than valuations
  • This is fixed to be USize, rather than any other type (and worse, the information available in USize is system-dependent)

It does not appear to me that a comparison-based solution has a very significant difference in allocation. If it just requires a single method (such as compare), then the allocation characteristics will be identical to a lambda returning USize. Of course as with HashFunctions we can have a generic primitive to dispatch without allocation in the general case

jasoncarr0 avatar Aug 22 '20 18:08 jasoncarr0

@jasoncarr0 Do you mean to change USize to U128 or I128?

damon-kwok avatar Aug 23 '20 04:08 damon-kwok

I would actually expect U64. The issue with USize is that the maximum value depends on the system it runs on.

jasoncarr0 avatar Aug 23 '20 05:08 jasoncarr0

@jasoncarr0 Updated.

damon-kwok avatar Aug 23 '20 06:08 damon-kwok

I agree with the commenters that I would expect something called SortBy to use a comparator. If that's not possible it should at least use a key which can be complex (e.g. tuple) and support lexicographic sorting.

Here's a survey of languages sorting support

  • JavaScript
  • C++
    • Offers sorting with comparison functions via std::sort
  • Python
    • Sorting is stable for partial orderings and lexicographic for tuples
    • Sort using key available (since Py2.4) where the key can be tuples
    • Before Py3.0, sort offered comparison functions using cmp parameter
  • Java
    • Allows simple sorting and defining a sorting for a type via Comparable
    • Also allows custom sorts to be provided using the Comparator class
  • Rust
    • Offers sorting and searching based on trait std::cmp::Ord
    • Allows sorting of slices using sort_by which takes a comparator.
    • The newtype zero cost abstraction can be used to wrap a type with a new sorting behavior.

esoterra avatar Feb 28 '21 00:02 esoterra