scala-library-next
scala-library-next copied to clipboard
Idea: Add sortedTo (and friends) to Iterator
I feel like having a sortedTo(factory) (as well as all the other sorting variants) on Iterator would be a good addition.
It would be good to discuss if it can produce any performance advantage over iterator.to(factory).sorted or if it can be considered useful just as an API improvement to avoid two calls.
It is also worth discussing if this would be useful for all collections in general; for example, list.sortedTo(ArraySeq).
@joshlemer might have an opinion?
I think that having a fully lazy .sorted on Iterator and View would be really beneficial. I'm not sure that we'd need a .sortedTo(factory) because that optimization logic could be baked into the sorted iterator's .to method.
It would also come into handy with more than just terminal operations, if you could have a lazy sort:
case class Employee(name: String, salary: Double)
val employees: Iterable[Employee] = ???
val top3 = employees.{view, iterator}.sortBy(_.salary).take(3).toSeq
the lazy sorting would open the door to optimizing the above .sorted.take(n) example by using a min-max heap to avoid doing a lot of sorting on the source collection. In non-optimized, non-totally-correct pseudo-ish code...
self =>
def take(n: Int): Iterator = {
if (self.knownSize != -1 && self.knownSize < n) {
// nothing to be gained from heapifying. Just sort it
return Iterator.empty.concat(self.toArray.sortInPlace.iterator)
}
new Iterator[A] {
var heap: MinMaxHeap[A] = null
def hasNext: Boolean = (heap != null && heap.nonEmpty) || self.hasNext
def next(): A = {
if (heap != null) return heap.popMin()
// must force `self` now
self.foreach { a =>
if (heap.size < n) heap.addOne(a)
else {
if (a < heap.peekLast()) {
heap.popLast()
heap.addOne(a)
}
}
}
heap.popMin()
}
}
}
regardless of whether or not we have something for Iterator/View with a min-max heap, I think it would be valuable to have sortedTo (and friends), because of their:
- low conceptual weight
- simple implementation
- fast performance
while a min-max heap could drastically improve head, last, take or drop operations after a sort on an iterator, it comes with a significant performance cost if you want to use most or all of the collection. additionally, it has low/non-obvious discoverability (we already have .view.sorted.to(...) with essentially the same implementation as sortedTo(...) would have, but people don't know about it).
What about having sortedTo and sortedTake (and friends)?
I may be mistaken, but I do not think that a sorted method on Iterator would be efficient in the general sense, for example what would happen if then you chain many more transformations like map and filter? I think you would need to have an intermediate collection which is basically the point to avoid.
It seems (but I may be mistaken) that it would be easier and better to implement a lazy sort in the context of a consumption of the Iterator.
@NthPortal
low conceptual weight; simple implementation; fast performance
Just my opinion but I think it's lighter to just have .sorted than to have .sortedTo(List). We would have to have a new iterator implementation at any rate, SortedIterator[A], we can just override the def to:
class SortedIterator[A](underlying: Iterator[A]) {
// optionally, we can optimize these methods, which one cannot do if you take the `sortedTo` approach
override def take(n: Int): Iterator[A] = ???
override def drop(n: Int): Iterator[A] = ???
override def takeRight(n: Int): Iterator[A] = ???
override def dropRight(n: Int): Iterator[A] = ???
override def slice(i: Int, j: Int): Iterator[A] = ???
// whatever optimizations you would have put into sortedTo, just put in here
override def to[C](f: Factory[A, C]): C = ???
}
This is no less simple than making a new sortedTo method. It's much less intrusiv than duplicating all of the take style methods as @BalmungSan with sortedTake sortedDrop sortedTakeRight sortedDropRight sortedSlice sortedTakeWhile sortedDropWhile and whichever others. .sorted.to is no less optimizeable/performant than .sortedTo (well, save the iterator allocation I guess?)
while a min-max heap could drastically improve head, last, take or drop operations after a sort on an iterator, it comes with a significant performance cost if you want to use most or all of the collection
We could have a heuristic like "only kick in the heap once we confirm that the iterator is larger than (say, for example) 5 or 10 times n. Then we're being very conservative with the heap, only falling back to it when we know for certain that the iterator is enormous compared to n.
we already have .view.sorted.to(...) with essentially the same implementation as sortedTo(...) would have, but people don't know about it)
there isn't a .view.sorted.to(...), on anything but SeqView. I was actually going to write up some of this motivation in the recent Sorted-is-slightly-less-lazy PR but got sidetracked 😅
with
sortedTakesortedDropsortedTakeRightsortedDropRightsortedSlicesortedTakeWhilesortedDropWhileand whichever others
I'm pretty sure they just meant variants for sortWith and sortBy (i.e. sortToWith and sortToBy or something)
found a previous overlapping discussion: https://github.com/scala/bug/issues/11711
@SethTisue got me, I am realizing I've been beating this drum for ages 🤣
I am for adding lazy sort for view and iterator
there's only so lazily you can sort. at the end of the day, you need to at the very least iterate over all elements
If I understand correctly, the value of such an operation would be to be able to sort the X first elements in O(N) rather than in O(N*log(N)), right? (when X is small and in particular X << N)
The benefits would be:
- Ability to add optimization that avoids sorting most elements in situations like
View.fill(1000000)(Random.nextInt()).sorted.take(5).toList
this goes for take and friends (slice, takeRight, drop, dropRight, ...)
- avoid forcing at all when you don't need to. This is for performance but also for surprise-avoidance and consistency. The view and iterator apis are in general always lazy, so users can generally build up a tree of views and count on them not being forced until the terminal operation at the end:
val v: View[Int] = ???
View.fill(10)(1).concat(v.sorted).take(5).toList
^ Under the strict approach, v would have been forced, but it turned out we never even needed it at all.
Either way it works for me. It is not a lazy operation but I think it just conforms to the new way of collection conversion, which is using view, and avoiding adding a new API method
currently, View#sorted is only partially lazy - it computes the size eagerly to ensure that the collection is finite