scala-library-next icon indicating copy to clipboard operation
scala-library-next copied to clipboard

Idea: Add sortedTo (and friends) to Iterator

Open BalmungSan opened this issue 5 years ago • 15 comments

I feel like having a sortedTo(factory) (as well as all the other sorting variants) on Iterator would be a good addition.

It would be good to discuss if it can produce any performance advantage over iterator.to(factory).sorted or if it can be considered useful just as an API improvement to avoid two calls. It is also worth discussing if this would be useful for all collections in general; for example, list.sortedTo(ArraySeq).

BalmungSan avatar Oct 23 '20 22:10 BalmungSan

@joshlemer might have an opinion?

SethTisue avatar Oct 23 '20 23:10 SethTisue

I think that having a fully lazy .sorted on Iterator and View would be really beneficial. I'm not sure that we'd need a .sortedTo(factory) because that optimization logic could be baked into the sorted iterator's .to method.

It would also come into handy with more than just terminal operations, if you could have a lazy sort:

case class Employee(name: String, salary: Double)

val employees: Iterable[Employee] = ???

val top3 = employees.{view, iterator}.sortBy(_.salary).take(3).toSeq

joshlemer avatar Oct 24 '20 14:10 joshlemer

the lazy sorting would open the door to optimizing the above .sorted.take(n) example by using a min-max heap to avoid doing a lot of sorting on the source collection. In non-optimized, non-totally-correct pseudo-ish code...

self => 

def take(n: Int): Iterator = {
  if (self.knownSize != -1 && self.knownSize < n) {
    // nothing to be gained from heapifying. Just sort it
    return Iterator.empty.concat(self.toArray.sortInPlace.iterator)
  }
  
  new Iterator[A] {
    var heap: MinMaxHeap[A] = null
    def hasNext: Boolean = (heap != null && heap.nonEmpty) || self.hasNext
    def next(): A = {
      if (heap != null) return heap.popMin()
      
      // must force `self` now
      self.foreach { a => 
         if (heap.size < n) heap.addOne(a)
         else {
           if (a < heap.peekLast()) {
             heap.popLast()
             heap.addOne(a)
           }
         }
      }

       heap.popMin()
    }
   
  }
}

joshlemer avatar Oct 24 '20 15:10 joshlemer

regardless of whether or not we have something for Iterator/View with a min-max heap, I think it would be valuable to have sortedTo (and friends), because of their:

  • low conceptual weight
  • simple implementation
  • fast performance

while a min-max heap could drastically improve head, last, take or drop operations after a sort on an iterator, it comes with a significant performance cost if you want to use most or all of the collection. additionally, it has low/non-obvious discoverability (we already have .view.sorted.to(...) with essentially the same implementation as sortedTo(...) would have, but people don't know about it).

NthPortal avatar Oct 24 '20 18:10 NthPortal

What about having sortedTo and sortedTake (and friends)?

I may be mistaken, but I do not think that a sorted method on Iterator would be efficient in the general sense, for example what would happen if then you chain many more transformations like map and filter? I think you would need to have an intermediate collection which is basically the point to avoid.

It seems (but I may be mistaken) that it would be easier and better to implement a lazy sort in the context of a consumption of the Iterator.

BalmungSan avatar Oct 24 '20 19:10 BalmungSan

@NthPortal

low conceptual weight; simple implementation; fast performance

Just my opinion but I think it's lighter to just have .sorted than to have .sortedTo(List). We would have to have a new iterator implementation at any rate, SortedIterator[A], we can just override the def to:

class SortedIterator[A](underlying: Iterator[A]) { 

  // optionally, we can optimize these methods, which one cannot do if you take the `sortedTo` approach 

  override def take(n: Int): Iterator[A] = ???
  override def drop(n: Int): Iterator[A] = ???
  override def takeRight(n: Int): Iterator[A] = ???
  override def dropRight(n: Int): Iterator[A] = ???
  override def slice(i: Int, j: Int): Iterator[A] = ???


  // whatever optimizations you would have put into sortedTo, just put in here
 override def to[C](f: Factory[A, C]): C = ???
}

This is no less simple than making a new sortedTo method. It's much less intrusiv than duplicating all of the take style methods as @BalmungSan with sortedTake sortedDrop sortedTakeRight sortedDropRight sortedSlice sortedTakeWhile sortedDropWhile and whichever others. .sorted.to is no less optimizeable/performant than .sortedTo (well, save the iterator allocation I guess?)

while a min-max heap could drastically improve head, last, take or drop operations after a sort on an iterator, it comes with a significant performance cost if you want to use most or all of the collection

We could have a heuristic like "only kick in the heap once we confirm that the iterator is larger than (say, for example) 5 or 10 times n. Then we're being very conservative with the heap, only falling back to it when we know for certain that the iterator is enormous compared to n.

we already have .view.sorted.to(...) with essentially the same implementation as sortedTo(...) would have, but people don't know about it)

there isn't a .view.sorted.to(...), on anything but SeqView. I was actually going to write up some of this motivation in the recent Sorted-is-slightly-less-lazy PR but got sidetracked 😅

joshlemer avatar Oct 24 '20 20:10 joshlemer

with sortedTake sortedDrop sortedTakeRight sortedDropRight sortedSlice sortedTakeWhile sortedDropWhile and whichever others

I'm pretty sure they just meant variants for sortWith and sortBy (i.e. sortToWith and sortToBy or something)

NthPortal avatar Oct 24 '20 20:10 NthPortal

found a previous overlapping discussion: https://github.com/scala/bug/issues/11711

SethTisue avatar Oct 24 '20 21:10 SethTisue

@SethTisue got me, I am realizing I've been beating this drum for ages 🤣

joshlemer avatar Oct 24 '20 21:10 joshlemer

I am for adding lazy sort for view and iterator

neontorrent avatar Jan 21 '21 01:01 neontorrent

there's only so lazily you can sort. at the end of the day, you need to at the very least iterate over all elements

NthPortal avatar Jan 21 '21 05:01 NthPortal

If I understand correctly, the value of such an operation would be to be able to sort the X first elements in O(N) rather than in O(N*log(N)), right? (when X is small and in particular X << N)

julienrf avatar Jan 21 '21 09:01 julienrf

The benefits would be:

  1. Ability to add optimization that avoids sorting most elements in situations like
View.fill(1000000)(Random.nextInt()).sorted.take(5).toList 

this goes for take and friends (slice, takeRight, drop, dropRight, ...)

  1. avoid forcing at all when you don't need to. This is for performance but also for surprise-avoidance and consistency. The view and iterator apis are in general always lazy, so users can generally build up a tree of views and count on them not being forced until the terminal operation at the end:
val v: View[Int] = ???

View.fill(10)(1).concat(v.sorted).take(5).toList 

^ Under the strict approach, v would have been forced, but it turned out we never even needed it at all.

joshlemer avatar Jan 21 '21 16:01 joshlemer

Either way it works for me. It is not a lazy operation but I think it just conforms to the new way of collection conversion, which is using view, and avoiding adding a new API method

neontorrent avatar Jan 21 '21 18:01 neontorrent

currently, View#sorted is only partially lazy - it computes the size eagerly to ensure that the collection is finite

NthPortal avatar Jan 21 '21 23:01 NthPortal