bug icon indicating copy to clipboard operation
bug copied to clipboard

IterableOnceOps.sum creates multiple iterators

Open srdo-humio opened this issue 2 years ago • 6 comments

Reproduction steps

Scala version: 2.13.6

    val someSeq = Seq(1).toSeq
    for (i <- 1 to 100_000) {
      println(someSeq.view.filter(i => Random.nextBoolean()).sum)
    }

The above is very likely to produce a stack trace like this:

java.lang.UnsupportedOperationException: empty.reduceLeft
  at scala.collection.IterableOnceOps.reduceLeft(IterableOnce.scala:727)
  at scala.collection.IterableOnceOps.reduceLeft$(IterableOnce.scala:724)
  at scala.collection.AbstractIterable.reduceLeft(Iterable.scala:926)
  at scala.collection.IterableOnceOps.reduce(IterableOnce.scala:698)
  at scala.collection.IterableOnceOps.reduce$(IterableOnce.scala:698)
  at scala.collection.AbstractIterable.reduce(Iterable.scala:926)
  at scala.collection.IterableOnceOps.sum(IterableOnce.scala:889)
  at scala.collection.IterableOnceOps.sum$(IterableOnce.scala:889)
  at scala.collection.AbstractIterable.sum(Iterable.scala:926)
  <snip>

Problem

IterableOnceOps.sum creates two different iterators for the source collection. Those iterators can be out of sync if the data source is lazy. This can cause exceptions to be thrown from sum.

In order to determine if the source collection is empty, sum creates an iterator and calls hasNext. If that iterator is non-empty, sum delegates to reduce(num.plus), which creates a new, separate iterator to compute the sum. reduce fails on empty iterators, and for lazy collections involving filters, there is no guarantee that two separate iterators will agree on the state of the source collection.

I think this can be fixed by making sum create an iterator it can pass to reduce, instead of having the is-empty check create an iterator that isn't used by the reduce operation.

srdo-humio avatar Jul 18 '22 10:07 srdo-humio

I would expect it to use foldLeft(num.zero)(num.plus) directly. But a more interesting question is what to do with all the other methods which use the same pattern with isEmpty?

joroKr21 avatar Jul 18 '22 11:07 joroKr21

Does this fall under the caveat for isEmpty, broadly construed:

    *  Note: Implementations in subclasses that are not repeatedly traversable must take
    *  care not to consume any elements when `isEmpty` is called.

Naively, I would expect the underlying to know what has been evaluated.

som-snytt avatar Jul 18 '22 12:07 som-snytt

@scala/collections ?

SethTisue avatar Jul 18 '22 19:07 SethTisue

it's hard for me to decide whether or not this is even a bug. it's probably not hard to fix, but from the perspective of isEmpty's implementation, you've changed the collection during isEmpty's execution, and I'm not sure why you expected that to go any way but poorly

NthPortal avatar Aug 23 '22 10:08 NthPortal

from the perspective of isEmpty's implementation, you've changed the collection during isEmpty's execution, and I'm not sure why you expected that to go any way but poorly

Okay, but you do see how this behavior makes View somewhat dangerous to use? Also isEmpty is in a class called IterableOnceOps, which I take to mean the code should assume the underlying collection can be iterated only once, so creating two different iterators and calling hasNext on both seems like a bad idea in any case. Some types of iterator may use hasNext to load the next item from the underlying source that we may only be able to read once into the iterator, so calling hasNext on two different new iterators isn't guaranteed to give the same result.

srdo-humio avatar Sep 01 '22 09:09 srdo-humio

This is a really unfortunate feature interaction between view not memoizing and IterableOnceOps assuming that if it keeps asking for iterators it keeps getting the same iterator.

There's no justification for the library to have a footgun like this implemented in it. It's conceptually trivial to fix: for every operation, all IterableOnceOps should do is get an iterator and defer to it. In practice, switching every use of isEmpty to knownSize == 0 and then calling a method that does the empty-check again might be less churn on the code and might perform better in case of collection that knows when it's empty. Critically, if you need to get a new iterator, knownSize returns -1.

For instance, sum would be implemented as

def sum[B >: A](implicit num: Numeric[B]): B =
  if (knownSize == 0) num.zero else iterator.sum

The extension method was always just iterator.sum. This code was apparently an attempt at optimization that went awry. product, most of the min and max code, and mkString all seem to have the same unfortunate optimization.

Ichoran avatar Sep 01 '22 10:09 Ichoran

The PR follows Ichoran. The hiccup is an existing optimization for IndexedSeq for binary compat reasons.

In other archaeology, I'm sorry I missed the ticket where Ichoran said, "You can close this as not a bug, but the whole design is a bug." Because of math context, BigDecimal ops are not commutative. That also answers joroKr's question why isn't it foldLeft.

It just took me a second to wonder why I didn't break that case again: it still uses reduction and the z value only on empty.

Also, breaking the case is good for detectives, bad for programmers.

som-snytt avatar Nov 07 '22 18:11 som-snytt