skdb icon indicating copy to clipboard operation
skdb copied to clipboard

Transform NonEmptyIterator into a NonEmptySequence

Open mbouaziz opened this issue 10 months ago • 5 comments

Given that the iterator is now (since #276) fully traversed and saved into an Array when the second element is accessed, I suggest that it is transformed into a lazy Sequence-like structure, i.e.:

  • unlike an Iterator, we have the guarantee it can be traversed many times
  • we still want to be able to access the first element and check whether it is unique without traversing the full iterator
  • as a lazy value it still needs to be mutable though it should observably look like an immutable value

mbouaziz avatar Feb 21 '25 22:02 mbouaziz

Are we sure that the change to eagerly drain iterators and perform a couple copies to store the values in arrays is the desired thing to do long-term on the skip side? I see how for the js api wrappers the copies are needed, but that isn't the only use of the skip NonEmptyIterator (they are used in EHandles), and it is not clear to me that it is best to make them so much heavier long-term.

jberdine avatar Feb 24 '25 12:02 jberdine

From what I read, NonEmptyIterators usage are:

  1. use first and ignore the rest (usually because the author knows there are no other values), though I consider this a bad pattern
  2. use first and fail if there are other values, this is a fix of the previous usage
  3. use all values
  4. map to another NonEmptyIterator
  5. skipruntime

I want to keep 2 and 4 lazy.

The main problem I currently see with NonEmptyIterator is that it sometimes acts like a sequence, e.g. map and clone, and sometimes like an iterator which traverses the sequence. I want to fix it.

mbouaziz avatar Feb 24 '25 13:02 mbouaziz

I'm not sure I understand which behaviors you associate with "like a sequence" vs "like an iterator", but agree that no matter what NonEmptyIterator is fishy. My understanding is that there are 2 uses: one as an internal implementation detail for handles in order to support an emptiness test so that in the empty case an optimization to avoid the allocation of a TWriter can be done, and another use in the skipruntime. This second one comes from, AFAIU, the type of e.g. EHandle.map being

  fun map<K2: Key, V2: File>(
    typeKey: Key ~> K2,
    type: File ~> V2,
    context: mutable Context,
    dirName: DirName,
    f: (
      mutable Context,
      mutable TWriter<K2, V2>,
      K,
      mutable NonEmptyIterator<V>,
    ) ~> void,
    rangeOpt: ?Array<KeyRange> = None(),
  ): EHandle<K2, V2>

I don't know if it is an improvement that the arg of f is a NonEmptyIterator instead of just an Iterator. In either case, it does not seem that the eager draining that NonEmptyIterator does is useful in this and similar cases. The only point where the eager draining is needed AFAIU is in the JS bindings.

I guess my question is basically, if NonEmptyIterator lost the materializedIter field, and the JS binding code was changed to do the drain to array there instead, would that address the issues you see with NonEmptyIterator?

I suppose my preference, given the usage, would be to make NonEmptyIterator as much like Iterator as possible, rather than to rename it and make it fully "sequence-like", since the additional weight beyond Iterator is not needed for the core usage.

jberdine avatar Feb 24 '25 14:02 jberdine

A Sequence is a container that can be traversed many times, you can always start from the beginning. An Iterator is a single traversal, you can only work from where you're currently are.

NonEmptyIterator#map and clone restarts from the beginning, unlike Iterator#map, though NonEmptyIterator extends Iterator.

The name itself, NonEmpty, doesn't make much sense once you've called next() a first time.

Indeed the eager draining was added for JS, so that it behaves like a sequence and not an iterator.

I guess my question is basically, if NonEmptyIterator lost the materializedIter field, and the JS binding code was changed to do the drain to array there instead, would that address the issues you see with NonEmptyIterator?

That'll be a good starting point.

mbouaziz avatar Feb 24 '25 15:02 mbouaziz

A Sequence is a container that can be traversed many times, you can always start from the beginning. An Iterator is a single traversal, you can only work from where you're currently are.

NonEmptyIterator#map and clone restarts from the beginning, unlike Iterator#map, though NonEmptyIterator extends Iterator.

Indeed the eager draining was added for JS, so that it behaves like a sequence and not an iterator.

Ok. But at least in skip, being able to restart from the beginning doesn't necessarily imply saving the results of the first traversal since multiple traversals are allowed to produce different results. So this seems at least partly orthogonal to the needs of the JS interface.

The name itself, NonEmpty, doesn't make much sense once you've called next() a first time.

Agreed. But I'm not sure that these should be more than internal details of Handle.sk. That is, I don't know if it helps callers of e.g. EHandle.map to receive a NonEmptyIterator instead of an Iterator.

I guess my question is basically, if NonEmptyIterator lost the materializedIter field, and the JS binding code was changed to do the drain to array there instead, would that address the issues you see with NonEmptyIterator?

That'll be a good starting point.

Just curious, but do you already know things that would be left after such a change?

jberdine avatar Feb 24 '25 15:02 jberdine