SplitAt implementation
This PR is an implementation of SplitAt as proposed in #315.
A test for an infinite sequence is missing. Something like:
```csharp
[Test]
public void SplitAtWithInfiniteSequence()
{
    var naturals = MoreEnumerable.Generate(1, x => x + 1);
    var (first, second) = naturals.SplitAt(10);
    Assert.That(first, Is.EquivalentTo(Enumerable.Range(1, 10)));
    Assert.That(second.Take(10), Is.EquivalentTo(Enumerable.Range(11, 10)));
}
```
@leandromoh That test can never work. Note the remarks comments:
```csharp
/// <remarks>
/// This method uses deferred execution semantics but iterates the
/// <paramref name="source"/> entirely when either of the sequences
/// resulting from split are iterated.
/// </remarks>
```
Since the sequence is iterated entirely, an infinite one will never terminate, or will stop only once memory has been exhausted.
I replaced your definition with the following and all tests passed, including the infinite-sequence test I mentioned above. Would the following version be preferred after some optimizations?
```csharp
public static TResult SplitAt<T, TResult>(
    this IEnumerable<T> source, int index,
    Func<IEnumerable<T>, IEnumerable<T>, TResult> resultSelector)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (resultSelector == null) throw new ArgumentNullException(nameof(resultSelector));

    if (index <= 0)
        return resultSelector(Enumerable.Empty<T>(), source);

    if (source is ICollection<T> col && index >= col.Count)
        return resultSelector(source, Enumerable.Empty<T>());

    var list = new List<T>();
    var cached = list.Concat(source.AsEnumerateOnce().Pipe(x => list.Add(x)));
    return resultSelector(cached.Take(index), cached.Skip(index));
}
```
where the definition of AsEnumerateOnce is:
```csharp
private static IEnumerable<T> AsEnumerateOnce<T>(this IEnumerable<T> source)
{
    IEnumerator<T> ector = null;
    return _(); IEnumerable<T> _()
    {
        ector = ector ?? source.GetEnumerator();
        while (ector.MoveNext())
            yield return ector.Current;
        ector.Dispose();
    }
}
```
@leandromoh So you completely changed the implementation to make it work with infinite sources. I'd say that's a little detail you missed out on earlier. 😆 I have a hard time reasoning about the code now due to all the side-effects (`Pipe`) and will have to inspect it more closely. I also see a mini `Memoize` in there too. Perhaps we should finish #211, which seems to have stalled again?

It's a noble goal to try to work with infinite sequences, but your proposal does make the second part somewhat inefficient due to the SkipWhile. It may be acceptable for small values of `index` but not otherwise. Also, the implementation I'm proposing returns lists for the two parts, which will be efficient for any operators optimized for lists or collections. However, I'm not sure anyone should count on that; I consider it an implementation detail at the moment.
Thanks for your suggestions & ideas. Keep 'em coming!
> @leandromoh So you completely changed the implementation to make it work with infinite sources.

The idea was to create the simplest draft that passes all the tests. Complex code is hard to reason about. As the C.A.R. Hoare quote goes: “There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.”
> I have a hard time reasoning about the code now due to all the side-effects (Pipe) and will have to inspect more closely.

Since my draft passed all the written tests, I suppose it is a perfectly valid implementation.
> It's a noble goal to try and work with infinite sequences

I think all LINQ operators should work with infinite sequences, except aggregation functions (`Count`, `Max`, `Min`, `Aggregate`, etc.); otherwise the method has a limitation it should not have. LINQ operators should be as lazy as possible.
> your proposal does make the second part somewhat inefficient due to the SkipWhile

As I said, it's only a draft without all possible optimizations. If we take it as the current implementation for this PR, we can add more optimizations to it; in the worst case, we can apply the mini memoize to it too and make it efficient.
> Since the sequence will be iterated entirely

I think that should be avoided; the sequence should be iterated as lazily as possible. Imagine that the source is long, say 10,000 elements, and the user splits it at 10 but consumes only its first 25 elements (10 from the first part and 15 from the second). The current implementation iterates 9,975 elements for nothing. That makes no sense, the user certainly doesn't expect it, and of course it performs poorly too.
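As an illustration of the lazy behaviour argued for here, a minimal sketch (not the implementation in this PR; `SplitAtLazy` is a made-up name, and enumerator disposal is elided for brevity) could have both parts share a single enumerator and buffer only the prefix, so only the elements actually consumed are ever pulled:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LazySplitSketch
{
    // Hypothetical lazy split: both parts share one enumerator, and at
    // most `index` prefix elements are ever buffered. The second part
    // can only be iterated once in this sketch.
    public static (IEnumerable<T> First, IEnumerable<T> Second)
        SplitAtLazy<T>(this IEnumerable<T> source, int index)
    {
        var buffer = new List<T>();   // prefix elements read so far
        var e = source.GetEnumerator();
        var done = false;

        // Pull from the shared enumerator until `count` prefix elements
        // are buffered or the source runs out.
        bool FillTo(int count)
        {
            while (!done && buffer.Count < count)
            {
                if (e.MoveNext()) buffer.Add(e.Current);
                else done = true;
            }
            return buffer.Count >= count;
        }

        IEnumerable<T> First()
        {
            for (var i = 0; i < index && FillTo(i + 1); i++)
                yield return buffer[i];
        }

        IEnumerable<T> Second()
        {
            FillTo(index);            // lazily skip past the prefix
            while (e.MoveNext())
                yield return e.Current;
        }

        return (First(), Second());
    }
}
```

With an infinite source, `var (first, second) = naturals.SplitAtLazy(10);` followed by consuming `first` and `second.Take(15)` pulls only 25 elements in total, matching the expectation described above.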
> the implementation I'm proposing returns lists for the two parts, which will be efficient for any operators optimized for lists or collections. However, I'm not sure anyone should count on that; I consider it an implementation detail at the moment.

I think it is more important that the methods work with infinite sequences than that they have optimizations. Users are not too concerned about optimizations, but they do expect a method to work with an infinite sequence when they are not iterating it entirely.
> Thanks for your suggestions & ideas. Keep 'em coming!

You are welcome; I am here to help and contribute 😆
> I also see a mini Memoize in there too. Perhaps we should finish #211 that seems to have stalled again?

To be honest, the fact that #211 was submitted 6 or 7 months ago and is still not acceptable for merge has left me a bit exhausted. I will be happy when it is released, but for now I think it is better not to count on it or wait for it.
Love that quote by Hoare every time I read it.
> Since my draft passed all the written tests, I suppose it is a perfectly valid implementation.
True, but aren't you assuming that the tests are perfect? 😃 Look how in #319 I made a capital mistake because I was missing a test (not a problem of coverage).
I agree with all your points about staying as lazy as possible. I realized the limitations and implementation choices, which is why I called them out in the remarks section. I'll reconsider based on your feedback.
> To be honest, the fact that #211 was submitted 6 or 7 months ago and is still not acceptable for merge has left me a bit exhausted. I will be happy when it is released, but for now I think it is better not to count on it or wait for it.
Sorry to hear that. If it's any consolation, it's been very consuming to review and fix.
> I will be happy when it is released, but for now I think it is better not to count on it or wait for it.
That's a shame because a lot of work & thought went into it. I feel now that my initial inclination to just go with the version in System.Interactive may have been the right call.
> Sorry to hear that. If it's any consolation, it's been very consuming to review and fix.
No problem. We all have a life outside GitHub, with duties and priorities.
> I feel now that my initial inclination to just go with the version in System.Interactive may have been the right call.

I think the problem is not the work we've spent, but how we define an MVP. The Memoize PR's release has been held up for 4 months only because of "Dispose should really mean Reset".
@atifaziz Any news?
@leandromoh Not much news, I'm afraid. 😞 I don't know if you followed #337, but I think the idea of using a reader monad to express partial consumption would work here with large and infinite sequences. For example:
```csharp
var source = MoreEnumerable.Generate(0, x => x + 1);
var (xs, ys) = source.SplitAt(10, ys => ys.Take(25));
```
So xs will have ≤ 10 elements and ys will have ≤ 25 and at most 35 elements of source will have been consumed even though it's an infinite sequence.
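A rough sketch of what such an overload might look like follows. This is only my reading of the idea, not the actual API from #337; the shape, names, and eager materialisation are all assumptions made for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PartialSplitSketch
{
    // Hypothetical overload: the caller's function receives the lazy
    // remainder and decides how much of it to consume, so the source is
    // never iterated past what is actually needed.
    public static (List<T> First, List<TResult> Second) SplitAt<T, TResult>(
        this IEnumerable<T> source, int index,
        Func<IEnumerable<T>, IEnumerable<TResult>> secondSelector)
    {
        using var e = source.GetEnumerator();

        var first = new List<T>();
        while (first.Count < index && e.MoveNext())
            first.Add(e.Current);

        IEnumerable<T> Rest()
        {
            while (e.MoveNext())
                yield return e.Current;
        }

        // Materialising here keeps all consumption inside the
        // enumerator's lifetime and bounded by what the selector asked for.
        return (first, secondSelector(Rest()).ToList());
    }
}
```

Under this sketch, `source.SplitAt(10, t => t.Take(25))` pulls at most 35 elements even from an infinite source, which is the behaviour described above.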
2a094a87fda3b268fb68b1a4954d7298b8b0ad93 updates the implementation based on my revised proposal for #315 where split parts are returned as sub-sequences instead of a pivoted couple (tuple of 2) of parts.
:memo: Notes:

- Clean-up implementation
- Consider changing `offsets` type to `IEnumerable<int>` (`params int[]` was out of laziness)
- Add more test cases of various offsets
- Optionally, optimise for corner cases where `source` can be returned verbatim, though this can be done post-merge of this PR
@atifaziz I liked this new implementation a lot. How about receiving a predicate (`Func<T, int, bool>`) to indicate a split? That would allow the user to split by other criteria. You could write the index overload in terms of this new overload by wrapping the `int[]` offsets in a predicate, e.g. `(e, index) => offsets.Contains(index)`.
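A quick sketch of this suggestion (`SplitWhere` is a made-up name, not an existing operator, and the parts are materialised eagerly only to keep the sketch short):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PredicateSplitSketch
{
    // Hypothetical predicate-based split: a new part starts whenever the
    // predicate returns true for an element's index.
    public static IEnumerable<List<T>> SplitWhere<T>(
        this IEnumerable<T> source, Func<T, int, bool> predicate)
    {
        var part = new List<T>();
        var i = 0;
        foreach (var e in source)
        {
            if (predicate(e, i++))
            {
                yield return part;
                part = new List<T>();
            }
            part.Add(e);
        }
        yield return part;
    }

    // The offsets-based overload rewritten in terms of SplitWhere,
    // as suggested above.
    public static IEnumerable<List<T>> SplitAtSketch<T>(
        this IEnumerable<T> source, params int[] offsets) =>
        source.SplitWhere((_, index) => offsets.Contains(index));
}

// Enumerable.Range(0, 10).SplitAtSketch(3, 7)
// yields [0,1,2], [3,4,5,6], [7,8,9]
```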
> I liked this new implementation a lot.
@leandromoh Glad to hear.
> How about receiving a predicate (`Func<T, int, bool>`) to indicate a split? That would allow the user to split by other criteria.
It's an idea, but we have to be careful to avoid confusion with Split and Segment.
I did not know that Segment can receive a predicate. Why not write SplitAt in terms of Segment?
> Why not write SplitAt in terms of Segment?
It would almost work, but it can't be used because Segment doesn't always produce a sub-sequence. With SplitAt, if the source sequence is too short, you will still get a blank sub-sequence.
Dispatch #633 may have been used to implement this.