MoreLINQ
Find ranges
It would be handy to have a method that finds ranges of consecutive numbers (int or long) in a given (ordered) collection.
Something like: public static IEnumerable<Tuple<long, long>> FindConsecutiveRanges(this IEnumerable<long> source)
I believe you can achieve this with a combination of Lag and Segment:
var nums = new[] { -5, 1, 2, 3, 15, 16, 17, 95, 100, 101, 102, 103, 104 };
var ranges =
    nums.Select(n => (Some: true, Num: n))
        .Lag(1, (curr, prev) => (Curr: curr, Prev: prev))
        .Segment(e => !e.Prev.Some || e.Curr.Num - e.Prev.Num != 1)
        .Select(g => (First: g.First().Curr.Num, Last: g.Last().Curr.Num));
Console.WriteLine(string.Join(", ", ranges));
The output will be:
(-5, -5), (1, 3), (15, 17), (95, 95), (100, 104)
One could then extrapolate the following generic implementation:
public static IEnumerable<(T, T)> ConsecutiveRanges<T>(this IEnumerable<T> source,
                                                       Func<T, T, bool> f) =>
    source.Select(n => (Some: true, Num: n))
          .Lag(1, (curr, prev) => (Curr: curr, Prev: prev))
          .Segment(e => !e.Prev.Some || !f(e.Prev.Num, e.Curr.Num))
          .Select(g => (First: g.First().Curr.Num, Last: g.Last().Curr.Num));
I think you can avoid the introduction of Tuple with Lead and SkipLast:
var nums = new[] { -5, 1, 2, 3, 15, 16, 17, 95, 100, 101, 102, 103, 104 };
var ranges =
    nums.Lead(1, (curr, next) => (Curr: curr, Next: next)) // ValueTuple creation overload welcome here
        .SkipLast(1) // SkipLast() doesn't exist
        .Segment(e => e.Next - e.Curr != 1)
        .Select(g => (First: g.First().Curr, Last: g.Last().Curr));
Console.WriteLine(string.Join(", ", ranges));
This is only eye-compiled; I can't test it right now.
Edit
It will not work because the last element will be missing. But we can use the other overload:
var nums = new[] { -5, 1, 2, 3, 15, 16, 17, 95, 100, 101, 102, 103, 104 };
var ranges =
    nums.Segment((current, previous, i) => current - previous != 1)
        .Select(g => (First: g.First(), Last: g.Last()));
Console.WriteLine(string.Join(", ", ranges));
Related #632
Both of your solutions would enumerate the source collection multiple times. But finding ranges can be achieved with just one enumeration.
public static IEnumerable<(long Start, long End)> FindConsecutiveRanges(this IEnumerable<long> values)
{
    if (values == null)
        throw new ArgumentNullException(nameof(values));

    using (var enumerator = values.GetEnumerator())
    {
        if (!enumerator.MoveNext())
            yield break;

        var rangeStart = enumerator.Current;
        var rangeEnd = rangeStart;
        while (enumerator.MoveNext())
        {
            var value = enumerator.Current;
            if (value == rangeEnd + 1)
            {
                rangeEnd++;
            }
            else if (value <= rangeEnd)
            {
                throw new ArgumentException("The input must be sorted.", nameof(values));
            }
            else
            {
                yield return (rangeStart, rangeEnd);
                rangeStart = value;
                rangeEnd = rangeStart;
            }
        }

        yield return (rangeStart, rangeEnd);
    }
}
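As a quick sanity check, the single-pass method can be exercised like this. The method is restated compactly inside the snippet only so that it compiles on its own; the logic is the same as above.

```csharp
using System;
using System.Collections.Generic;

static class Demo
{
    // Same single-pass logic as the method above, restated so this snippet is self-contained.
    public static IEnumerable<(long Start, long End)> FindConsecutiveRanges(this IEnumerable<long> values)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));
        using (var e = values.GetEnumerator())
        {
            if (!e.MoveNext()) yield break;
            var (start, end) = (e.Current, e.Current);
            while (e.MoveNext())
            {
                var value = e.Current;
                if (value == end + 1) { end++; }
                else if (value <= end) { throw new ArgumentException("The input must be sorted.", nameof(values)); }
                else { yield return (start, end); (start, end) = (value, value); }
            }
            yield return (start, end);
        }
    }

    static void Main()
    {
        var nums = new long[] { -5, 1, 2, 3, 15, 16, 17, 95, 100, 101, 102, 103, 104 };
        Console.WriteLine(string.Join(", ", nums.FindConsecutiveRanges()));
        // → (-5, -5), (1, 3), (15, 17), (95, 95), (100, 104)
    }
}
```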
Both of your solutions would enumerate the source collection multiple times.
I don't think so; both Lead and Lag use a Queue (here of size 1) in their implementation to avoid multiple iterations.
Anyway, the code you provided shows that our implementations are not optimal in terms of memory.
Both of your solutions would enumerate the source collection multiple times.
I don't think so; both Lead and Lag use a Queue (here of size 1) in their implementation to avoid multiple iterations.
One iteration generates the range collection, and then each range collection is iterated to generate the tuples (technically First() will also create an enumerator, but it will immediately return the first element; the full iteration is done by Last()).
Last() is optimised for collections and is O(1).
But as I said, the memory is indeed allocated, and that allocation is O(n).
Last() is optimised for collections and is O(1). But as I said, the memory is indeed allocated, and that allocation is O(n).
I just had a look at the source of Segment() and it uses List<T> internally. So as long as this internal implementation doesn't change, you are correct. But if for whatever reason it changes and no longer uses an indexable collection, Last() would become O(n).
But we can use the other overload:
var nums = new[] { -5, 1, 2, 3, 15, 16, 17, 95, 100, 101, 102, 103, 104 };
var ranges =
    nums.Segment((current, previous, i) => current - previous != 1)
        .Select(g => (First: g.First(), Last: g.Last()));
Console.WriteLine(string.Join(", ", ranges));
@Orace Good one! I had completely forgotten about the other overload of Segment 🤦♂️ and so got distracted by Lag. Your version is far simpler!
Both of your solutions would enumerate the source collection multiple times.
@wertzui As you discovered, Segment yields lists so there is no re-iteration when fetching the first and last elements of a segment. #317 would have been helpful to assert that quickly.
So as long as this internal implementation doesn't change, you are correct. But if for whatever reason it changes and no longer uses an indexable collection, it would become O(n).
It cannot change, not within the frame of LINQ to Objects or an in-memory execution model. A different implementation of Segment could change that, however; for example, one that operates remotely. This is the reason Segment returns a sequence of sequences and not a sequence of lists. Have a look at my comment on #98 for more background on this. I think it's a noble goal not to contractually bind other possible implementations in case we ever bring IQueryable into MoreLINQ, but it hasn't happened in 10 years and I don't think it's going to happen in the foreseeable future either, so I am beginning to reconsider the signatures to be more reflective of, and bound to, the in-memory execution model. In fact, if you look at Window, we already have that. You then wouldn't have to pop the hood to see what the implementation is really doing.
So we are down to just Segment + Select:
public static IEnumerable<(T First, T Last)>
    ConsecutiveRanges<T>(this IEnumerable<T> source, Func<T, T, bool> f) =>
    from g in source.Segment((curr, prev, _) => !f(prev, curr))
    select (g.First(), g.Last());
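For reference, here's that shape end to end. Since MoreLINQ may not be on hand, this sketch includes a minimal stand-in for the Segment((current, previous, index) => …) overload; it's an assumption of how that overload behaves (the predicate starts a new segment when it returns true and is not called for the first element), not MoreLINQ's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Demo
{
    // Minimal stand-in for MoreLINQ's Segment((current, previous, index) => ...) overload,
    // included only so this snippet runs without the MoreLINQ package.
    public static IEnumerable<List<T>> Segment<T>(
        this IEnumerable<T> source, Func<T, T, int, bool> newSegmentPredicate)
    {
        var segment = new List<T>();
        var i = 0;
        foreach (var item in source)
        {
            // Start a new segment when the predicate fires (never for the first element).
            if (segment.Count > 0 && newSegmentPredicate(item, segment[segment.Count - 1], i))
            {
                yield return segment;
                segment = new List<T>();
            }
            segment.Add(item);
            i++;
        }
        if (segment.Count > 0)
            yield return segment;
    }

    // The Segment + Select shape from the comment above.
    public static IEnumerable<(T First, T Last)>
        ConsecutiveRanges<T>(this IEnumerable<T> source, Func<T, T, bool> f) =>
        from g in source.Segment((curr, prev, _) => !f(prev, curr))
        select (g.First(), g.Last());

    static void Main()
    {
        var nums = new[] { -5, 1, 2, 3, 15, 16, 17, 95, 100, 101, 102, 103, 104 };
        Console.WriteLine(string.Join(", ",
            nums.ConsecutiveRanges((prev, curr) => curr - prev == 1)));
        // → (-5, -5), (1, 3), (15, 17), (95, 95), (100, 104)
    }
}
```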
The only downside I see here is that each segment will be entirely committed to memory just to throw away all but its head and tail elements. That could be unacceptable in some scenarios. For example, I can imagine going over data files in a directory that contain time series data and validating that there are no “holes” in the timeline; in other words, that there's only one consecutive segment. If the data across all files amounts to gigabytes in size, then an approach using Segment could exhaust memory whereas a hand-rolled and efficient implementation wouldn't.
If we do this, I would go one step further and actually generalise Segment instead, making it even more powerful by describing how to aggregate each segment. The approach I'd take here is the same one I used with Aggregate over observable comprehensions (see #326 for background). In other words, it would come down to something like this:
var result =
    nums.Segment((curr, prev, i) => curr - prev != 1,
                 s => s.FirstAsync(), // from Rx
                 s => s.LastAsync(),  // from Rx
                 (fst, lst) => (First: fst, Last: lst));
If you also wanted a count then it would be just as simple as doing:
var result =
    nums.Segment((curr, prev, i) => curr - prev != 1,
                 s => s.FirstAsync(),
                 s => s.LastAsync(),
                 s => s.Count(),
                 (fst, lst, cnt) => (First: fst, Last: lst, Count: cnt));