MoreLINQ icon indicating copy to clipboard operation
MoreLINQ copied to clipboard

Find ranges

Open wertzui opened this issue 5 years ago • 8 comments

It would be handy to have a method to find ranges of consecutive Numbers (int or long) in a given (ordered) collection.

Something like public static IEnumerable<Tuple<long, long>> FindConsecutiveRanges(this IEnumerable<long> source)

wertzui avatar Nov 01 '19 10:11 wertzui

I believe you can achieve this with a combination of Lag and Segment:

var nums = new[] { -5, 1, 2, 3, 15, 16, 17, 95, 100, 101, 102, 103, 104 };

var ranges =
    nums.Select(n => (Some: true, Num: n))
        .Lag(1, (curr, prev) => (Curr: curr, Prev: prev))
        .Segment(e => !e.Prev.Some || e.Curr.Num - e.Prev.Num != 1)
        .Select(g => (First: g.First().Curr.Num, Last: g.Last().Curr.Num));
        
Console.WriteLine(string.Join(", ", ranges));

The output will be:

(-5, -5), (1, 3), (15, 17), (95, 95), (100, 104)

One could then extrapolate the following generic implementation:

public static IEnumerable<(T, T)> ConsecutiveRanges<T>(this IEnumerable<T> source,
                                                       Func<T, T, bool> f) =>
    source.Select(n => (Some: true, Num: n))
          .Lag(1, (curr, prev) => (Curr: curr, Prev: prev))
          .Segment(e => !e.Prev.Some || !f(e.Prev.Num, e.Curr.Num))
          .Select(g => (First: g.First().Curr.Num, Last: g.Last().Curr.Num));

atifaziz avatar Nov 01 '19 12:11 atifaziz

I think you can avoid the introduction of Tuple with Lead and SkipLast:

var nums = new[] { -5, 1, 2, 3, 15, 16, 17, 95, 100, 101, 102, 103, 104 };
var ranges =
    nums.Lead(1, (curr, next) => (Curr: curr, Next: next))  // ValueTuple creation overload welcome here
        .SkipLast(1) // SkipLast() doesn't exist
        .Segment(e => e.Next - e.Curr != 1)
        .Select(g => (First: g.First().Curr, Last: g.Last().Curr));
Console.WriteLine(string.Join(", ", ranges));

Eye compiled, I can't test it right now.

Edit

It will not work because the last element will be missing. But we can use the other overload:

var nums = new[] { -5, 1, 2, 3, 15, 16, 17, 95, 100, 101, 102, 103, 104 };
var ranges =
    nums.Segment((current, previous,i) => current-previous != 1)
        .Select(g => (First: g.First(), Last: g.Last()));
Console.WriteLine(string.Join(", ", ranges));

Related #632

Orace avatar Nov 01 '19 18:11 Orace

Both of your Solutions would enumerate the source collection multiple times. But finding ranges can be achieved with just one enumeration.

    public static IEnumerable<(long Start, long End)> FindConsecutiveRanges(this IEnumerable<long> values)
    {
        if (values == null)
            throw new ArgumentNullException(nameof(values));

        using (var enumerator = values.GetEnumerator())
        {
            if (!enumerator.MoveNext())
                yield break;

            var rangeStart = enumerator.Current;
            var rangeEnd = rangeStart;

            while (enumerator.MoveNext())
            {
                var value = enumerator.Current;
                if (value == rangeEnd + 1)
                {
                    rangeEnd++;
                }
                else if (value <= rangeEnd)
                {
                    throw new ArgumentException("The input must be sorted.", nameof(values));
                }
                else
                {
                    yield return (rangeStart, rangeEnd);
                    rangeStart = value;
                    rangeEnd = rangeStart;
                }
            }

            yield return (rangeStart, rangeEnd);
        }
    }

wertzui avatar Nov 02 '19 07:11 wertzui

Both of your Solutions would enumerate the source collection multiple times.

I don't think so, both Lead and Lag use a Queue (here of size 1) in their implementation to avoid multiple iterations.

Anyway, the code you provided show that our implementations are not optimal in memory.

Orace avatar Nov 02 '19 08:11 Orace

Both of your Solutions would enumerate the source collection multiple times.

I don't think so, both Lead and Lag use a Queue (here of size 1) in their implementation to avoid multiple iterations.

One iteration to generate the range-collection and then each range-collection is iterated to generate the Tuples (technically First() will also generate an Enumerator but will immediately return the first element, the full iteration is done by Last()).

wertzui avatar Nov 02 '19 08:11 wertzui

Last() is optimised for collection and O(1). But as I said, the memory is indeed allocated and indeed the allocation is O(n).

Orace avatar Nov 02 '19 08:11 Orace

Last() is optimised for collection and O(1). But as I said, the memory is indeed allocated and indeed the allocation is O(n).

I just had a look at the source of Segment() and it is using List<T> internally. So as long, as this internal implementation won't change, you are correct. But when for whatever reason this changes and is no longer using an indexable collection, it would become O(n).

wertzui avatar Nov 02 '19 08:11 wertzui

But we can use the other overload:

var nums = new[] { -5, 1, 2, 3, 15, 16, 17, 95, 100, 101, 102, 103, 104 };
var ranges =
    nums.Segment((current, previous,i) => current-previous != 1)
        .Select(g => (First: g.First(), Last: g.Last()));
Console.WriteLine(string.Join(", ", ranges));

@Orace Good one! I had completely forgotten about the other overload of Segment 🤦‍♂and so got distracted by Lag. Your version is far simpler!

Both of your Solutions would enumerate the source collection multiple times.

@wertzui As you discovered, Segment yields lists so there is no re-iteration when fetching the first and last elements of a segment. #317 would have been helpful to assert that quickly.

So as long, as this internal implementation won't change, you are correct. But when for whatever reason this changes and is no longer using an indexable collection, it would become O(n).

It cannot change, not within the frame of LINQ to Objects or an in-memory execution model. A different implementation of Segement can change that however; for example, one that operates remotely. This is the reason Segment returns a sequence of sequences and not a sequence of lists. Have a look at my comment on #98 for more background on this. I think it's a noble goal to not contractually bind other possible implementations in case we ever bring IQueryable into MoreLINQ but it hasn't happened in 10 years and I don't think it's going to happen in the foreseeable future either so I am beginning to reconsider the signatures to be more reflective and bound to the in-memory execution model. In fact, if you look at Window, we already have that. You then wouldn't have to go popping the hood to look at what the implementation is really doing.

So we are down to just Segement + Select:

public static IEnumerable<(T First, T Last)>
    ConsecutiveRanges<T>(this IEnumerable<T> source, Func<T, T, bool> f) =>
        from g in source.Segment((curr, prev, _) => !f(prev, curr))
        select (g.First(), g.Last());

The only downside I see here is that each segment will be entirely committed to memory just to throw away all but its head and tail elements. That could be unacceptable in some scenarios. For example, I can imagine going over data files in a directory that contain time series data and validate that there are no “holes” in the timeline; in other words that there's only a consecutive segment. If the data across all files amounts to gigabytes in size then an approach using Segment could exhaust memory whereas a hand-rolled and efficient implementation wouldn't.

If we do this, I would go one step further and actually generalise Segment instead and make it even more powerful by describing how to aggregate each segment. The approach I see taking here is the same I did with Aggregate over observable comprehensions (see #326 for background). In other words, it would come down to something like this:

var result =
    nums.Segment((curr, prev, i) => curr - prev != 1,
                 s => s.FirstAsync(), // from Rx
                 s => s.LastAsync(),  // from Rx
                 (fst, lst) => (First: fst, Last: lst));

If you also wanted a count then it would be just as simple as doing:

var result =
    nums.Segment((curr, prev, i) => curr - prev != 1,
                 s => s.FirstAsync(),
                 s => s.LastAsync(),
                 s => s.Count(),
                 (fst, lst, cnt) => (First: fst, Last: lst, Count: cnt));

atifaziz avatar Nov 02 '19 09:11 atifaziz