MoreLINQ icon indicating copy to clipboard operation
MoreLINQ copied to clipboard

IntersectBy Operator

Open atifaziz opened this issue 8 years ago • 11 comments

Similar to DistinctBy but with the functionality of Intersect

so Something like this:

var emails = GetAllEmails();
var uniqueEmails = emails.DistinctBy(x => x.Email).ToList();
var duplicates = emails.IntersectBy(uniqueEmails, x => x.EmailAddress);

Thanks


Originally reported on Google Code with ID 63

Reported by jalchr on 2011-02-14 09:48:06

atifaziz avatar Aug 21 '15 18:08 atifaziz

Or ExceptBy operator


Reported by jalchr on 2011-02-14 10:45:32

atifaziz avatar Aug 21 '15 18:08 atifaziz

I'm not sure there is much advantage over the alternative:

var emails = GetAllEmails();
var uniqueEmails = emails.DistinctBy(x => x.Email).ToList();
var duplicates = emails.Where(e => uniqueEmails.Contains(e.EmailAddress));

Is the increase in readability significant?

fsateler avatar Apr 07 '16 13:04 fsateler

If you put Contains() inside the predicate of Where(), it will get called once per iteration, which should put execution time around O(n^2). An IntersectBy operator can be optimized to run a lot faster against large collections.

Also, the example would need to be modified to behave as "IntersectBy" rather than just "Intersect".

var items= GetAllItems();
var uniqueValues= items.Select(x => x.Property).Distinct().ToList();
var duplicateItems = items.Where(e => !uniqueValues.Contains(e.Property));

JamesFaix avatar Apr 07 '16 22:04 JamesFaix

Actually, that would be m×n, but that can be solved by using a hashset instead of a list.

fsateler avatar Apr 07 '16 23:04 fsateler

You are correct on both counts.

But more to the point, we're now looking at a combination of about 5 function calls that require the client to worry about implementation details for optimizing performance, which could all be replaced by one sleek extension method. This is what LINQ is for.

JamesFaix avatar Apr 09 '16 18:04 JamesFaix

public static IEnumerable<TSource> IntersectBy<TSource, TKey>(
    this IEnumerable<TSource> first, 
    IEnumerable<TSource> second,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> keyComparer = null) {

        if (first == null) throw new ArgumentNullException("first");
        if (second== null) throw new ArgumentNullException("second");
        if (keySelector== null) throw new ArgumentNullException("keySelector");

        keyComparer = keyComparer ?? EqualityComparer<TKey>.Default;

        var keys = new HashSet<TKey>(second.Select(keySelector), keyComparer);
        foreach (var item in first) {
            var k = keySelector(item);
            if (keys.Contains(k)) {
                yield return item;
                keys.Remove(k);
            }
        }
}

JamesFaix avatar Apr 09 '16 18:04 JamesFaix

@JamesFaix that has the problem of iterating the source list twice

public static IEnumerable<TSource> IntersectByImpl<TSource, TKey>(
    this IEnumerable<TSource> first, 
    IEnumerable<TSource> second,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> keyComparer = null) {

        keyComparer = keyComparer ?? EqualityComparer<TKey>.Default;

        var keys = new HashSet<TKey>(keyComparer);
        foreach (var item in first) {
            var k = keySelector(item);
            if (keys.Add(k)) {
                yield return item;
                keys.Remove(k);
            }
        }
}

A separate overload where the sequences are not of the same type (eg, the second can be IEnumerable <TKey>) could be useful as well.

fsateler avatar Apr 10 '16 16:04 fsateler

It doesn't exactly iterate the source twice, it iterates each source list once, and then does some partial iterations on the HashSet when calling Contains. I believe internally there is some Contains-type overhead when calling HashSet.Add, since it must check for duplicate values some how.

I agree, the overload is very useful. But your implementation looks like the Distinct operator. You never use second.

    public static IEnumerable<TSource> IntersectBy<TSource, TKey>(
        this IEnumerable<TSource> first,
        IEnumerable<TKey> second,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey> keyComparer = null) {

        keyComparer = keyComparer ?? EqualityComparer<TKey>.Default;

        var keys = new HashSet<TKey>(second, keyComparer);
        foreach (var item in first) {
            var k = keySelector(item);
            if (keys.Contains(k)) {
                yield return item;
                keys.Remove(k);
            }
        }
    }

JamesFaix avatar Apr 11 '16 23:04 JamesFaix

The PR seems to be dead since 2016. Is it blocked for any reason? The IntersectBy overload is very useful.

shravan2x avatar Mar 31 '18 03:03 shravan2x

Seconded - came here after wondering why we get ExceptBy but not IntersectBy.

NeilMacMullen avatar May 13 '19 13:05 NeilMacMullen

I think IntersectBy would be very useful as well. Especially considering that ExceptBy is now missing its friend :'(

bert2 avatar Jul 11 '19 12:07 bert2