fsharp icon indicating copy to clipboard operation
fsharp copied to clipboard

Slow performance of Set.intersects when comparing two sets of different sizes

Open akselkvitberg opened this issue 1 month ago • 3 comments

I have found that Set.intersect have greatly different performance depending on the order of the arguments passed.

Here is a reproduction sample

type SetIntersectBenchmark() =
    let longSet = [ for _ in 1 .. 1_000_000 -> Random.Shared.Next() ] |> Set.ofList
    let shortSet = Set.ofList [1;2;3;4;5;6;7;8;9;10]

    [<Benchmark>]
    member _.intersect_long_set_with_short_set() = Set.intersect longSet shortSet
    
    [<Benchmark(Baseline=true)>]
    member _.intersect_short_set_with_long_set() = Set.intersect shortSet longSet

Expected behavior

Set.intersect a b should produce the same result as Set.intersect b a. I would expect the performance to be equal.

Actual behavior

Performance differ by several magnitudes

Method Job Arguments IterationCount LaunchCount WarmupCount Mean Error StdDev Median Ratio RatioSD Allocated Alloc Ratio
intersect_long_set_with_short_set Job-BBQTFN /p:BUILDING_USING_DOTNET=true Default Default Default 113,804,493.5 ns 3,993,161.34 ns 11,584,885.09 ns 113,180,500.0 ns 210,173.97 29,869.00 - 0.00
intersect_short_set_with_long_set Job-BBQTFN /p:BUILDING_USING_DOTNET=true Default Default Default 547.0 ns 18.29 ns 52.78 ns 525.3 ns 1.00 0.00 40 B 1.00
intersect_long_set_with_short_set Job-XIHKCK /p:BUILDING_USING_DOTNET=true Default Default Default 107,596,475.3 ns 6,650,312.63 ns 19,504,215.74 ns 100,039,800.0 ns 199,488.09 41,392.25 - 0.00
intersect_short_set_with_long_set Job-XIHKCK /p:BUILDING_USING_DOTNET=true Default Default Default 602.0 ns 21.11 ns 61.25 ns 577.7 ns 1.11 0.16 40 B 1.00
intersect_long_set_with_short_set Job-MMDXPR Default 2 2 1 104,123,106.2 ns 132,585,779.86 ns 20,517,796.20 ns 104,504,000.0 ns 199,902.65 30,687.64 - 0.00
intersect_short_set_with_long_set Job-MMDXPR Default 2 2 1 566.6 ns 268.16 ns 41.50 ns 569.6 ns 1.10 0.15 40 B 1.00

Proposed change

Due to how comparing two sets are implemented; loop through all elements in the first set, check if it is present in the second set, then add to intersection set, if the first set is huge, then we will have to loop through every item in the set, performing potentially hundred of thousands of comparisons. On the other hand, looping 10 times, checking for membership in a list of 1000000 elements is trivial.

I believe Big O of intersect to be O(na log nb), which indicates that the order of operations will be for small a and large b around 135 operations, while large a and small b around 2.25 million operations.

To improve the performance, simply compare the sizes of the sets passed in and flip the comparison if it is more performant.

akselkvitberg avatar Dec 05 '25 11:12 akselkvitberg

Imagine a comparison implemented in a way where it checks only some fields, but not all. Should Set.intersect a b return objects from a,b or should this be by intention an undefined behavior?

When it comes to docs, this particular scenario is not mentioned. So in this case, it would be a breaking change that has to be documented and announced, but one that could be taken if the benefits outweigh the risks. To see what I mean, these are to docs of LINQ Intersect, it already gives a promise about the behavior (more of an illustration. We are not bound to LINQ in F# and the idiomatic coding styles have different preferences for structural vs custom comparison/equality in C# vs F#)

The intersection of two sets A and B is defined as the set that contains all the elements of A that also appear in B, but no other elements.
When the object returned by this method is enumerated, Intersect yields distinct elements occurring in both sequences in the order in which they appear in first.

This proposal should then also cascade to intersectManywith similar heuristic (sort by size, start by smallest ?)

T-Gro avatar Dec 05 '25 13:12 T-Gro

By current implementation, the item from a is returned and that should probably be kept to avoid breaking changes. But there shouldn't be any reason why we cannot return the item from set a, even though we do the loop through b, right? And order of returned items shouldn't matter?

akselkvitberg avatar Dec 09 '25 12:12 akselkvitberg

@akselkvitberg :

Yes, that makes sense. If we can maintain that property (and add a test which checks which exact object/reference is in the result), this can for sure be accepted 👍

T-Gro avatar Dec 11 '25 08:12 T-Gro