Slow performance of Set.intersects when comparing two sets of different sizes
I have found that Set.intersect have greatly different performance depending on the order of the arguments passed.
Here is a reproduction sample
type SetIntersectBenchmark() =
let longSet = [ for _ in 1 .. 1_000_000 -> Random.Shared.Next() ] |> Set.ofList
let shortSet = Set.ofList [1;2;3;4;5;6;7;8;9;10]
[<Benchmark>]
member _.intersect_long_set_with_short_set() = Set.intersect longSet shortSet
[<Benchmark(Baseline=true)>]
member _.intersect_short_set_with_long_set() = Set.intersect shortSet longSet
Expected behavior
Set.intersect a b should produce the same result as Set.intersect b a. I would expect the performance to be equal.
Actual behavior
Performance differ by several magnitudes
| Method | Job | Arguments | IterationCount | LaunchCount | WarmupCount | Mean | Error | StdDev | Median | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| intersect_long_set_with_short_set | Job-BBQTFN | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 113,804,493.5 ns | 3,993,161.34 ns | 11,584,885.09 ns | 113,180,500.0 ns | 210,173.97 | 29,869.00 | - | 0.00 |
| intersect_short_set_with_long_set | Job-BBQTFN | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 547.0 ns | 18.29 ns | 52.78 ns | 525.3 ns | 1.00 | 0.00 | 40 B | 1.00 |
| intersect_long_set_with_short_set | Job-XIHKCK | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 107,596,475.3 ns | 6,650,312.63 ns | 19,504,215.74 ns | 100,039,800.0 ns | 199,488.09 | 41,392.25 | - | 0.00 |
| intersect_short_set_with_long_set | Job-XIHKCK | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 602.0 ns | 21.11 ns | 61.25 ns | 577.7 ns | 1.11 | 0.16 | 40 B | 1.00 |
| intersect_long_set_with_short_set | Job-MMDXPR | Default | 2 | 2 | 1 | 104,123,106.2 ns | 132,585,779.86 ns | 20,517,796.20 ns | 104,504,000.0 ns | 199,902.65 | 30,687.64 | - | 0.00 |
| intersect_short_set_with_long_set | Job-MMDXPR | Default | 2 | 2 | 1 | 566.6 ns | 268.16 ns | 41.50 ns | 569.6 ns | 1.10 | 0.15 | 40 B | 1.00 |
Proposed change
Due to how comparing two sets are implemented; loop through all elements in the first set, check if it is present in the second set, then add to intersection set, if the first set is huge, then we will have to loop through every item in the set, performing potentially hundred of thousands of comparisons. On the other hand, looping 10 times, checking for membership in a list of 1000000 elements is trivial.
I believe Big O of intersect to be O(na log nb), which indicates that the order of operations will be for small a and large b around 135 operations, while large a and small b around 2.25 million operations.
To improve the performance, simply compare the sizes of the sets passed in and flip the comparison if it is more performant.
Imagine a comparison implemented in a way where it checks only some fields, but not all.
Should Set.intersect a b return objects from a,b or should this be by intention an undefined behavior?
When it comes to docs, this particular scenario is not mentioned. So in this case, it would be a breaking change that has to be documented and announced, but one that could be taken if the benefits outweigh the risks. To see what I mean, these are to docs of LINQ Intersect, it already gives a promise about the behavior (more of an illustration. We are not bound to LINQ in F# and the idiomatic coding styles have different preferences for structural vs custom comparison/equality in C# vs F#)
The intersection of two sets A and B is defined as the set that contains all the elements of A that also appear in B, but no other elements.
When the object returned by this method is enumerated, Intersect yields distinct elements occurring in both sequences in the order in which they appear in first.
This proposal should then also cascade to intersectManywith similar heuristic (sort by size, start by smallest ?)
By current implementation, the item from a is returned and that should probably be kept to avoid breaking changes.
But there shouldn't be any reason why we cannot return the item from set a, even though we do the loop through b, right?
And order of returned items shouldn't matter?
@akselkvitberg :
Yes, that makes sense. If we can maintain that property (and add a test which checks which exact object/reference is in the result), this can for sure be accepted 👍