Replace bool[] sampleSet with HashSet<int>
Found while analyzing a .gcDump: there are 858 bool[] arrays on the LOH, the largest 4 MB each, taking 950 MB of space in total, almost empty and mostly not rooted. Replace with HashSet<int>.
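A minimal sketch of the trade-off described above (the sizes are illustrative, not taken from the .gcDump): a bool[] sized to the maximum node index pays one byte per index up front, easily large enough to land on the LOH, while a HashSet<int> only pays for the indices actually added.

```csharp
using System;
using System.Collections.Generic;

// A dense bool[] allocates the full index range regardless of population:
// 4,000,000 indexes => ~4 MB, well over the 85,000-byte LOH threshold.
bool[] dense = new bool[4_000_000];
dense[42] = true;

// A HashSet<int> grows with the number of elements actually present.
var sparse = new HashSet<int>();
sparse.Add(42);

Console.WriteLine(dense.Length);   // 4000000
Console.WriteLine(sparse.Count);   // 1
```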
I captured an ETW trace afterwards, HashSet<int>
@ninedan The issue arose when I was analyzing gcdump files with ~70-90 million nodes and extremely deep recursion (e.g. linked lists with 1 million+ nodes).
@sharwell, are you worried about HashSet.Add time or HashSet.Contains time? When you're analyzing a .gcDump file with 70-90 million nodes, how many nodes are they sampled down to? How much would the HashSet version cost, relative to total CPU usage?
> how many nodes are they sampled down to
I never do sampling. There were 70-90 million nodes in the visible stacks.
> How much would HashSet version cost, relative to total CPU usage?
I don't remember how much difference this particular change made. The full set of changes resulted in a reduction from several hours of processing down to a few seconds.
@sharwell, check the new change. I'm adding a new class, IndexSet, which automatically switches to a bit set when there are enough elements to make that representation more memory-efficient.
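A minimal sketch of how such a hybrid set might work, with illustrative names and an illustrative switch-over threshold (this is not the PR's actual IndexSet code): start sparse with a HashSet<int>, and migrate to a bit set once the per-element hash-set overhead exceeds the cost of the full bit array.

```csharp
using System;
using System.Collections.Generic;

var set = new IndexSet(capacity: 1_000_000);
Console.WriteLine(set.Add(42));       // True (newly added)
Console.WriteLine(set.Add(42));       // False (already present)
Console.WriteLine(set.Contains(42));  // True
Console.WriteLine(set.Contains(7));   // False

public sealed class IndexSet
{
    private HashSet<int> _sparse = new HashSet<int>();
    private ulong[] _bits;           // null until we switch to bit-set mode
    private readonly int _capacity;  // maximum index + 1

    public IndexSet(int capacity) => _capacity = capacity;

    public bool Add(int index)
    {
        if (_bits != null)
        {
            ulong mask = 1UL << (index & 63);
            bool added = (_bits[index >> 6] & mask) == 0;
            _bits[index >> 6] |= mask;
            return added;
        }

        bool result = _sparse.Add(index);

        // A HashSet<int> entry costs roughly 16 bytes (bucket + entry),
        // while the bit set costs capacity/8 bytes in total. Switch once
        // the bit set becomes the cheaper representation.
        if (_sparse.Count * 16L > _capacity / 8)
        {
            _bits = new ulong[(_capacity + 63) / 64];
            foreach (int i in _sparse)
                _bits[i >> 6] |= 1UL << (i & 63);
            _sparse = null;
        }
        return result;
    }

    public bool Contains(int index) =>
        _bits != null
            ? (_bits[index >> 6] & (1UL << (index & 63))) != 0
            : _sparse.Contains(index);
}
```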
@ninedan I'm not opposed to a change to reduce memory usage, but keep in mind that execution time is the overwhelming dominant concern. If someone were to submit a pull request that improved execution time at the expense of another 40 GiB of runtime memory overhead, I would happily support it (and merge it into my interner branch even if it doesn't make its way to master).
For this change, I'd like to see BenchmarkDotNet results showing that the new IndexSet class is no slower at scale at runtime than the current byte[] approach. If the performance is the same or better, then I would not be opposed to the change.
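Until BenchmarkDotNet results are available, a rough Stopwatch comparison can at least sanity-check the lookup cost. All names and sizes below are illustrative, and Stopwatch timings include JIT and GC noise, so this is a sketch, not a substitute for the requested benchmark:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Compare membership lookups on a dense bool[] versus a HashSet<int>
// over the same sparsely populated index range (~1% full).
int n = 1_000_000;
bool[] dense = new bool[n];
var sparse = new HashSet<int>();
for (int i = 0; i < n; i += 97)
{
    dense[i] = true;
    sparse.Add(i);
}

int denseHits = 0, sparseHits = 0;
var sw = Stopwatch.StartNew();
for (int i = 0; i < n; i++) if (dense[i]) denseHits++;
long denseMs = sw.ElapsedMilliseconds;

sw.Restart();
for (int i = 0; i < n; i++) if (sparse.Contains(i)) sparseHits++;
long sparseMs = sw.ElapsedMilliseconds;

Console.WriteLine(denseHits == sparseHits);  // True: both report the same members
Console.WriteLine($"bool[]: {denseMs} ms, HashSet: {sparseMs} ms");
```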
@sharwell, can you provide a test case showing that CPU time spent in this integer set is of great importance?
I think CPU time is never the real issue. The real issue is always memory. With lots of LOH allocations, GC can be extremely expensive, causing heavy CPU usage. So the key is actually to reduce memory usage, especially LOH allocations. The current solution has at least two issues: 1) resizing the array in a loop, and 2) using a bool[] array where a bit set is enough.
The third issue is that when the range of the indexes is large enough, there can be thousands of big LOH allocations, as I saw in my analysis of a .gcDump file with 3 million objects.
The latest version addresses all three issues.
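The third issue could also be attacked with a paged bit set, sketched below under assumed names and an assumed 8 KB page size (this is illustrative, not the code in this change): no single allocation approaches the 85,000-byte LOH threshold, and pages for untouched index ranges are never allocated.

```csharp
using System;

var bits = new PagedBitSet(3_000_000);
bits.Add(2_999_999);
Console.WriteLine(bits.Contains(2_999_999));  // True
Console.WriteLine(bits.Contains(0));          // False

public sealed class PagedBitSet
{
    // 64K bits per page => 8 KB per ulong[] page, well under the LOH threshold.
    private const int PageBits = 64 * 1024;
    private readonly ulong[][] _pages;

    public PagedBitSet(int capacity) =>
        _pages = new ulong[(capacity + PageBits - 1) / PageBits][];

    public void Add(int index)
    {
        // Allocate the page lazily, only when its index range is first touched.
        ulong[] page = _pages[index / PageBits] ??= new ulong[PageBits / 64];
        page[(index % PageBits) >> 6] |= 1UL << (index & 63);
    }

    public bool Contains(int index)
    {
        ulong[] page = _pages[index / PageBits];
        return page != null && (page[(index % PageBits) >> 6] & (1UL << (index & 63))) != 0;
    }
}
```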
@ninedan Unfortunately this change predated when I was keeping track of specific test scenarios. I'll try to keep a watch for them in the future.
> I think CPU time is never the real issue. The real issue is always memory.
End-to-end execution time on the wall clock is the only consideration I use for making performance changes to address scaling problems in PerfView. Sometimes it's CPU usage for algorithms, and other times it's CPU usage from GC. Whatever it takes to get the application to finish faster without compromising the quality of results is the approach to go with.
I do think that both execution time and memory are concerns here. At this point, it's not clear to me whether we want to take this change. I think a prerequisite to taking this particular change is going to be figuring out what types of test cases we can run against it to see how it behaves under load. I've filed #1041 so that we can figure out what kinds of tests we should run on this.
@ninedan, thank you for the contribution. Let's work through the testing aspect of this and then we can evaluate the change.