Allow to optionally run benchmarks in Parallel

Open adamsitnik opened this issue 7 years ago • 13 comments

We want to be able to run thousands of benchmarks on powerful CI machines (64+ cores), so it would be great to run the benchmarks in parallel.

This should not be the default behavior, and users should be able to enable or disable it for selected benchmarks. For some benchmarks it's a very bad idea ;)

adamsitnik • Apr 10 '18 13:04

Good day! Any progress in this direction?

bazzilic • Jan 30 '22 07:01

Hello @bazzilic!

I am sorry, but we have made no progress.

adamsitnik • Jan 31 '22 07:01

This would certainly make the .NET perf test passes easier and faster, but we would presumably need a way (attribute on method?) to indicate that a benchmark requires > 1 core (such as our parallel benchmarks). BDN would have to do a "join" then run those alone.

[More hypothetically, BDN could conceivably have a mode where it "diagnoses" any benchmarks that seem to use significantly > 1 core when run alone but don't have the attribute.]

danmoseley • Jul 28 '22 19:07

Thinking laterally about a different possible approach: is there a way to affinitize BDN and its child processes to a certain number of cores? Then no other feature would be required. You'd just kick off several BDN instances, each on a subset of the benchmarks and each affinitized to a different CPU mask.

danmoseley • Jul 28 '22 19:07

> is there a way to affinitize BDN and its child processes to certain # of cores?

@danmoseley It's possible to affinitize the child process via --affinity, and it's also possible to start the host process with a utility that sets its affinity.
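The "several BDN instances, each pinned to its own cores" idea could be sketched roughly as below. This is only an illustration in Python, not anything BDN provides: the helper names `shard_masks` and `split_round_robin` are hypothetical, and the sketch assumes an affinity is expressed as a bitmask with one bit per logical core.

```python
def shard_masks(total_cores, cores_per_shard, reserved=1):
    """Return one CPU bitmask per shard.

    The first `reserved` cores are left free (an assumption: they are kept
    for the host process and the OS). Leftover cores that cannot fill a
    whole shard are not used.
    """
    masks = []
    first = reserved
    while first + cores_per_shard <= total_cores:
        mask = 0
        for core in range(first, first + cores_per_shard):
            mask |= 1 << core          # set the bit for this logical core
        masks.append(mask)
        first += cores_per_shard
    return masks


def split_round_robin(items, n):
    """Distribute benchmark names over n shards in round-robin order."""
    return [items[i::n] for i in range(n)]


if __name__ == "__main__":
    masks = shard_masks(total_cores=8, cores_per_shard=2)
    shards = split_round_robin(["A", "B", "C", "D", "E"], len(masks))
    for mask, shard in zip(masks, shards):
        # Each (mask, shard) pair would drive one separately launched runner.
        print(hex(mask), shard)
```

Each mask and shard pair would then be handed to one independently started runner; how the subset is selected and how the mask is passed is left out on purpose.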

A few months ago I prototyped adding such support to BDN. The idea was simple: get all benchmarks, compile the code and start up to N-1 affinitized processes in parallel. As soon as one of them is done, pick up the next available benchmark and run it. Instead of reporting the results by writing to stdout, give each process the path to a file where it should store its results.

https://github.com/dotnet/BenchmarkDotNet/compare/c165ba17501626561297a80fe1b05b2400ce8014...9e2d5c986f434b82aa20b37b7ecda2376a059fec
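The scheduling loop described above (up to N-1 pinned processes, each new benchmark starting as soon as a core frees up, results going to a per-process file) might look roughly like this. It is a minimal Python sketch under stated assumptions, not the linked prototype: `run_all` and the `CHILD` stand-in script are hypothetical, and pinning via `os.sched_setaffinity` is only attempted where the OS offers it (Linux).

```python
import json
import os
import queue
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a benchmark process (hypothetical): it pins itself to the
# given core when possible and writes its result to a file, not to stdout.
CHILD = r"""
import json, os, sys
name, core, out_path = sys.argv[1], int(sys.argv[2]), sys.argv[3]
if hasattr(os, "sched_setaffinity"):       # available on Linux only
    try:
        os.sched_setaffinity(0, {core})
    except OSError:
        pass                               # core not available to this process
with open(out_path, "w") as f:
    json.dump({"benchmark": name, "core": core}, f)
"""


def run_all(benchmarks, out_dir, total_cores=None):
    total_cores = total_cores or os.cpu_count() or 2
    workers = max(1, total_cores - 1)      # N-1: leave one core for the host
    free_cores = queue.Queue()
    for core in range(1, workers + 1):
        free_cores.put(core)

    def run_one(name):
        core = free_cores.get()            # take a core that is currently idle
        try:
            out_path = os.path.join(out_dir, f"{name}.json")
            subprocess.run(
                [sys.executable, "-c", CHILD, name, str(core), out_path],
                check=True,
            )
            with open(out_path) as f:      # results come back through the file
                return json.load(f)
        finally:
            free_cores.put(core)           # hand the core to the next benchmark

    # The pool starts the next benchmark as soon as any worker finishes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, benchmarks))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        for result in run_all(["Parse", "Sort", "Hash"], out_dir):
            print(result)
```

Writing to a file per process avoids interleaved output from concurrent children, which is the reason given above for not using stdout.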

It was very promising:

On my 24-core machine the time to run dotnet/performance benchmarks dropped from 9h 49m to half an hour ;)

I would need some extra time to test it (how sharing the CPU cache affects the benchmark results) and polish it (presumably running them only on physical cores, with no hyper-threading). I expect that it would take me a week of work.

adamsitnik • Jul 29 '22 08:07

> On my 24-core machine the time to run dotnet/performance benchmarks dropped from 9h 49m to half an hour ;)

Wow! What about perf tests that need more than one core, though? E.g. Parallel.For, etc.

danmoseley • Jul 29 '22 15:07

Also, what about in-process benchmarks? I know allocation results will be affected, but I'm not sure what else might be.

And what about P-cores vs E-cores on Intel CPUs (actually, are those even handled now)?

timcassell • Jul 30 '22 00:07

> > On my 24-core machine the time to run dotnet/performance benchmarks dropped from 9h 49m to half an hour ;)
>
> Wow! What about perf tests that need more than one core, though? Eg Parallel.For, etc

I think that could be solved by an int MaxThreads property on the Benchmark attribute that states how many threads the benchmark needs. It would default to 1, and 0 would mean the benchmark should not be run in parallel with other benchmarks (perhaps because its parallelism is dynamic, based on parameters).
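To make the proposed semantics concrete, here is a small Python sketch of how a runner might group benchmarks by their declared thread count. It is a simplification (fixed batches rather than a dynamic scheduler), `plan_batches` is a hypothetical name, and treating a benchmark that asks for more threads than there are cores as exclusive is an assumption of this sketch, not part of the proposal.

```python
def plan_batches(benchmarks, cores):
    """Group benchmarks into batches that can run side by side.

    benchmarks: list of (name, max_threads) pairs, where max_threads == 0
    means "never run alongside other benchmarks".
    cores: number of cores available for running benchmarks.
    """
    # Exclusive benchmarks run alone, one at a time, after everything else.
    exclusive = [name for name, t in benchmarks if t == 0 or t > cores]
    shared = [(name, t) for name, t in benchmarks if 0 < t <= cores]

    batches, current, used = [], [], 0
    for name, threads in shared:
        if used + threads > cores:         # batch is full, start a new one
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += threads
    if current:
        batches.append(current)

    batches.extend([name] for name in exclusive)
    return batches


if __name__ == "__main__":
    plan = plan_batches(
        [("Parse", 1), ("Sort", 1), ("ParallelFor", 0), ("Pipeline", 3), ("Hash", 1)],
        cores=4,
    )
    for batch in plan:
        print(batch)
```

With 4 cores, the single-threaded benchmarks share batches with the 3-thread one where the budget allows, and the benchmark marked 0 runs by itself at the end.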

timcassell • Jul 30 '22 01:07

> What about perf tests that need more than one core, though? Eg Parallel.For, etc

We could extend the [BenchmarkAttribute] with a property that disables parallelization for certain benchmarks (they would be executed sequentially at the end of the benchmark run).

> Also what about in-process benchmarks?

That would not be supported at the beginning, but could be implemented by some contributors by:

  • switching from GC.GetTotalAllocatedBytes to GC.GetAllocatedBytesForCurrentThread for MemoryDiagnoser (it supports both)
  • using ProcessThread.ProcessorAffinity on Windows to affinitize the threads and the pthread_setaffinity_np call on Linux

adamsitnik • Jul 30 '22 08:07

@adamsitnik The prototype seems very promising! I'm hoping this feature makes it in at some point.

RedwoodForest • Mar 03 '23 19:03