[BUG] Memory allocations from flushing L2 can lead to significant delays between benchmark executions

Open jrhemstad opened this issue 3 years ago • 0 comments

Description

In an attempt to gather more accurate timings, NVBench "flushes" the L2 cache by querying the device's L2 cache size, allocating device memory of that size, setting that memory to zero with a memset, and then freeing it.

NVBench does this between every cold iteration. This can be quite expensive when a benchmark has a large number of cold iterations or many points in its axis space. @GregoryKimball reported that this can cause up to a 1.2s delay between iterations, since cudaMalloc/cudaFree can be quite expensive.

Possible Solutions

Add an option to disable flushing the L2 cache.
Avoid allocating/freeing every time and instead make a single allocation per device and memset that same allocation each time.
Enable user to provide their own allocator to allocate the memory used for flushing.
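
To make the options above concrete, here is a rough sketch of how they might fit together. It is not NVBench code: `l2_flush_buffer`, `counting_host_allocator`, `fill_zero`, `set_enabled`, and `allocations_for_cold_iterations` are all hypothetical names made up for illustration. The allocator here uses host memory so the sketch runs without a GPU. A CUDA-backed allocator would presumably wrap cudaMalloc, cudaMemsetAsync, and cudaFree instead, and the buffer size would come from the device's L2 cache size rather than a hard-coded value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for a user-provided allocator. It uses host memory and
// counts calls so the effect of caching is visible. A CUDA-backed version
// would call cudaMalloc / cudaMemsetAsync / cudaFree in these members.
struct counting_host_allocator
{
  int num_allocations = 0;
  int num_frees       = 0;

  void *allocate(std::size_t bytes)
  {
    ++num_allocations;
    return std::malloc(bytes);
  }

  void fill_zero(void *ptr, std::size_t bytes) { std::memset(ptr, 0, bytes); }

  void deallocate(void *ptr)
  {
    ++num_frees;
    std::free(ptr);
  }
};

// Hypothetical helper: owns a single flush buffer and reuses it on every
// flush instead of allocating and freeing each time.
template <typename Allocator>
class l2_flush_buffer
{
public:
  l2_flush_buffer(Allocator &alloc, std::size_t l2_bytes)
      : m_alloc(alloc), m_bytes(l2_bytes)
  {}

  l2_flush_buffer(const l2_flush_buffer &)            = delete;
  l2_flush_buffer &operator=(const l2_flush_buffer &) = delete;

  // The buffer is released once, when the helper goes out of scope.
  ~l2_flush_buffer()
  {
    if (m_ptr != nullptr) { m_alloc.deallocate(m_ptr); }
  }

  // Option 1: allow flushing to be turned off entirely.
  void set_enabled(bool enabled) { m_enabled = enabled; }

  // Option 2: allocate lazily on the first flush, then only memset.
  // Returns false if flushing is disabled or the allocation failed.
  bool flush()
  {
    if (!m_enabled || m_bytes == 0) { return false; }
    if (m_ptr == nullptr)
    {
      m_ptr = m_alloc.allocate(m_bytes);
      if (m_ptr == nullptr) { return false; }
    }
    m_alloc.fill_zero(m_ptr, m_bytes);
    return true;
  }

private:
  Allocator &m_alloc; // Option 3: the caller decides how memory is obtained.
  std::size_t m_bytes;
  void *m_ptr    = nullptr;
  bool m_enabled = true;
};

// Simulates `iterations` cold iterations and returns how many allocations
// were made in total. The 1 MiB size is an arbitrary stand-in for the L2 size.
inline int allocations_for_cold_iterations(int iterations, bool flush_enabled)
{
  counting_host_allocator alloc;
  {
    l2_flush_buffer<counting_host_allocator> buffer(alloc, std::size_t{1} << 20);
    buffer.set_enabled(flush_enabled);
    for (int i = 0; i < iterations; ++i)
    {
      buffer.flush();
      // The benchmark body would run here.
    }
  }
  assert(alloc.num_allocations == alloc.num_frees);
  return alloc.num_allocations;
}
```

With this shape, the allocation cost is paid at most once per buffer rather than once per cold iteration, and the per-iteration cost reduces to the memset. A per-device variant would need one such buffer for each device, which this sketch does not attempt.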

Sep 03 '22 19:09 jrhemstad