Davide Rossetti
Optimized memcpy implementations should be chosen at run time during a tuning phase, possibly in gdr_open().
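A rough sketch of what such a tuning phase could look like: time each candidate copy routine once and keep a pointer to the fastest. Everything named here (copy_fn, copy_bytewise, copy_libc, tune_copy) is hypothetical and not part of any existing API, and the buffers are plain host memory for illustration; a real tuning pass would have to measure against the actual mapped target.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by all candidate copy routines (hypothetical). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical candidate 1: plain byte-by-byte copy. */
static void *copy_bytewise(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* Hypothetical candidate 2: defer to the libc memcpy. */
static void *copy_libc(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Monotonic wall-clock time in seconds. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run every candidate `iters` times on the given buffers and
 * return the one with the lowest total elapsed time. */
copy_fn tune_copy(void *dst, const void *src, size_t n, int iters)
{
    copy_fn candidates[] = { copy_bytewise, copy_libc };
    copy_fn best = NULL;
    double best_time = 0.0;

    assert(iters > 0);
    for (size_t c = 0; c < sizeof(candidates) / sizeof(candidates[0]); c++) {
        double t0 = now_sec();
        for (int i = 0; i < iters; i++)
            candidates[c](dst, src, n);
        double elapsed = now_sec() - t0;
        if (best == NULL || elapsed < best_time) {
            best = candidates[c];
            best_time = elapsed;
        }
    }
    return best;
}
```

The selected pointer would be stored once, for example in the handle returned at open time, and used for every later copy, so the tuning cost is paid only once per process.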
The problem is in run_iter_bw_infinitely(), where the call to pthread_sigmask() happens too late: by then the CUDA driver has already been initialized and its worker thread launched. The solution is to move...
see https://github.com/linux-rdma/perftest/blob/6369e620429197f7cc0b6bfcb9734fe70f0b92f0/src/perftest_resources.c#L4222
perftest should support benchmarking of these new kinds of memory. There are two basic variants of CUDA Unified Memory: 1. managed memory, as allocated via cudaMallocManaged(); 2. system-allocated memory,...
send_lat could easily support CUDA device memory as source/sink. write_lat cannot do so as easily, because it relies on the CPU polling the memory directly.
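To illustrate the kind of CPU polling meant here, a small host-only sketch: one thread stands in for the incoming write and the caller spins on a byte until the value shows up. The names (landing_byte, fake_remote_writer, poll_until_written) are hypothetical and this is not perftest code; the point is only that the spinning load is issued by the CPU, which is straightforward for host memory and not for memory that lives on the device.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Byte the "receiver" polls; stands in for the last byte of a receive buffer. */
static _Atomic unsigned char landing_byte;

/* Stand-in for the remote side: stores the value the receiver is waiting for. */
static void *fake_remote_writer(void *arg)
{
    atomic_store(&landing_byte, *(unsigned char *)arg);
    return NULL;
}

/* Spin on the byte from the CPU until `value` (nonzero) is observed,
 * then return what was read. Returns 0 if the writer could not be started. */
unsigned char poll_until_written(unsigned char value)
{
    pthread_t tid;
    unsigned char seen;

    assert(value != 0);
    atomic_store(&landing_byte, 0);
    if (pthread_create(&tid, NULL, fake_remote_writer, &value) != 0)
        return 0;

    /* Busy-wait: every iteration is a CPU load from the polled location. */
    while ((seen = atomic_load(&landing_byte)) != value)
        ;

    pthread_join(tid, NULL);
    return seen;
}
```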
This is bad: people believe they are running a build with GPU support and then, for example, report unreasonable performance numbers.