Question on bad perf with concurrent copy on a single GPU
We observe bad latency when running concurrent copy_to_mapping on a single GPU, and want to understand the cause (is it a known limit?) before we dive in.
- Env: x86 8163, 2 sockets (AVX2 supported); NVIDIA T4 ×2; PCIe 3.0 x16; CUDA driver 450.82; latest GDRCopy (2022.10)
- Tests: 2 processes (bound to different cores) concurrently running test/copylat on GPU0; each process of course allocates its own host and device memory.
- Result: e.g., at 32KB, each process's gdr_copy_to_mapping averages 6.2 us, vs. 3.2 us with a single process. There is a similar problem at other block sizes (such as 2KB ~ 256KB; I only focus on small blocks). BTW, if the 2 processes target different GPUs, the perf is fine.
Question 1: what is the major cause of such big contention or perf degradation with concurrent gdr_copy_to_mapping? Considering that 32KB is not large, I don't think the PCIe bandwidth is saturated.
Question 2: is there any plan, or is it possible, to optimize concurrent gdr_copy_to_mapping?
Thanks for any feedback.
Hi @Zhaojp-Frank,
I would like to know more about your setup before we dive deeper. Some questions are just to make sure that we have already eliminated external factors.
- You said that you bound 2 processes to different cores. Were they on the same or different CPU sockets?
- Did you also bind the host memory to the core, e.g., by using `numactl -l`?
- How did you make sure that `gdr_copy_to_mapping` of both processes ran concurrently? Starting both processes at the same time does not always mean they will reach the test section at the same time. (See the sketch after this list for one way to force overlap.)
> each process's gdr_copy_to_mapping averages 6.2 us
- Does this number come from averaging the latency of both processes? Or did Process A show 6.2 us and Process B also show 6.2 us?
- How many iterations did you run?
- Did you put the GPU clocks to max? Did you also lock the CPU clock?
- Can you provide the PCIe topology?
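One simple way to guarantee that both processes reach the timed copy loop together is a tiny barrier in POSIX shared memory that each process calls right before the test section. A minimal sketch (a hypothetical helper, not part of copylat; error handling omitted, link with -lrt):

```c
/* Hypothetical helper: spin-barrier in POSIX shared memory so that N
 * processes enter their timed loop at (almost) the same instant.
 * Call it immediately before the timed section in each process. */
#include <fcntl.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <unistd.h>

static void wait_for_peers(int nprocs)
{
    /* All processes open the same shared segment; it starts zero-filled. */
    int fd = shm_open("/copylat_barrier", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(atomic_int));
    atomic_int *cnt = mmap(NULL, sizeof(atomic_int),
                           PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    atomic_fetch_add(cnt, 1);           /* announce arrival */
    while (atomic_load(cnt) < nprocs)   /* busy-wait: low wakeup latency */
        ;
    /* Remember to shm_unlink("/copylat_barrier") between runs. */
}
```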
In general, do you think it's abnormal (i.e., not by design)? I think you can quickly reproduce it by simply modifying the test to use a fixed size (e.g., 32KB/64KB) and disabling the gdr_copy_from_mapping test; see the attached diff for reference. Then launch two processes in the background:

```
CUDA_VISIBLE_DEVICES=0 numactl -l ./copylat -w 10000 -r 0 -s 65536 &
CUDA_VISIBLE_DEVICES=1 numactl -l ./copylat -w 10000 -r 0 -s 65536 &
```
More info:
- I tried cores from the same socket (e.g., cores 0 and 1) and from different sockets (e.g., cores 0 and 48); no obvious difference. There are 48 cores × 2 sockets.
- Tried with numactl -l; no big difference.
- Yeah, I just run the processes in the background with a large enough iteration count (e.g., 10000). What's more, the same setting gets pretty good perf when targeting different GPUs (0 and 1), so whether they are exactly concurrent does not sound like a big deal.
- For the latter case, each process reports similar perf at ~6.2 us (averaged over the iterations within the process).
- Set to 10000 iterations.
- No, I have not set/locked any GPU or CPU clocks.
- `nvidia-smi topo -m` output:

```
$ nvidia-smi topo -m
      GPU0  GPU1  CPU Affinity  NUMA Affinity
GPU0   X    SYS   0-95          N/A
GPU1  SYS    X    0-95          N/A
```
```
$ lspci -tv | grep -i nvidia
-+-[0000:d7]-+-00.0-[d8]----00.0  NVIDIA Corporation Device 1eb8
 +-[0000:5d]-+-00.0-[5e]----00.0  NVIDIA Corporation Device 1eb8
```
```diff
--- copylat-orig.cpp	2022-10-17 22:37:29.944080142 +0800
+++ copylat-simple.cpp	2022-10-20 15:07:44.764855950 +0800
@@ -253,11 +253,10 @@ int main(int argc, char *argv[])
     // gdr_copy_to_mapping benchmark
     cout << endl;
     cout << "gdr_copy_to_mapping num iters for each size: " << num_write_iters << endl;
-
     cout << "WARNING: Measuring the API invocation overhead as observed by the CPU. Data might not be ordered all the way to the GPU internal visibility." << endl;
     // For more information, see
     // https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#sync-behavior
     printf("Test \t\t\t Size(B) \t Avg.Time(us)\n");
-    copy_size = 1;
+    copy_size = size;
     while (copy_size <= size) {
         int iter = 0;
         clock_gettime(MYCLOCK, &beg);
@@ -276,6 +275,7 @@ int main(int argc, char *argv[])
     MB();

     // gdr_copy_from_mapping benchmark
+    /*
     cout << endl;
     cout << "gdr_copy_from_mapping num iters for each size: " << num_read_iters << endl;
     printf("Test \t\t\t Size(B) \t Avg.Time(us)\n");
@@ -290,6 +290,7 @@ int main(int argc, char *argv[])
     printf("gdr_copy_from_mapping \t %8zu \t %11.4f\n", copy_size, lat_us);
     copy_size <<= 1;
     }
+    */
```
For size = 64KB: targeting two GPUs, each process reports ~6.4 us; with both processes on CUDA_VISIBLE_DEVICES=0, each process reports ~12.6 us.

```
CUDA_VISIBLE_DEVICES=0 numactl -l ./copylat -w 10000 -r 0 -s 65536 &
CUDA_VISIBLE_DEVICES=1 numactl -l ./copylat -w 10000 -r 0 -s 65536 &
```
GDRCopy, by design, is for low-latency CPU-GPU communication at small message sizes. It uses the CPU to drive the communication -- as opposed to cudaMemcpy, which uses the GPU copy engine. On many systems, GDRCopy cannot reach the peak BW while cudaMemcpy can. To understand what GDRCopy can deliver on your system, I suggest that you run copybw at various message sizes and plot the BW graph. You might find out that you have already reached the BW limit at that message size.
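In fact, a quick back-of-the-envelope with the latencies reported in this thread points in that direction: the aggregate throughput barely moves between one process and two, which is the classic signature of a shared, saturated resource. (A small check program; the numbers are the ones posted above.)

```c
/* Effective throughput implied by the latencies reported in this thread.
 * If aggregate throughput stays flat going from 1 process to 2, the
 * processes are sharing one saturated resource on the path to GPU0. */
#include <stdio.h>

int main(void)
{
    /* 32KB case from the first post: bytes / us / 1e3 = GB/s */
    printf("32KB, 1 proc      : %.2f GB/s\n", 32768.0 / 3.2 / 1e3);                /* ~10.2 */
    printf("32KB, 2 procs     : %.2f GB/s aggregate\n", 2 * 32768.0 / 6.2 / 1e3);  /* ~10.6 */

    /* 64KB case from the follow-up */
    printf("64KB, 2 GPUs      : %.2f GB/s per link\n", 65536.0 / 6.4 / 1e3);       /* ~10.2 */
    printf("64KB, 1 GPU shared: %.2f GB/s aggregate\n", 2 * 65536.0 / 12.6 / 1e3); /* ~10.4 */
    return 0;
}
```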
On your system, write combining (WC) is likely enabled. WC uses the CPU WC buffer to absorb small messages and flushes out one large PCIe packet. This helps with the performance. However, the WC buffer size and how the buffer is shared across cores depend on the CPU.
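For intuition, a WC-friendly copy pattern looks roughly like the sketch below: non-temporal stores are absorbed by the WC buffer and flushed to the bus as large writes. This illustrates the general technique, not GDRCopy's exact copy code.

```c
/* Illustrative WC-friendly copy to a BAR mapping (not GDRCopy's actual
 * implementation).  Assumes dst is 32-byte aligned and n is a multiple
 * of 32; compile with -mavx. */
#include <immintrin.h>
#include <stddef.h>

static void wc_copy(void *dst, const void *src, size_t n)
{
    __m256i *d = (__m256i *)dst;
    const __m256i *s = (const __m256i *)src;

    for (size_t i = 0; i < n / 32; i++) {
        /* Non-temporal store: bypasses the cache and lands in the CPU's
         * write-combining buffer, which coalesces into large PCIe writes. */
        _mm256_stream_si256(d + i, _mm256_loadu_si256(s + i));
    }
    _mm_sfence();  /* drain the WC buffers so the writes become visible */
}
```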
Putting a process on a far socket can increase the latency. This is because the transactions need to be forwarded through the CPU-CPU link. And that can also cause interference with transactions that originate from the near socket.
I recommend setting the GPU clocks (SM and memory) to max. Otherwise, the GPU internal subsystem may operate at a lower frequency, which delays the response time. Setting the CPU clock to max is also recommended because the CPU is driving the communication. I don't think this is the root cause, however; using the default clocks should not cause the latency to double.
Thanks for sharing your insight; indeed, I care about latency rather than BW in this test case.
Your comment on WC makes great sense. Indeed, it is enabled (shown in the mapping info output).
I just want to validate the WC impact on latency. Do you know how to disable the WC effect, e.g., on a specific device range, or by using other AVX instructions (rather than the stream* ones)?
WC mapping is enabled in the gdrdrv driver. You can comment out these lines to disable it (https://github.com/NVIDIA/gdrcopy/blob/master/src/gdrdrv/gdrdrv.c#L1190-L1197). The default on x86 should then be an uncached (UC) mapping. You will probably see higher latency with UC at the sizes you mentioned.
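Conceptually, the difference is just which page protection the kernel driver applies when it maps the BAR. The sketch below is an illustration of that idea, not the actual gdrdrv source (see the link above for the real code):

```c
/* Illustrative Linux-driver mmap showing where a WC-vs-UC decision is
 * made.  ENABLE_WC_MAPPING is a hypothetical switch standing in for the
 * gdrdrv lines linked above. */
#include <linux/fs.h>
#include <linux/mm.h>

static int example_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

#ifdef ENABLE_WC_MAPPING
    /* WC: stores can coalesce into large PCIe writes */
    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
#else
    /* UC: every store becomes its own bus transaction -- much slower */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
#endif

    return io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                              size, vma->vm_page_prot);
}
```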
Well, if I comment out the WC enabling, the latency is terrible (220+ us) no matter whether one or two processes run. It doesn't resolve the contention problem; it makes things worse.
I'm wondering if there are other clues for improving concurrent gdr_copy_to_mapping.
Have you already measured the BW? If you are limited by the BW, there is not much we can do. As mentioned, the peak BW GDRCopy can achieve may be lower than the PCIe BW on your system.
You may be able to get a bit more performance by playing with the copy algorithm. Depending on the system (CPU, topology, and other factors), changing the algorithm from AVX to something else might help. But I don't expect it to completely solve your problem of doubled latency with two processes.
OK, I'll measure the BW as well and post it later.