Question on bad perf with concurrent copy on a single GPU
We observe bad latency when running concurrent copy_to_mapping on a single GPU, and want to understand the cause (is it a known limit?) before we dive in.
- Env: x86 8163, 2 sockets (AVX2 supported); NVIDIA T4 ×2; PCIe 3.0 x16; CUDA driver 450.82; latest GDRCopy (2022.10)
- Tests: 2 processes (bound to different cores) concurrently running test/copylat on GPU0; each process of course allocates its own host and device memory.
- Result: e.g., at 32KB, each process's gdr_copy_to_mapping averages 6.2 us, vs. 3.2 us with a single process. There is a similar problem at other block sizes (such as 2KB ~ 256KB; I only focus on small blocks). BTW, if the 2 processes target different GPUs, the perf is fine.
Question 1: what is the major cause of such big contention or perf degradation with concurrent gdr_copy_to_mapping? Considering that 32KB is not large, I don't think the PCIe bandwidth is saturated.
Question 2: is there any plan, or is it possible, to optimize concurrent gdr_copy_to_mapping?
Thanks for any feedback.
Hi @Zhaojp-Frank,
I would like to know more about your setup before we dive deeper. Some questions are just to make sure that we have already eliminated external factors.
- You said that you bound 2 processes to different cores. Were they on the same or different CPU sockets?
- Did you also bind the host memory to the core, e.g., by using `numactl -l`?
- How did you make sure that `gdr_copy_to_mapping` of both processes ran concurrently? Starting both processes at the same time does not always mean they will reach the test section at the same time. (See the sketch after this list for one way to force overlap.)
> each process's gdr_copy_to_mapping averages 6.2 us
- Does this number come from averaging the latency of both processes? Or did Process A show 6.2 us and Process B also show 6.2 us?
- How many iterations did you run?
- Did you put the GPU clocks to max? Did you also lock the CPU clock?
- Can you provide the PCIe topology?
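One simple way to guarantee that both processes reach the timed copy loop together is a tiny barrier in POSIX shared memory that each process calls right before the test section. A minimal sketch (a hypothetical helper, not part of copylat; error handling omitted, link with -lrt):

```c
/* Hypothetical helper: spin-barrier in POSIX shared memory so that N
 * processes enter their timed loop at (almost) the same instant.
 * Call it immediately before the timed section in each process. */
#include <fcntl.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <unistd.h>

static void wait_for_peers(int nprocs)
{
    /* All processes open the same shared segment; it starts zero-filled. */
    int fd = shm_open("/copylat_barrier", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(atomic_int));
    atomic_int *cnt = mmap(NULL, sizeof(atomic_int),
                           PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    atomic_fetch_add(cnt, 1);           /* announce arrival */
    while (atomic_load(cnt) < nprocs)   /* busy-wait: low wakeup latency */
        ;
    /* Remember to shm_unlink("/copylat_barrier") between runs. */
}
```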
In general, do you think it's abnormal (i.e., not by design)? I think you can quickly reproduce it by simply modifying the test to use a fixed size (e.g., 32KB/64KB) and disabling the gdr_copy_from_mapping test; see the attached diff for reference. Then launch two processes in the background:

```
CUDA_VISIBLE_DEVICES=0 numactl -l ./copylat -w 10000 -r 0 -s 65536 &
CUDA_VISIBLE_DEVICES=1 numactl -l ./copylat -w 10000 -r 0 -s 65536 &
```
More info:
- I tried cores from the same socket (e.g., cores 0 and 1) and from different sockets (e.g., cores 0 and 48); no obvious difference. There are 48 cores × 2 sockets.
- Tried with numactl -l; no big difference.
- Yeah, I just run the processes in the background with a large enough iteration count (e.g., 10000). What's more, the same setting gets pretty good perf when targeting different GPUs (0 and 1), so whether they are exactly concurrent does not sound like a big deal.
- For the latter case, each process reports similar perf at ~6.2 us (averaged over the iterations within the process).
- Set to 10000 iterations.
- No, I have not set/locked any GPU or CPU clocks.
- `nvidia-smi topo -m` output:

```
$ nvidia-smi topo -m
      GPU0  GPU1  CPU Affinity  NUMA Affinity
GPU0   X    SYS   0-95          N/A
GPU1  SYS    X    0-95          N/A
```
```
$ lspci -tv | grep -i nvidia
-+-[0000:d7]-+-00.0-[d8]----00.0  NVIDIA Corporation Device 1eb8
 +-[0000:5d]-+-00.0-[5e]----00.0  NVIDIA Corporation Device 1eb8
```
```diff
--- copylat-orig.cpp	2022-10-17 22:37:29.944080142 +0800
+++ copylat-simple.cpp	2022-10-20 15:07:44.764855950 +0800
@@ -253,11 +253,10 @@ int main(int argc, char *argv[])
     // gdr_copy_to_mapping benchmark
     cout << endl;
     cout << "gdr_copy_to_mapping num iters for each size: " << num_write_iters << endl;
-
     cout << "WARNING: Measuring the API invocation overhead as observed by the CPU. Data might not be ordered all the way to the GPU internal visibility." << endl;
     // For more information, see
     // https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#sync-behavior
     printf("Test \t\t\t Size(B) \t Avg.Time(us)\n");
-    copy_size = 1;
+    copy_size = size;
     while (copy_size <= size) {
         int iter = 0;
         clock_gettime(MYCLOCK, &beg);
@@ -276,6 +275,7 @@ int main(int argc, char *argv[])
     MB();

     // gdr_copy_from_mapping benchmark
+    /*
     cout << endl;
     cout << "gdr_copy_from_mapping num iters for each size: " << num_read_iters << endl;
     printf("Test \t\t\t Size(B) \t Avg.Time(us)\n");
@@ -290,6 +290,7 @@ int main(int argc, char *argv[])
     printf("gdr_copy_from_mapping \t %8zu \t %11.4f\n", copy_size, lat_us);
     copy_size <<= 1;
     }
+    */
```
For size = 64KB: targeting two GPUs, each process reports ~6.4 us; with both processes on CUDA_VISIBLE_DEVICES=0, each process reports ~12.6 us.

```
CUDA_VISIBLE_DEVICES=0 numactl -l ./copylat -w 10000 -r 0 -s 65536 &
CUDA_VISIBLE_DEVICES=1 numactl -l ./copylat -w 10000 -r 0 -s 65536 &
```
GDRCopy, by design, is for low-latency CPU-GPU communication at small message sizes. It uses the CPU to drive the communication -- as opposed to cudaMemcpy, which uses the GPU copy engine. On many systems, GDRCopy cannot reach the peak BW while cudaMemcpy can. To understand what GDRCopy can deliver on your system, I suggest that you run copybw at various message sizes and plot the BW graph. You might find out that you have already reached the BW limit at that message size.
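In fact, a quick back-of-the-envelope with the latencies reported in this thread points in that direction: the aggregate throughput barely moves between one process and two, which is the classic signature of a shared, saturated resource. (A small check program; the numbers are the ones posted above.)

```c
/* Effective throughput implied by the latencies reported in this thread.
 * If aggregate throughput stays flat going from 1 process to 2, the
 * processes are sharing one saturated resource on the path to GPU0. */
#include <stdio.h>

int main(void)
{
    /* 32KB case from the first post: bytes / us / 1e3 = GB/s */
    printf("32KB, 1 proc      : %.2f GB/s\n", 32768.0 / 3.2 / 1e3);                /* ~10.2 */
    printf("32KB, 2 procs     : %.2f GB/s aggregate\n", 2 * 32768.0 / 6.2 / 1e3);  /* ~10.6 */

    /* 64KB case from the follow-up */
    printf("64KB, 2 GPUs      : %.2f GB/s per link\n", 65536.0 / 6.4 / 1e3);       /* ~10.2 */
    printf("64KB, 1 GPU shared: %.2f GB/s aggregate\n", 2 * 65536.0 / 12.6 / 1e3); /* ~10.4 */
    return 0;
}
```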
On your system, write combining (WC) is likely enabled. WC uses the CPU WC buffer to absorb small messages and flushes out one large PCIe packet. This helps with the performance. However, the WC buffer size and how the buffer is shared across cores depend on the CPU.
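For intuition, a WC-friendly copy pattern looks roughly like the sketch below: non-temporal stores are absorbed by the WC buffer and flushed to the bus as large writes. This illustrates the general technique, not GDRCopy's exact copy code.

```c
/* Illustrative WC-friendly copy to a BAR mapping (not GDRCopy's actual
 * implementation).  Assumes dst is 32-byte aligned and n is a multiple
 * of 32; compile with -mavx. */
#include <immintrin.h>
#include <stddef.h>

static void wc_copy(void *dst, const void *src, size_t n)
{
    __m256i *d = (__m256i *)dst;
    const __m256i *s = (const __m256i *)src;

    for (size_t i = 0; i < n / 32; i++) {
        /* Non-temporal store: bypasses the cache and lands in the CPU's
         * write-combining buffer, which coalesces into large PCIe writes. */
        _mm256_stream_si256(d + i, _mm256_loadu_si256(s + i));
    }
    _mm_sfence();  /* drain the WC buffers so the writes become visible */
}
```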
Putting a process on a far socket can increase the latency. This is because the transactions need to be forwarded through the CPU-CPU link. And that can also cause interference with transactions that originate from the near socket.
I recommend setting the GPU clocks (SM and memory) to max. Otherwise, the GPU internal subsystem may operate at a lower frequency, which delays the response time. Setting the CPU clock to max is also recommended because the CPU is driving the communication. I don't think this is the root cause, however; using the default clocks should not cause the latency to double.
Thanks for sharing your insight; indeed, I care about latency rather than BW in this test case.
Your comment on WC makes great sense. Indeed, it is enabled (shown in the mapping info output).
I just want to validate the WC impact on latency. Do you know how to disable the WC effect, e.g., on a specific device range, or by using other AVX instructions (rather than the stream* ones)?
WC mapping is enabled in the gdrdrv driver. You can comment out these lines to disable it (https://github.com/NVIDIA/gdrcopy/blob/master/src/gdrdrv/gdrdrv.c#L1190-L1197). The default on x86 should then be an uncached (UC) mapping. You will probably see higher latency with UC at the sizes you mentioned.
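Conceptually, the difference is just which page protection the kernel driver applies when it maps the BAR. The sketch below is an illustration of that idea, not the actual gdrdrv source (see the link above for the real code):

```c
/* Illustrative Linux-driver mmap showing where a WC-vs-UC decision is
 * made.  ENABLE_WC_MAPPING is a hypothetical switch standing in for the
 * gdrdrv lines linked above. */
#include <linux/fs.h>
#include <linux/mm.h>

static int example_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

#ifdef ENABLE_WC_MAPPING
    /* WC: stores can coalesce into large PCIe writes */
    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
#else
    /* UC: every store becomes its own bus transaction -- much slower */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
#endif

    return io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                              size, vma->vm_page_prot);
}
```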
Well, if I comment out the WC enabling, the latency is terrible (220+ us) no matter whether one or two processes run. It doesn't resolve the contention problem; it makes things worse.
I'm wondering if there are other clues for improving concurrent gdr_copy_to_mapping.
Have you already measured the BW? If you are limited by the BW, there is not much we can do. As mentioned, the peak BW GDRCopy can achieve may be lower than the PCIe BW on your system.
You may be able to get a bit more performance by playing with the copy algorithm. Depending on the system (CPU, topology, and other factors), changing the algorithm from AVX to something else might help. But I don't expect it to completely solve your problem of doubled latency with two processes.
OK, I'll measure the BW as well and post it later.