gdrcopy Inconsistent gdr read latency

I have a memory allocator based on gdr that initially allocates a large chunk of gdr memory (e.g. 16 MB) and then serves subsequent memory requests from pieces of this chunk. During performance benchmarking, I noticed that the read latency for the same memory size fluctuates significantly, and I can't understand why. For example, if I allocate 3 KB of memory, read it 100 times, and repeat this over and over, the average read time alternates between 4.5 us and 70 us (i.e. 4.5 -> 70 -> 4.5 -> 70 ...), even though the same piece of memory is allocated for every batch of 100 reads.
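For readers who want to picture the setup, here is a minimal sketch of a chunk-based (bump) allocator of the kind described above. It is not the allocator from this post, whose code is not shown; the names `arena_t`, `arena_init`, `arena_alloc`, and `arena_reset` are hypothetical, and the chunk is treated as plain memory rather than a gdr mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump allocator over one pre-allocated chunk. */
typedef struct {
    uint8_t *base; /* start of the chunk (assumed to be mapped already) */
    size_t   cap;  /* chunk size in bytes */
    size_t   used; /* bytes handed out so far, including padding */
} arena_t;

static void arena_init(arena_t *a, void *base, size_t cap) {
    a->base = (uint8_t *)base;
    a->cap = cap;
    a->used = 0;
}

/* Returns `size` bytes whose address is a multiple of `align`
 * (a power of two), or NULL if the chunk cannot fit the request. */
static void *arena_alloc(arena_t *a, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t base = (uintptr_t)a->base;
    uintptr_t cur = base + a->used;
    uintptr_t aligned = (cur + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t offset = (size_t)(aligned - base);
    if (offset > a->cap || size > a->cap - offset)
        return NULL;
    a->used = offset + size;
    return a->base + offset;
}

/* Releases everything at once; the next request reuses the same address. */
static void arena_reset(arena_t *a) {
    a->used = 0;
}
```

With such a design, resetting the arena between batches and requesting 3 KB again returns the same address each time, which matches the "same piece of memory" observation above.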

Here are some details regarding my settings:

I have set CU_POINTER_ATTRIBUTE_SYNC_MEMOPS to 1 for the entire chunk using cuPointerSetAttribute.
The API reports "using SSE4_1 implementation of gdr_copy_from_bar" for read operations.
The data is read into page-locked host memory allocated using cudaMallocHost.
Both the gdr-mapped source pointer and the destination pointer are 128-bit aligned.
The write latency is quite consistent.
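The measurement described above (batches of 100 reads of 3 KB, one average per batch) can be sketched roughly as follows. This is only an illustration under stated assumptions: `read_fn`, `host_read`, `now_us`, `avg_read_us`, and `run_batches` are hypothetical names, and `host_read` is a plain `memcpy` stand-in so that the sketch runs without a GPU. In the real benchmark, the callback would wrap the gdrcopy read call.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical callback type: copies `size` bytes from src to dst. */
typedef void (*read_fn)(void *dst, const void *src, size_t size);

/* Stand-in reader (assumption): host-to-host memcpy instead of a BAR read. */
static void host_read(void *dst, const void *src, size_t size) {
    memcpy(dst, src, size);
}

/* Monotonic wall-clock time in microseconds. */
static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Average latency (us) of `iters` back-to-back reads of `size` bytes. */
static double avg_read_us(read_fn rd, void *dst, const void *src,
                          size_t size, int iters) {
    assert(iters > 0);
    double start = now_us();
    for (int i = 0; i < iters; ++i)
        rd(dst, src, size);
    return (now_us() - start) / iters;
}

/* Runs `batches` batches and stores each batch's average in out[]. */
static void run_batches(read_fn rd, void *dst, const void *src, size_t size,
                        int iters, int batches, double *out) {
    for (int b = 0; b < batches; ++b)
        out[b] = avg_read_us(rd, dst, src, size, iters);
}
```

Printing `out[]` for consecutive batches makes an alternating pattern such as 4.5 -> 70 -> 4.5 -> 70 easy to spot. Recording per-read times instead of batch averages would additionally show whether a few slow reads or uniformly slow reads drive the high averages.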

Jan 18 '23 16:01 arianmag

Hi @anaanimous,

CPU and GPU clocks are usually, though not always, the main cause of performance fluctuation. Can you try the items below and rerun your test?

Fix the CPU clock, or at least set your CPU frequency governor to "performance": sudo cpupower frequency-set -g performance
Please also set the GPU clocks to max.

# To view the max clock values of GPU 0
$ nvidia-smi -i 0 -q
...
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1593 MHz
        Video                             : 1290 MHz
...

# To set the application clocks of GPU 0 to max (format: -ac <memory clock>,<graphics clock>)
$ sudo nvidia-smi -i 0 -ac 1593,1410

Jan 19 '23 02:01 pakmarkthub

Thank you for the quick response.

I have been running the benchmark on an AWS instance. However, today after running the same program on a local server, the performance has been stable. I don't know what's causing the fluctuation on AWS. It may be due to frequency scaling but I doubt it. Here is why:

I have a benchmark program that measures the read latency for different sizes (1, 2, 4, ..., 1 MB), similar to the copylat program, except that I use my allocator and its APIs to allocate memory and perform the reads. When I run this program on a local server, the performance closely matches that of copylat. But on AWS, the read latency for sizes of 512 bytes and above suddenly increases significantly (e.g. the latency of reading 512 bytes goes from 1.5 us to 12 us). Strangely enough, if I skip only the one-byte read (i.e. if I run the test for 2, 4, ..., 1 MB instead of 1, 2, ..., 1 MB), the numbers match the copylat output.
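To make the two test variants concrete, here is a small sketch of how the size lists could be generated. The helper name `pow2_sizes` is hypothetical and is not taken from copylat or from my benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Fills out[] with powers of two from `first` up to `last` inclusive
 * (at most `cap` entries) and returns how many were written.
 * `first` must itself be a power of two. */
static size_t pow2_sizes(size_t first, size_t last, size_t *out, size_t cap) {
    assert(first != 0 && (first & (first - 1)) == 0);
    size_t n = 0;
    for (size_t s = first; s <= last && n < cap; s *= 2)
        out[n++] = s;
    return n;
}
```

Calling it with `first = 1` and `last = 1 << 20` yields the full sweep, while `first = 2` yields the variant that skips the one-byte read. Running both lists through the same measurement loop in the same process would help separate the effect of the one-byte read from run-to-run noise.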

Jan 19 '23 15:01 arianmag

Let's split this into two topics. The first one is the performance fluctuation, which seems to be resolved now. Depending on how your instance is allocated, you might be sharing the host with other instances. I cannot say much about performance predictability if you are not in full control of the entire system, because many external factors can affect the performance.

The second topic is the read latency jumping to 12 us when reading 512 bytes. Can you share the code? I will try to reproduce this behavior on our system.

Jan 20 '23 01:01 pakmarkthub