Considerations on project size (atom count) and AMD (GCN) performance
Recently I have been analyzing the low performance of AMD GPUs (GCN) on small projects and looking for improvements. I am now considering a possible reason for the low performance observed on simulations with small atom counts. If confirmed, it could help project owners decide systematically which projects will perform reasonably on AMD GCN-type GPUs and which will not, and thus which projects to assign to which GPUs.
AMD GCN has a wavefront size of 64 threads. However, these 64 threads are not executed in parallel across a full compute unit (CU). They run on one SIMD unit (out of 4 in a CU), serialized into four consecutive 16-thread parts, so a wavefront takes 4 cycles. A CU has 4 SIMD units, and the Radeon VII (my case) has 60 CUs. Consequently, occupying the GPU completely takes not the 3840 threads that might seem obvious (wavefront size 64 * 60 CUs) but 15360 threads (wavefront size 64 * (4 SIMDs/CU) * 60 CUs), and it takes 4 cycles. This makes the architecture very wide, wider even than the NVidia RTX3090, but that wide thread count is executed in four cycles rather than one (effectively 3840 threads per cycle).
If my conclusion is correct, it would explain why AMD GCN devices are much more sensitive to small projects than even the widest NVidia GPUs. I have to admit that I don't really know how NVidia handles its warps, but as far as I understand, they are executed in parallel (not partially serialized as on AMD GCN).
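The width arithmetic in this paragraph can be written out as a small sketch. The function names are illustrative only, and the hardware figures (64-thread wavefronts, 16-lane SIMDs, 4 SIMDs per CU, 60 CUs) are the ones stated in the post.

```python
def threads_to_fill(wavefront_size, simds_per_cu, compute_units):
    """Threads needed so that every SIMD unit holds one wavefront."""
    return wavefront_size * simds_per_cu * compute_units


def lanes_per_cycle(simd_lanes, simds_per_cu, compute_units):
    """Threads that actually advance in a single clock cycle."""
    return simd_lanes * simds_per_cu * compute_units


# Radeon VII figures as given in the post.
radeon_vii_threads = threads_to_fill(64, 4, 60)
radeon_vii_lanes = lanes_per_cycle(16, 4, 60)

print(radeon_vii_threads, radeon_vii_lanes)
```

The ratio of the two numbers is the 4 cycles a wavefront needs on a 16-lane SIMD.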
If we correlate the above thoughts with an excerpt of kernel call characteristics for a sample project 16921(79,24,68)
Executing Kernel computeNonbonded, workUnits 61440, blockSize 64, size 61440
Executing Kernel reduceForces, workUnits 21024, blockSize 128, size 21120
Executing Kernel integrateLangevinPart1, workUnits 21016, blockSize 64, size 21056
Executing Kernel applySettleToPositions, workUnits 6810, blockSize 64, size 6848
Executing Kernel applyShakeToPositions, workUnits 173, blockSize 64, size 192
Executing Kernel integrateLangevinPart2, workUnits 21016, blockSize 64, size 21056
Executing Kernel clearFourBuffers, workUnits 320760, blockSize 128, size 122880
Executing Kernel findBlockBounds, workUnits 21016, blockSize 64, size 21056
Executing Kernel sortShortList, workUnits 256, blockSize 256, size 256
Executing Kernel sortBoxData, workUnits 21016, blockSize 64, size 21056
Executing Kernel findBlocksWithInteractions, workUnits 21016, blockSize 256, size 21248
Executing Kernel updateBsplines, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeRange, workUnits 256, blockSize 256, size 256
Executing Kernel assignElementsToBuckets, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeBucketPositions, workUnits 256, blockSize 256, size 256
Executing Kernel copyDataToBuckets, workUnits 21016, blockSize 64, size 21056
Executing Kernel sortBuckets, workUnits 21120, blockSize 128, size 21120
Executing Kernel gridSpreadCharge, workUnits 21016, blockSize 64, size 21056
Executing Kernel finishSpreadCharge, workUnits 157464, blockSize 64, size 61440
Executing Kernel packForwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel unpackForwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel gridEvaluateEnergy, workUnits 157464, blockSize 64, size 61440
Executing Kernel reciprocalConvolution, workUnits 157464, blockSize 64, size 61440
Executing Kernel packBackwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel unpackBackwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel gridInterpolateForce, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeBondedForces, workUnits 23455, blockSize 64, size 23488
we see that many kernels of that project run with 21016 threads (the project is specified with 21000 atoms in the project list). Assuming that different kernels are not executed in parallel, the first 15360 threads will fully occupy the GPU, but the remaining 5656 threads will use only about 1/3 of its capacity.
Other projects, like RUN9 of the benchmark projects (with 4071 atoms), will load the Radeon VII only at about 25% for many of the important kernels.
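A minimal sketch of this estimate, assuming the simplified model used in the post: one wavefront per SIMD at a time, kernels not overlapping, and no latency hiding. The function name and the returned fields are illustrative.

```python
def wave_utilization(threads, width=15360):
    """Split a launch into full-width passes plus a partially filled tail."""
    full, rest = divmod(threads, width)
    passes = full + (1 if rest else 0)
    return {
        "passes": passes,
        # Fill level of the last pass (1.0 if the launch is an exact multiple).
        "tail_fill": rest / width if rest else 1.0,
        # Average fill over all passes.
        "average_fill": threads / (passes * width),
    }


for n in (21016, 4071):
    u = wave_utilization(n)
    print(n, u["passes"], round(u["tail_fill"], 3), round(u["average_fill"], 3))
```

For 21016 threads the tail pass holds 5656 threads, which is the "about 1/3" figure; for 4071 threads the single pass is roughly a quarter full.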
Conversely, in large projects the inefficiently loaded part makes up only a small share of the work, and that is where GCN devices generally show good performance.
In conclusion, projects with slightly fewer atoms than x * the effective thread width (15360 for Radeon VII, gfx906) should run well, while projects slightly above will run rather poorly (especially if this is not masked by a very high atom count). So a project with 15000 atoms will probably run well, and a project with 16000 atoms rather poorly. For optimizing GPU use, that relation should probably be considered in addition to a general classification of a project to a group of GPUs.
A critical cross-check of the above thoughts is welcome!
Can you test it and see? You can use the script in openmm#2875 (comment) to generate and time simulations with any number of atoms you want. Do you find a large change between 15,000 and 16,000?
As a conclusion, to occupy the GPU completely, not 3840 threads are required, as that would seem obvious (wavefront size 64 * 60 CUs), but 15360 threads (wavefront size 64 * (4 SIMDs/CU) * 60 CUs) and needing 4 cycles.
Ideally you want even more than that. GPUs rely on having lots of threads that they can switch between to cover latency. If your number of threads exactly matches the width of the GPU, each compute unit will be busy until it needs to load something from global memory, and then it will stall for a few hundred clock cycles while waiting for the data to arrive. You try to cover this by having other threads it can switch to while the first thread is stuck waiting for data. When possible, we set the number of workgroups to a multiple of the number of compute units:
In the case above, though, we cap it at a lower value because the number of atoms is so low. There just isn't enough work to fill that many workgroups.
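The capping described here can be sketched as follows. This is not OpenMM's actual code: the rule and the value of 16 workgroups per compute unit are inferred from the kernel log earlier in the thread (61440 / 64 / 60 = 16), and the function name is hypothetical.

```python
def launch_size(work_units, block_size, compute_units=60, groups_per_cu=16):
    """Global size: enough workgroups for the work, capped at a multiple of the CU count."""
    needed = -(-work_units // block_size)   # ceil(work_units / block_size)
    cap = compute_units * groups_per_cu     # assumed cap, inferred from the log
    return min(needed, cap) * block_size


# Compare with the "size" column of a few lines from the log.
print(launch_size(21016, 64))     # integrateLangevinPart1
print(launch_size(157464, 64))    # finishSpreadCharge
print(launch_size(320760, 128))   # clearFourBuffers
```

Under this assumption the small per-atom kernels end up below the cap, which matches the remark that there is not enough work to fill that many workgroups.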
Efficiently using a high end GPU to simulate a small system is always going to be challenging (and beyond a certain point, impossible). It's best if we can target the small projects to low end GPUs, keeping the high end GPUs free for the large systems that make good use of them.
Can you test it and see? You can use the script in openmm#2875 (comment) to generate and time simulations with any number of atoms you want. Do you find a large change between 15,000 and 16,000?
After writing that, it occurred to me that I could check it against the figures from our test runs. Interestingly, they show something different: from 15000 to 16000 the overall time actually dropped. I am trying to analyze that with traces from the runs but don't understand it yet.
Ideally you want even more than that. GPUs rely on having lots of threads that they can switch between to cover latency. If your number of threads exactly matches the width of the GPU, each compute unit will be busy until it needs to load something from global memory, and then it will stall for a few hundred clock cycles while waiting for the data to arrive. You try to cover this by having other threads it can switch to while the first thread is stuck waiting for data. When possible, we set the number of workgroups to a multiple of the number of compute units:
I have just studied the thread counts you use for the different kernels and wondered why the nonbonded calculation has a fixed size. I understand that you chose 4*15360 as the fixed size. The reduced time for 16000 compared to 15000 might actually result from the fact that, with only single occupancy, the increase to 16000 somewhat reduces the starving. At higher multiples of 15360 that effect does not show; in that table we have a serious increase in time when we exceed triple occupancy:
45792 | 1431 | 5.13 | 4.98
50442 | 1577 | 6.13 | 5.99
So the idea was probably right in general, but I did not consider the "starved" issue. Exceeding 15360 probably comes at no cost at all at first, since the additional work is done while the GPU would be waiting anyhow. At higher multiples, when the GPU would not lose a lot of cycles waiting, the step beyond a multiple of the effective thread size really adds time and reduces efficiency.
Efficiently using a high end GPU to simulate a small system is always going to be challenging (and beyond a certain point, impossible). It's best if we can target the small projects to low end GPUs, keeping the high end GPUs free for the large systems that make good use of them.
Agreed, but I see many small projects (like the example with 21000 atoms) assigned to my GPUs, delivering low performance. And with GPUs getting larger and larger, F@H will probably not have enough small GPUs for some projects in the future. That will then be the time to run simulations in parallel.
Sorry if, as a newbie, I come up with ideas that you have long known to be nonsense. But sometimes it still seems helpful to keep one's eyes open and contribute ideas, so let's just discard those that are nonsense.
I'm just trying to understand the traces and why, according to them, the GPU is idle about 1/3 of the time, with kernels active only 2/3 of the time. At first I assumed it was kernel scheduling latency, the idle time after one kernel finishes until the next is started. However, if I look at the time CompleteNS - DispatchNS of the next kernel, it seems that a number of kernels are dispatched ahead, before the previous one completes, since this difference is negative (that's the column before StartLat). Still there are (in most cases) about 4us between end of one kernel and start of the next: column Idle is the difference between BeginNS and EndNS of the kernel before.
Or are kernels dispatched in advance by the application, but the runtime system (maybe in the Linux kernel driver) waits until a kernel has finished before it executes the next one, which had been scheduled earlier by the application?
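The Idle column described here can be reproduced from a trace roughly as follows. This is a sketch only: the field names BeginNS and EndNS are the ones mentioned in the post, the sample timestamps are invented for illustration and are not measurements, and kernels are assumed not to overlap.

```python
def idle_gaps(kernels):
    """Gap between the end of each kernel and the begin of the next, in ns."""
    ordered = sorted(kernels, key=lambda k: k["BeginNS"])
    return [nxt["BeginNS"] - prev["EndNS"] for prev, nxt in zip(ordered, ordered[1:])]


def busy_fraction(kernels):
    """Share of the traced time span during which a kernel was running."""
    ordered = sorted(kernels, key=lambda k: k["BeginNS"])
    busy = sum(k["EndNS"] - k["BeginNS"] for k in ordered)
    span = ordered[-1]["EndNS"] - ordered[0]["BeginNS"]
    return busy / span


# Invented sample: three 8 us kernels separated by 4 us gaps.
sample = [
    {"BeginNS": 0, "EndNS": 8_000},
    {"BeginNS": 12_000, "EndNS": 20_000},
    {"BeginNS": 24_000, "EndNS": 32_000},
]
print(idle_gaps(sample), busy_fraction(sample))
```

With gaps of a fixed size, the busy fraction drops as kernels get shorter, which is why short kernels on small systems are hit hardest.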
Exceeding 15360 probably comes at no cost at all at first, since the additional work is done while the GPU would be waiting anyhow. At higher multiples, when the GPU would not lose a lot of cycles waiting, the step beyond a multiple of the effective thread size really adds time and reduces efficiency.
That sounds right.
Still there are (in most cases) about 4us between end of one kernel and start of the next
If so, they've improved! AMD used to have an overhead of about 10 us for each kernel. That may still be the case on Windows, which has higher overhead than Linux. This is why we try to minimize the number of separate kernel launches.
Agreed, but I see many small projects (like the example with 21000 atoms) assigned to my GPUs, delivering low performance.
Yes, we definitely need to do something about that!
BTW, if it is of interest to you, here are stats for runs with width 5.25 (longer runtime), the run with atom count just below the limit, and with width 5.5 (shorter runtime), just above the limit. The difference lies mostly in computeNonbonded, which runs with fixed thread size.
This is the test with higher granularity. In general it shows a disproportionate increase in time when the atom count crosses a multiple of the maximum thread count of the GPU. However, the increase is lower than expected and there is an irregularity at the first step, where it would be most critical. That irregularity seems to come from computeNonbonded, which has a fixed thread size and as such does not look related to that point. In general there is some disturbance/noise in the measured values, for example at atom count 113958, so the time deltas are not "smooth".
Atoms | Atoms / 32 (rounded up) | Time (s) | dTime (s) |
12255 | 383 | 2,91248488426208 | |
12990 | 406 | 2,88024377822876 | -0,032241106033325 |
13836 | 433 | 3,05391716957092 | 0,173673391342163 |
14514 | 454 | 3,02331209182739 | -0,03060507774353 |
15477 | 484 | 2,71697306632996 | -0,306339025497437 |
16272 | 509 | 2,79023265838623 | 0,073259592056275 |
17178 | 537 | 2,82571578025818 | 0,035483121871948 |
18138 | 567 | 2,97455453872681 | 0,148838758468628 |
19026 | 595 | 3,0194935798645 | 0,044939041137695 |
20340 | 636 | 3,12287950515747 | 0,103385925292969 |
21384 | 669 | 3,20537495613098 | 0,082495450973511 |
21480 | 672 | 3,13328719139099 | -0,07208776473999 |
23568 | 737 | 3,30896997451782 | 0,175682783126831 |
24498 | 766 | 3,38099765777588 | 0,072027683258057 |
25572 | 800 | 3,47666311264038 | 0,095665454864502 |
26964 | 843 | 3,50756669044495 | 0,030903577804566 |
28179 | 881 | 3,61467909812927 | 0,107112407684326 |
29424 | 920 | 3,70178055763245 | 0,087101459503174 |
30789 | 963 | 3,94584131240845 | 0,244060754776001 |
32325 | 1011 | 4,00474500656128 | 0,058903694152832 |
33840 | 1058 | 4,01952767372131 | 0,014782667160034 |
35211 | 1101 | 4,11354827880859 | 0,09402060508728 |
36708 | 1148 | 4,28623414039612 | 0,172685861587524 |
38217 | 1195 | 4,39107799530029 | 0,104843854904175 |
39825 | 1245 | 4,42315292358398 | 0,032074928283691 |
41358 | 1293 | 4,66109704971314 | 0,23794412612915 |
43305 | 1354 | 4,88815975189209 | 0,227062702178955 |
44958 | 1405 | 4,89233231544495 | 0,004172563552856 |
46629 | 1458 | 5,56133365631104 | 0,669001340866089 |
48441 | 1514 | 5,58660101890564 | 0,025267362594605 |
50415 | 1576 | 5,90663361549377 | 0,320032596588135 |
52341 | 1636 | 6,04964709281921 | 0,143013477325439 |
54444 | 1702 | 6,23829650878906 | 0,188649415969849 |
56142 | 1755 | 6,32623052597046 | 0,087934017181397 |
58515 | 1829 | 6,49991869926453 | 0,173688173294067 |
60414 | 1888 | 6,56362867355347 | 0,063709974288941 |
62562 | 1956 | 6,86830520629883 | 0,304676532745361 |
64830 | 2026 | 6,90527391433716 | 0,03696870803833 |
66906 | 2091 | 7,02717351913452 | 0,121899604797363 |
69921 | 2186 | 7,35353469848633 | 0,326361179351807 |
72279 | 2259 | 7,55564570426941 | 0,202111005783081 |
72495 | 2266 | 7,51527237892151 | -0,0403733253479 |
77148 | 2411 | 7,8772234916687 | 0,361951112747192 |
79191 | 2475 | 7,98567771911621 | 0,10845422744751 |
81540 | 2549 | 8,17700910568237 | 0,191331386566162 |
84534 | 2642 | 8,2921895980835 | 0,115180492401123 |
87087 | 2722 | 8,45307517051697 | 0,160885572433472 |
89712 | 2804 | 8,62603664398193 | 0,172961473464966 |
92592 | 2894 | 9,14751052856445 | 0,52147388458252 |
95766 | 2993 | 9,43603682518005 | 0,288526296615601 |
98880 | 3090 | 9,76397943496704 | 0,327942609786987 |
101697 | 3179 | 10,0979392528534 | 0,333959817886353 |
104721 | 3273 | 10,1487267017365 | 0,050787448883057 |
107736 | 3367 | 10,4614281654358 | 0,312701463699341 |
110937 | 3467 | 10,7668445110321 | 0,305416345596313 |
113958 | 3562 | 11,3086459636688 | 0,541801452636719 |
117774 | 3681 | 11,4619560241699 | 0,153310060501099 |
120948 | 3780 | 11,6951687335968 | 0,23321270942688 |
124188 | 3881 | 12,1348485946655 | 0,439679861068726 |
127650 | 3990 | 12,4355757236481 | 0,300727128982544 |
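One way to check the table against the hypothesis is to flag the steps at which the atom count crosses a multiple of 15360 and look at the time delta there. The sketch below uses a handful of consecutive rows copied from the table (first and third columns, with the decimal commas kept) and is only an illustration of the check, not a full analysis.

```python
WIDTH = 15360  # effective thread width of the Radeon VII, from the discussion

ROWS = """\
41358 | 1293 | 4,66109704971314
43305 | 1354 | 4,88815975189209
44958 | 1405 | 4,89233231544495
46629 | 1458 | 5,56133365631104
48441 | 1514 | 5,58660101890564
50415 | 1576 | 5,90663361549377
"""


def parse_rows(text):
    """Return (atoms, seconds) pairs; times use a decimal comma."""
    rows = []
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.split("|")]
        rows.append((int(cells[0]), float(cells[2].replace(",", "."))))
    return rows


def crossings(rows, width=WIDTH):
    """Steps between consecutive rows that pass a multiple of width."""
    found = []
    for (a0, t0), (a1, t1) in zip(rows, rows[1:]):
        if a0 // width != a1 // width:
            found.append((a0, a1, a1 // width, t1 - t0))
    return found


for a0, a1, multiple, dt in crossings(parse_rows(ROWS)):
    print(f"{a0} -> {a1} crosses {multiple} x {WIDTH}: dTime = {dt:.3f} s")
```

In this excerpt the only crossing is the step past 3 x 15360, which is also the largest delta among these rows.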
Could an out-of-order command queue with explicit synchronization avoid the gap between the execution of subsequent kernels? Is that worth experimenting with? If the gaps could be avoided, then for short-running kernels about 50% (or more) additional throughput could theoretically be gained, with execution time cut by 1/3 to 1/2. Or am I missing something?
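The relation between idle share, time saved, and throughput gained can be spelled out with a short sketch. It is an upper bound that assumes every gap could be removed completely, and the function name is illustrative.

```python
def gain_if_gaps_removed(busy_fraction):
    """Upper bound on the gain if all idle time between kernels disappeared."""
    time_saved = 1.0 - busy_fraction   # share of wall time that would vanish
    speedup = 1.0 / busy_fraction      # throughput relative to today
    return time_saved, speedup


# Kernels active 2/3 of the time: time cut by 1/3, throughput up by 50%.
print(gain_if_gaps_removed(2 / 3))
# Kernels active 1/2 of the time: time cut by 1/2, throughput doubled.
print(gain_if_gaps_removed(1 / 2))
```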
Could an out-of-order command queue with explicit synchronization avoid the gap between the execution of subsequent kernels?
The gap reflects the work the driver and GPU have to do to set up a new kernel. It's independent of any synchronization between kernels.
CUDA 10 introduced a new mechanism that can precompute some of that work to reduce the overhead. See Some day when we move to CUDA 10 as our minimum supported version I hope to be able to make use of it. But there's no similar mechanism in OpenCL.
According to the documentation, with out-of-order execution kernels should run in parallel, so the next kernel might already be set up and started while the previous one is still running, thus closing/avoiding that gap. But I tried it, and on my platform it does not seem to work: kernels are not started in parallel, but one after the other :(.
@ThWuensche Your trace has a significant amount of time in findBlocksWithInteractions. Can you try changing 32 to 64 in this line to check whether performance is better or worse?
#if SIMD_WIDTH <= 32
That makes it spend even more time there. More puzzling, the time in computeNonbonded is also increasing. How could that be related to the block size in another kernel, which should not influence the data? It is probably worth analyzing; I will see whether I can understand what's going on. Does that buffer size influence the actual data, for example more data to be computed in computeNonbonded?
Also, a lot in that kernel is hard-coded to "32", including things that depend on local_index. That probably puts two "warps" into one "wavefront".
That makes it spend even more time there. More puzzling, the time in computeNonbonded is also increasing. How could that be related to the block size in another kernel, which should not influence the data? It is probably worth analyzing; I will see whether I can understand what's going on. Does that buffer size influence the actual data, for example more data to be computed in computeNonbonded?
Changing the if makes GCN use the same code path as NVIDIA and Navi. I wanted to confirm that the comments around L277 are still accurate for wide GCN GPUs.
Buffer size might be optimized to fit within local memory for NVIDIA, which is larger than GCN.
AMD Navi: 64KB
Just saw that when I looked into the code. It's not only the size of that buffer; it selects a completely different implementation.
But why does that have influence on computeNonbonded?
Regarding local memory: the LDS (local data share) on my version of GCN should be 64KB; I don't know whether it was less on older GCN versions.
The other question is why these calculations (findBlocksWithInteractions and computeNonbonded) take more time for lower atom counts than for higher atom counts.
But why does that have influence on computeNonbonded?
I don't know enough about what's happening internally to know whether the kernels are stalling each other or the results of the two algorithms are slightly different because of things like repeated multiplications or partial sums.
Regarding local memory: the LDS (local data share) on my version of GCN should be 64KB; I don't know whether it was less on older GCN versions.
CompuBench reports 32KB for Radeon VII. RX 460 (GCN 4.0) and Spectre R7 Graphics APU (GCN 2.0) have 32KB as well.
The other question is why these calculations (findBlocksWithInteractions and computeNonbonded) take more time for lower atom counts than for higher atom counts.
As a percentage of total execution time or average wall duration per call?
CompuBench reports 32KB for Radeon VII. RX 460 (GCN 4.0) and Spectre R7 Graphics APU (GCN 2.0) have 32KB as well.
I think they got something wrong. My Radeon VII is gfx906; this is part of the output of clinfo:
Local memory size: 65536
Name: gfx906
Vendor: Advanced Micro Devices, Inc.
As a percentage of total execution time or average wall duration per call?
Both, I think. The simulated system has been created by a script from Peter Eastman in that issue. The traces above are from two runs of that script with atom count and run time according to lines 2 and 3 in the table, the run with the bigger atom count has about 8% lower runtime, correlating with the lower time for these two kernels in the traces.
I think they got something wrong. My Radeon VII is gfx906; this is part of the output of clinfo:
Local memory size: 65536
Interesting, is that the value for CL_DEVICE_LOCAL_MEM_SIZE?
RX 460:
Local memory size 32768 (32KiB)
Local memory syze per CU (AMD) 65536 (64KiB)
This guide mentions LDS is 64KB with a maximum allocation of 32KB per workgroup.
Found this tidbit about Vega LDS size:
[C]urrently Vega has 64 KB local/shared memory enabled on Linux, but 32 KB on Windows.
I meant LDS, and LDS is per CU. I was not aware of the limitation of 32KB per workgroup.
I just tried it. I can define a buffer:
__local uint buffer[8192+4096];
and the compiler does not complain.
for (int j = 8192+4096-1; j >= 0; j--)
    buffer[j] = (uint)j;
out[0] = buffer[8192+4096-1];
out[0] comes up with 0x2fff, so I guess I have more than 32KB.
That is with ROCm 3.7 on Linux.
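The size of that test buffer can be checked with a few lines of arithmetic. The only assumption is that a uint is 4 bytes, as it is in OpenCL C.

```python
UINT_BYTES = 4                 # size of an OpenCL C uint
elements = 8192 + 4096         # element count of the __local buffer in the test

buffer_bytes = elements * UINT_BYTES
last_index = elements - 1      # value written to and read back from the last slot

print(buffer_bytes, buffer_bytes // 1024, hex(last_index))
```

The buffer is 48 KiB, above the 32 KiB per-workgroup limit mentioned earlier and below the 64 KiB that clinfo reports, and the last index matches the 0x2fff read back in out[0].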
Changing the if makes GCN use the same code path as NVIDIA and Navi. I wanted to confirm that the comments around L277 are still accurate for wide GCN GPUs.
They're still correct. On a GPU with a SIMD width of 64, that kernel has severe thread divergence.
More puzzling, the time in computeNonbonded is also increasing.
I don't know of any reason that would be affected.
I'm just reading computeNonbonded and thinking about thread divergence. There is the distinction between tiles on diagonal and off diagonal. If on a wavefront size of 64 two tiles are in one wavefront and one is on diagonal, one off diagonal, the wavefront probably would see rather large thread divergence, is that correct?
I don't know of any reason that would be affected.
The traces above seem to indicate an effect on the compute time of computeNonbonded depending on that selection. To avoid drawing false conclusions, I will recheck.
If on a wavefront size of 64 two tiles are in one wavefront and one is on diagonal, one off diagonal, the wavefront probably would see rather large thread divergence, is that correct?
Correct. That's why we sort the tiles differently on GPUs with SIMD width 64, to put all the diagonal ones together and all the off diagonal ones together.
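A toy model of why the sort helps, assuming 32-thread tiles so that a 64-wide wavefront processes two tiles at once. The tile list is invented and the sort is a plain Python sort, not OpenMM's implementation.

```python
def mixed_wavefronts(tiles, tiles_per_wavefront=2):
    """Count wavefronts that hold both a diagonal and an off-diagonal tile."""
    count = 0
    for i in range(0, len(tiles), tiles_per_wavefront):
        group = tiles[i:i + tiles_per_wavefront]
        if len(set(group)) > 1:
            count += 1
    return count


# Invented tile order: "diag" = on-diagonal tile, "off" = off-diagonal tile.
tiles = ["diag", "off", "off", "diag", "off", "off", "diag", "off"]

# Put all diagonal tiles first, then all off-diagonal ones (stable sort).
grouped = sorted(tiles, key=lambda t: t != "diag")

print(mixed_wavefronts(tiles), mixed_wavefronts(grouped))
```

After grouping, at most one wavefront (the one at the boundary between the two groups) mixes both tile kinds and takes both code paths.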
I don't know of any reason that would be affected.
The traces above seem to indicate an effect on the compute time of computeNonbonded depending on that selection. To avoid drawing false conclusions, I will recheck.
Confirmed that it does make a difference. First trace is with standard findInteraction for AMD, second with algorithm for other platforms. Execution time of computeNonbonded gets significantly longer.
Correct. That's why we sort the tiles differently on GPUs with SIMD width 64, to put all the diagonal ones together and all the off diagonal ones together.
Thanks for the explanation. I had noticed that difference in the sort, but did not understand its meaning/consequences.
About the results from the test script: what concerns me is that from width 5.25 to width 5.5 the execution time drops by 0.2s, even though the system size increases. That difference comes just from these two kernels. Together with the effect above, that leaves a question mark for me.
Try modifying the line in the script that creates the context to instead be
context = Context(system, integrator, Platform.getPlatformByName('OpenCL'), {'DisablePmeStream':'true'})
After you do that, do you still see a difference in the speed of the nonbonded kernel?
Difference remains:
I will try to include trace markers to find where it happens. Edit: Markers in roctracer seem to relate to host code, not kernel code. So that approach probably does not work; I don't know how to extract the runtime of parts of kernel code, such as the runtime of a loop.