Considerations on project size (atom count) and AMD (GCN) performance

[Open] ThWuensche opened this issue 4 years ago • 48 comments

Recently I have been trying to analyze the low performance of AMD GPUs (GCN) on small projects and to find improvements. I'm now considering a possible reason for the observed low performance on simulations with small atom counts. That observation - if confirmed - could help project owners decide systematically which projects will have reasonable performance on AMD GCN-type GPUs and which will not, and thus which projects to assign to which GPUs.

AMD GCN has a wavefront size of 64 threads; however, these 64 threads are not executed on a full compute unit (CU) in parallel, but on one SIMD unit (out of 4 in a CU), serialized into four consecutive 16-thread parts, so a wavefront takes 4 cycles. Each CU has 4 SIMD units, and the Radeon VII (my case) has 60 CUs. As a conclusion, to occupy the GPU completely, not 3840 threads are required, as might seem obvious (wavefront size 64 * 60 CUs), but 15360 threads (wavefront size 64 * 4 SIMDs/CU * 60 CUs), taking 4 cycles. That construction makes the architecture very wide, wider than even the NVidia RTX3090, but it executes that wide thread count not in one cycle but in four (effectively 3840 threads per cycle).

If my conclusion is correct, that would explain why AMD GCN devices are much more sensitive to small projects than even the widest NVidia GPUs. I have to admit that I don't really know how NVidia handles its warps, but as far as I understand, they are executed in parallel (not partially serialized like on AMD GCN).

If we correlate the above thoughts with an excerpt of kernel call characteristics for a sample project 16921(79,24,68)

Executing Kernel computeNonbonded, workUnits 61440, blockSize 64, size 61440
Executing Kernel reduceForces, workUnits 21024, blockSize 128, size 21120
Executing Kernel integrateLangevinPart1, workUnits 21016, blockSize 64, size 21056
Executing Kernel applySettleToPositions, workUnits 6810, blockSize 64, size 6848
Executing Kernel applyShakeToPositions, workUnits 173, blockSize 64, size 192
Executing Kernel integrateLangevinPart2, workUnits 21016, blockSize 64, size 21056
Executing Kernel clearFourBuffers, workUnits 320760, blockSize 128, size 122880
Executing Kernel findBlockBounds, workUnits 21016, blockSize 64, size 21056
Executing Kernel sortShortList, workUnits 256, blockSize 256, size 256
Executing Kernel sortBoxData, workUnits 21016, blockSize 64, size 21056
Executing Kernel findBlocksWithInteractions, workUnits 21016, blockSize 256, size 21248
Executing Kernel updateBsplines, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeRange, workUnits 256, blockSize 256, size 256
Executing Kernel assignElementsToBuckets, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeBucketPositions, workUnits 256, blockSize 256, size 256
Executing Kernel copyDataToBuckets, workUnits 21016, blockSize 64, size 21056
Executing Kernel sortBuckets, workUnits 21120, blockSize 128, size 21120
Executing Kernel gridSpreadCharge, workUnits 21016, blockSize 64, size 21056
Executing Kernel finishSpreadCharge, workUnits 157464, blockSize 64, size 61440
Executing Kernel packForwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel unpackForwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel gridEvaluateEnergy, workUnits 157464, blockSize 64, size 61440
Executing Kernel reciprocalConvolution, workUnits 157464, blockSize 64, size 61440
Executing Kernel packBackwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel unpackBackwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel gridInterpolateForce, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeBondedForces, workUnits 23455, blockSize 64, size 23488

we see that many kernels of that project run with 21016 threads (the project is specified with 21000 atoms in the project list). Assuming that different kernels are not executed in parallel, that means the first 15360 threads will fully occupy the GPU, while the remaining 5656 threads will occupy only about 1/3 of its capacity.

Other projects, like RUN9 of the benchmark projects (with 4071 atoms), will load the Radeon VII only at about 25% for many of the important kernels.

In contrast, for large projects the inefficiently loaded part makes up only a small share of the total work, which is why GCN devices generally show good performance there.

As a conclusion, projects with slightly less than x * the effective thread width (15360 for the Radeon VII, gfx906) should run well, while projects slightly above it will run rather poorly (especially if the excess is not hidden by a very high overall atom count). So a project with 15000 atoms probably will run well, a project with 16000 atoms rather poorly. For GPU utilization that relation should probably be considered in addition to a general classification of a project to a group of GPUs.
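
To make that arithmetic concrete, here is a small back-of-the-envelope sketch (just an illustration in C, using the gfx906 numbers from above; it ignores latency hiding and assumes only one kernel runs at a time):

    #include <stdio.h>

    int main(void) {
        const int wavefront = 64, simds_per_cu = 4, cus = 60;
        const int gpu_width = wavefront * simds_per_cu * cus;  /* 15360 threads per "full wave" */
        const int counts[] = {4071, 15000, 16000, 21016};      /* example thread counts from this thread */

        for (int i = 0; i < 4; i++) {
            int full_waves = counts[i] / gpu_width;             /* waves that occupy the whole GPU */
            int tail = counts[i] % gpu_width;                   /* threads left over for the last wave */
            printf("%6d threads: %d full wave(s), last wave %.0f%% occupied\n",
                   counts[i], full_waves, 100.0 * tail / gpu_width);
        }
        return 0;
    }

For 21016 threads this gives one full wave plus a last wave at roughly 37% occupancy, and for 4071 threads a single wave at roughly 26% occupancy, matching the estimates above.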

A critical cross-check of the above thoughts is welcome!

ThWuensche avatar Oct 08 '20 19:10 ThWuensche

Can you test it and see? You can use the script in https://github.com/openmm/openmm/issues/2875#issuecomment-704521525 to generate and time simulations with any number of atoms you want. Do you find a large change between 15,000 and 16,000?

As a conclusion, to occupy the GPU completely, not 3840 threads are required, as might seem obvious (wavefront size 64 * 60 CUs), but 15360 threads (wavefront size 64 * 4 SIMDs/CU * 60 CUs), taking 4 cycles.

Ideally you want even more than that. GPUs rely on having lots of threads that they can switch between to cover latency. If your number of threads exactly matches the width of the GPU, each compute unit will be busy until it needs to load something from global memory, and then it will stall for a few hundred clock cycles while waiting for the data to arrive. You try to cover this by having other threads it can switch to while the first thread is stuck waiting for data. When possible, we set the number of workgroups to a multiple of the number of compute units:

https://github.com/FoldingAtHome/openmm/blob/9998958af39c657d26da68aad0177fb8c45180a1/platforms/opencl/src/OpenCLContext.cpp#L236-L241

In the case above, though, we cap it at a lower value because the number of atoms is so low. There just isn't enough work to fill that many workgroups.
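
The heuristic is roughly the following (a simplified sketch, not the actual code linked above; the function name and the groups-per-CU factor are illustrative only):

    /* Pick a workgroup count that is a multiple of the CU count so every
     * compute unit has several groups to switch between, but never launch
     * more groups than the available work can fill. */
    int chooseNumGroups(int workUnits, int blockSize, int numComputeUnits) {
        int groupsPerComputeUnit = 4;                          /* illustrative latency-hiding factor */
        int target = groupsPerComputeUnit * numComputeUnits;
        int needed = (workUnits + blockSize - 1) / blockSize;  /* groups actually required */
        return needed < target ? needed : target;              /* cap when there isn't enough work */
    }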

Efficiently using a high end GPU to simulate a small system is always going to be challenging (and beyond a certain point, impossible). It's best if we can target the small projects to low end GPUs, keeping the high end GPUs free for the large systems that make good use of them.

peastman avatar Oct 08 '20 20:10 peastman

Can you test it and see? You can use the script in openmm#2875 (comment) to generate and time simulations with any number of atoms you want. Do you find a large change between 15,000 and 16,000?

After writing that, it sprang to my mind that I could check it with the figures we had from our test runs. Interestingly, the result was different: from 15000 to 16000 atoms the overall time actually dropped. I'm trying to analyze that with traces from the runs, but don't understand it yet.

Ideally you want even more than that. GPUs rely on having lots of threads that they can switch between to cover latency. If your number of threads exactly matches the width of the GPU, each compute unit will be busy until it needs to load something from global memory, and then it will stall for a few hundred clock cycles while waiting for the data to arrive. You try to cover this by having other threads it can switch to while the first thread is stuck waiting for data. When possible, we set the number of workgroups to a multiple of the number of compute units:

I have just studied the thread counts you put into the different kernels and wondered why the nonbonded calculation had a fixed size. Understood that you chose 4*15360 as the fixed size. The reduced time for 16000 compared to 15000 actually might result from the fact that - with only single occupancy - the addition up to 16000 somewhat reduces the starvation. At higher multiples of 15360 that effect does not show; in that table we have a serious increase in time when we exceed triple occupancy:

45792 | 1431 | 5.13 | 4.98
50442 | 1577 | 6.13 | 5.99

So probably the general idea was right, but I did not consider the "starved" issue. Probably exceeding 15360 first comes at no cost at all, since the additional work is done while the GPU would anyhow be waiting. At higher multiples, when the GPU would not lose a lot of cycles waiting, the step beyond a multiple of the effective thread size really adds additional time and reduces efficiency.

Efficiently using a high end GPU to simulate a small system is always going to be challenging (and beyond a certain point, impossible). It's best if we can target the small projects to low end GPUs, keeping the high end GPUs free for the large systems that make good use of them.

Agreed, but I see many small projects (like the example above with 21000 atoms) assigned to my GPUs, delivering low performance. And with GPUs getting larger and larger, F@H probably will not have enough small GPUs for some projects in the future. That will then be the time to run simulations in parallel.

Sorry if, as a newbie, I come up with ideas that you have long known to be nonsense. But it still seems helpful to keep one's eyes open and put ideas forward, so let's just discard those that are nonsense.

ThWuensche avatar Oct 08 '20 21:10 ThWuensche

I'm just trying to understand the traces and why, according to them, the GPU is idle about 1/3 of the time; kernels are active only 2/3 of the time. First I thought it would be kernel schedule latency, idle time after one kernel is finished until the next is started. However, if I look at the time CompleteNS - DispatchNS of the next kernel, it seems that a number of kernels are dispatched ahead, before the previous one completes, since this difference is negative - that's the column before StartLat. Still there are (in most cases) about 4us between end of one kernel and start of the next: the column Idle is the difference between BeginNS and EndNS of the kernel before.

[attached screenshot: kernel trace timing table]

Or are kernels dispatched in advance by the application, but the runtime system (maybe in the Linux kernel driver) waits until a kernel has finished before it executes the next one, which had been scheduled earlier by the application?

ThWuensche avatar Oct 08 '20 22:10 ThWuensche

Probably exceeding 15360 first comes at no cost at all, since the additional work is done while the GPU would anyhow be waiting. At higher multiples, when the GPU would not lose a lot of cycles waiting, the step beyond a multiple of the effective thread size really adds additional time and reduces efficiency.

That sounds right.

Still there are (in most cases) about 4us between end of one kernel and start of the next

If so, they've improved! AMD used to have an overhead of about 10 us for each kernel. That may still be the case on Windows, which has higher overhead than Linux. This is why we try to minimize the number of separate kernel launches.

Agreed, but I see many small projects (like the example above with 21000 atoms) assigned to my GPUs, delivering low performance.

Yes, we definitely need to do something about that!

peastman avatar Oct 08 '20 22:10 peastman

BTW, if it is of interest to you, here are stats for runs with width 5.25 (longer runtime), the run with an atom count just below the limit, and with width 5.5 (shorter runtime), just above the limit. The difference lies mostly in computeNonbonded, which runs with a fixed thread size.

[attached screenshots: screenshot_w5_25, screenshot_w5_5]

ThWuensche avatar Oct 08 '20 22:10 ThWuensche

The test with higher granularity. In general it shows an over-proportional increase in time when the atom count crosses a multiple of the maximum thread size of the GPU. However, the increase is lower than expected, and there is an irregularity at the first step, where it would be most critical. That irregularity seems to come from computeNonbonded, which has a fixed thread size and as such does not look related to that point. In general there is some noise in the measured values; for example at atom count 113958, the time deltas are not "smooth".

Atoms    Blocks (atoms/32)    Time [s]    dTime [s]
12255 383 2,91248488426208
12990 406 2,88024377822876 -0,032241106033325
13836 433 3,05391716957092 0,173673391342163
14514 454 3,02331209182739 -0,03060507774353
15477 484 2,71697306632996 -0,306339025497437
16272 509 2,79023265838623 0,073259592056275
17178 537 2,82571578025818 0,035483121871948
18138 567 2,97455453872681 0,148838758468628
19026 595 3,0194935798645 0,044939041137695
20340 636 3,12287950515747 0,103385925292969
21384 669 3,20537495613098 0,082495450973511
21480 672 3,13328719139099 -0,07208776473999
23568 737 3,30896997451782 0,175682783126831
24498 766 3,38099765777588 0,072027683258057
25572 800 3,47666311264038 0,095665454864502
26964 843 3,50756669044495 0,030903577804566
28179 881 3,61467909812927 0,107112407684326
29424 920 3,70178055763245 0,087101459503174
30789 963 3,94584131240845 0,244060754776001
32325 1011 4,00474500656128 0,058903694152832
33840 1058 4,01952767372131 0,014782667160034
35211 1101 4,11354827880859 0,09402060508728
36708 1148 4,28623414039612 0,172685861587524
38217 1195 4,39107799530029 0,104843854904175
39825 1245 4,42315292358398 0,032074928283691
41358 1293 4,66109704971314 0,23794412612915
43305 1354 4,88815975189209 0,227062702178955
44958 1405 4,89233231544495 0,004172563552856
46629 1458 5,56133365631104 0,669001340866089
48441 1514 5,58660101890564 0,025267362594605
50415 1576 5,90663361549377 0,320032596588135
52341 1636 6,04964709281921 0,143013477325439
54444 1702 6,23829650878906 0,188649415969849
56142 1755 6,32623052597046 0,087934017181397
58515 1829 6,49991869926453 0,173688173294067
60414 1888 6,56362867355347 0,063709974288941
62562 1956 6,86830520629883 0,304676532745361
64830 2026 6,90527391433716 0,03696870803833
66906 2091 7,02717351913452 0,121899604797363
69921 2186 7,35353469848633 0,326361179351807
72279 2259 7,55564570426941 0,202111005783081
72495 2266 7,51527237892151 -0,0403733253479
77148 2411 7,8772234916687 0,361951112747192
79191 2475 7,98567771911621 0,10845422744751
81540 2549 8,17700910568237 0,191331386566162
84534 2642 8,2921895980835 0,115180492401123
87087 2722 8,45307517051697 0,160885572433472
89712 2804 8,62603664398193 0,172961473464966
92592 2894 9,14751052856445 0,52147388458252
95766 2993 9,43603682518005 0,288526296615601
98880 3090 9,76397943496704 0,327942609786987
101697 3179 10,0979392528534 0,333959817886353
104721 3273 10,1487267017365 0,050787448883057
107736 3367 10,4614281654358 0,312701463699341
110937 3467 10,7668445110321 0,305416345596313
113958 3562 11,3086459636688 0,541801452636719
117774 3681 11,4619560241699 0,153310060501099
120948 3780 11,6951687335968 0,23321270942688
124188 3881 12,1348485946655 0,439679861068726
127650 3990 12,4355757236481 0,300727128982544

ThWuensche avatar Oct 09 '20 07:10 ThWuensche

Could an out-of-order command queue with explicit synchronization avoid the gap between the execution of subsequent kernels? Is that worth experimenting with? If the gaps could be avoided, for short-running kernels theoretically about 50% (or more) additional throughput could be gained, cutting execution time by 1/3 to 1/2. Or am I missing something?

ThWuensche avatar Oct 09 '20 13:10 ThWuensche

Could an out-of-order command queue with explicit synchronization avoid the gap between the execution of subsequent kernels?

The gap reflects the work the driver and GPU have to do to set up a new kernel. It's independent of any synchronization between kernels.

CUDA 10 introduced a new mechanism that can precompute some of that work to reduce the overhead. See https://developer.nvidia.com/blog/cuda-graphs. Some day when we move to CUDA 10 as our minimum supported version I hope to be able to make use of it. But there's no similar mechanism in OpenCL.

peastman avatar Oct 09 '20 15:10 peastman

According to the documentation, with out-of-order execution kernels should be able to run in parallel, so the next kernel might already be set up and started while the previous one is still running, thus closing/avoiding that gap. But I tried it, and on my platform it does not seem to work; kernels are not started in parallel, but one after the other :(.
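
For reference, this is roughly the kind of setup I mean (a minimal OpenCL sketch; kernelA, kernelB and the sizes are placeholders, error handling omitted):

    #include <CL/cl.h>

    /* Enqueue two kernels on an out-of-order queue, with an explicit event
     * dependency so kernelB still runs after kernelA. */
    void runOutOfOrder(cl_context context, cl_device_id device,
                       cl_kernel kernelA, cl_kernel kernelB) {
        cl_queue_properties props[] = {
            CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
        };
        cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, props, NULL);

        size_t global = 21056, local = 64;   /* placeholder sizes */
        cl_event evA;
        clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &global, &local, 0, NULL, &evA);
        clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &global, &local, 1, &evA, NULL);

        clFinish(queue);
        clReleaseEvent(evA);
        clReleaseCommandQueue(queue);
    }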

ThWuensche avatar Oct 09 '20 21:10 ThWuensche

@ThWuensche Your trace has a significant amount of time in findBlocksWithInteractions. Can you try changing 32 to 64 in findInteractingBlocks.cl to check if performance is better or worse?

#if SIMD_WIDTH <= 32

bdenhollander avatar Oct 10 '20 12:10 bdenhollander

That makes it spend even more time there. And more of a question: the time in computeNonbonded is also increasing. How could that be related to the block size in another kernel? It should not have any influence on the data. Probably worth analyzing; I'll see whether I can understand what's going on. Does that buffer size influence the actual data, e.g. more data to be computed in computeNonbonded?

Also, a lot in that kernel is hard-coded to "32", including things depending on local_index. That probably puts two "warps" into one "wavefront".

[attached screenshot: kernel timing trace]

ThWuensche avatar Oct 10 '20 14:10 ThWuensche

That makes it spend even more time there. And more of a question: the time in computeNonbonded is also increasing. How could that be related to the block size in another kernel? It should not have any influence on the data. Probably worth analyzing; I'll see whether I can understand what's going on. Does that buffer size influence the actual data, e.g. more data to be computed in computeNonbonded?

Changing the if makes GCN use the same code path as NVIDIA and Navi. I wanted to confirm that the comments around L277 are still accurate for wide GCN GPUs.

Buffer size might be optimized to fit within local memory for NVIDIA, which is larger than GCN.

CL_DEVICE_LOCAL_MEM_SIZE
AMD GCN: 32KB
NVIDIA: 48KB
AMD Navi: 64KB

bdenhollander avatar Oct 10 '20 15:10 bdenhollander

Just saw that when I looked into the code. It's not only the size of that buffer; it selects a completely different implementation.

But why does that have influence on computeNonbonded?

Regarding the local memory: the LDS (local data share) on my version of GCN should be 64KB; I don't know whether it was less on older GCN versions.

ThWuensche avatar Oct 10 '20 15:10 ThWuensche

The other question is why these calculations (findBlocksWithInteractions and computeNonbonded) take more time for lower atom counts than for higher atom counts.

ThWuensche avatar Oct 10 '20 15:10 ThWuensche

But why does that have influence on computeNonbonded?

I don't know enough about what's happening internally to know whether the kernels are stalling each other or the results of the two algorithms are slightly different because of things like repeated multiplications or partial sums.

Regarding the local memory: the LDS (local data share) on my version of GCN should be 64KB; I don't know whether it was less on older GCN versions.

CompuBench reports 32KB for the Radeon VII. The RX 460 (GCN 4.0) and the Spectre R7 Graphics APU (GCN 2.0) have 32KB as well.

bdenhollander avatar Oct 10 '20 16:10 bdenhollander

The other question is why these calculations (findBlocksWithInteractions and computeNonbonded) take more time for lower atom counts than for higher atom counts.

As a percentage of total execution time or average wall duration per call?

bdenhollander avatar Oct 10 '20 17:10 bdenhollander

CompuBench reports 32KB for the Radeon VII. The RX 460 (GCN 4.0) and the Spectre R7 Graphics APU (GCN 2.0) have 32KB as well.

Think they got something wrong. My Radeon VII is gfx906, that's part of the output of clinfo:

Local memory size:                             65536

...

Name:                                          gfx906
Vendor:                                        Advanced Micro Devices, Inc.

ThWuensche avatar Oct 10 '20 17:10 ThWuensche

As a percentage of total execution time or average wall duration per call?

Both, I think. The simulated system was created by the script from Peter Eastman in that issue. The traces above are from two runs of that script, with atom count and run time according to lines 2 and 3 in the table; the run with the bigger atom count has about 8% lower runtime, correlating with the lower time for these two kernels in the traces.

ThWuensche avatar Oct 10 '20 18:10 ThWuensche

Think they got something wrong. My Radeon VII is gfx906, that's part of the output of clinfo:

Local memory size:                             65536

Interesting, is that the value for CL_DEVICE_LOCAL_MEM_SIZE or CL_DEVICE_LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD?

RX 460:

Local memory size                               32768 (32KiB)
Local memory size per CU (AMD)                  65536 (64KiB)

This guide mentions LDS is 64KB with a maximum allocation of 32KB per workgroup.
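
For anyone who wants to check the standard value on their own device, something like this should do it (a minimal sketch; it just picks the first GPU of the first platform and omits error handling):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_ulong localMem = 0;   /* CL_DEVICE_LOCAL_MEM_SIZE is reported in bytes */
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMem), &localMem, NULL);
        printf("CL_DEVICE_LOCAL_MEM_SIZE: %llu bytes\n", (unsigned long long)localMem);
        return 0;
    }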

bdenhollander avatar Oct 10 '20 19:10 bdenhollander

Found this tidbit about Vega LDS size:

[C]urrently Vega has 64 KB local/shared memory enabled on Linux, but 32 KB on Windows.

https://community.amd.com/thread/246353

bdenhollander avatar Oct 10 '20 19:10 bdenhollander

I meant LDS, and LDS is per CU. I was not aware of the 32KB-per-workgroup limitation.

I just tried; I can define a buffer:

__local uint buffer[8192+4096];

and the compiler does not complain.

            for (j=8192+4096-1; j>=0; j--)
                buffer[j] = (uint)j;
            out[0] = buffer[8192+4096-1];

out[0] comes up with 0x2fff, so I guess I have more than 32KB.

That is with ROCm 3.7 on Linux.

ThWuensche avatar Oct 10 '20 20:10 ThWuensche

Changing the if makes GCN use the same code path as NVIDIA and Navi. I wanted to confirm that the comments around L277 are still accurate for wide GCN GPUs.

They're still correct. On a GPU with a SIMD width of 64, that kernel has severe thread divergence.

And more of a question: the time in computeNonbonded is also increasing.

I don't know of any reason that would be affected.

peastman avatar Oct 10 '20 21:10 peastman

I'm just reading computeNonbonded and thinking about thread divergence. There is the distinction between tiles on diagonal and off diagonal. If on a wavefront size of 64 two tiles are in one wavefront and one is on diagonal, one off diagonal, the wavefront probably would see rather large thread divergence, is that correct?

ThWuensche avatar Oct 10 '20 21:10 ThWuensche

I don't know of any reason that would be affected.

The traces above seem to indicate an effect on the compute time of computeNonbonded depending on that selection. To avoid drawing false conclusions, I will recheck.

ThWuensche avatar Oct 10 '20 21:10 ThWuensche

If on a wavefront size of 64 two tiles are in one wavefront and one is on diagonal, one off diagonal, the wavefront probably would see rather large thread divergence, is that correct?

Correct. That's why we sort the tiles differently on GPUs with SIMD width 64, to put all the diagonal ones together and all the off diagonal ones together.

https://github.com/FoldingAtHome/openmm/blob/9998958af39c657d26da68aad0177fb8c45180a1/platforms/opencl/src/OpenCLNonbondedUtilities.cpp#L175-L187
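
As a toy illustration of the sorting idea (not the linked OpenMM implementation; the tile layout here is made up): partitioning the tile list so that all on-diagonal tiles come first means each 64-wide wavefront mostly processes tiles that take the same branch.

    #include <stdlib.h>

    typedef struct { int x, y; } Tile;      /* a tile is a pair of 32-atom blocks */

    static int diagonalFirst(const void* a, const void* b) {
        int da = ((const Tile*)a)->x == ((const Tile*)a)->y;  /* 1 if on the diagonal */
        int db = ((const Tile*)b)->x == ((const Tile*)b)->y;
        return db - da;                      /* diagonal tiles sort to the front */
    }

    void sortTiles(Tile* tiles, size_t count) {
        qsort(tiles, count, sizeof(Tile), diagonalFirst);
    }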

peastman avatar Oct 10 '20 21:10 peastman

I don't know of any reason that would be affected.

The traces above seem to indicate an effect on the compute time of computeNonbonded depending on that selection. To avoid drawing false conclusions, I will recheck.

Confirmed that it does make a difference. The first trace is with the standard findInteractingBlocks path for AMD, the second with the algorithm for the other platforms. The execution time of computeNonbonded gets significantly longer.

[attached screenshots: trace with the AMD path, trace with the other-platform path]

ThWuensche avatar Oct 10 '20 22:10 ThWuensche

Correct. That's why we sort the tiles differently on GPUs with SIMD width 64, to put all the diagonal ones together and all the off diagonal ones together.

Thanks for the explanation. I had noticed that difference in the sort, but did not understand the meaning/consequences.

ThWuensche avatar Oct 10 '20 22:10 ThWuensche

About the results from the test script: what concerns me is that from width 5.25 to width 5.5 the execution time drops by 0.2s, even though the system size increases. That difference comes just from these two kernels. That, together with the effect above, leaves a question mark for me.

ThWuensche avatar Oct 10 '20 22:10 ThWuensche

Try modifying the line in the script that creates the context to instead be

context = Context(system, integrator, Platform.getPlatformByName('OpenCL'), {'DisablePmeStream':'true'})

After you do that, do you still see a difference in the speed of the nonbonded kernel?

peastman avatar Oct 11 '20 05:10 peastman

The difference remains: [attached screenshots] I will try to include trace markers to find where it happens. Edit: markers in roctracer seem to be related to host code, not kernel code, so that route probably is not available; I don't know how to extract the runtime of parts of kernel code, like the runtime of a loop.

ThWuensche avatar Oct 11 '20 09:10 ThWuensche