
CP2K performs poorly on AMD platforms when using the DBCSR HIP backend.

Schroedingers0216 opened this issue 1 year ago • 20 comments

I am writing to seek your assistance. When running CP2K simulations on the AMD MI50 platform with the DBCSR backend set to HIP, the execution time is longer than when running on the CPU alone. Using HIPprof to examine the API calls, I noticed a large number of H2D (Host to Device) transfers but no kernel launches. Normally, the call flow should be H2D -> LaunchKernel -> D2H (Device to Host). I would like to understand why there are so many H2D transfers and where in the code this occurs. Below, I have attached the JSON file, which you can open in chrome://tracing.

Thank you.
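For anyone reproducing this analysis: a chrome://tracing JSON is just a list of named events, so the transfer/kernel imbalance can be counted with a few lines of shell. This is a minimal sketch; the event names (hipMemcpyHtoD, hipLaunchKernel) and the miniature sample trace are illustrative assumptions, not taken from the attached file.

```shell
# Hypothetical miniature Chrome-trace file standing in for the real one.
cat > trace.json <<'EOF'
{"traceEvents":[
 {"name":"hipMemcpyHtoD","ph":"X"},
 {"name":"hipMemcpyHtoD","ph":"X"},
 {"name":"hipLaunchKernel","ph":"X"},
 {"name":"hipMemcpyDtoH","ph":"X"}
]}
EOF

# Count transfer vs. kernel events; a large H2D count with few (or no)
# kernel launches matches the pattern described above.
h2d=$(grep -c 'hipMemcpyHtoD' trace.json)
kern=$(grep -c 'hipLaunchKernel' trace.json)
echo "H2D=$h2d kernels=$kern"
```

On a real trace, comparing these two counts per multiplication phase quickly shows whether transfers are paired with launches.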

Schroedingers0216 avatar Jul 03 '24 09:07 Schroedingers0216

If possible, can you share the input file and perhaps the profile output when running the workload? The profile output is what contains the timings printed by CP2K at the end. What's clear already, this is not only about DBCSR but also CP2K's GRID components (collocate/integrate), perhaps even some PW, etc.

Regarding, "H2D -> LaunchKernel -> D2H" - this is idealized assuming only a single transfer/array is the input of such kernel and in turn for the output/result as well.

hfp avatar Jul 03 '24 09:07 hfp

I tried setting the DBCSR backend to other options and did not see a large number of H2D transfers in HIPprof, so I believe DBCSR is causing the issue. It might also be due to the transpose_d kernel; I could not locate the specific code responsible for the numerous H2D transfers. Below, I have attached the test file and output file. Thank you. @hfp test.tar.gz

Schroedingers0216 avatar Jul 04 '24 01:07 Schroedingers0216

For the record, if there are "unnecessary" data transfers, i.e. transfers that could be combined or avoided, this issue applies to all backends and GPUs/vendors alike. The hint about transposes might be a first step.

@zhl201226 you may try DBCSR_RUN_ON_GPU=0 environment variable and recapture the GPU-profile. This environment variable disables DBCSR on GPUs even if the support is compiled into the application (and leaves the other uses of CP2K on GPUs intact).
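As a sketch, the variable just needs to be exported before starting CP2K. The launch line below (binary name, rank count, input file) is a placeholder, not taken from this thread:

```shell
# Disable DBCSR's GPU path while leaving CP2K's other GPU usage intact.
export DBCSR_RUN_ON_GPU=0

# Placeholder launch line; adjust ranks, binary, and input to your setup.
# mpirun -np 8 ./cp2k.psmp -i test.inp -o test.out
echo "DBCSR_RUN_ON_GPU=${DBCSR_RUN_ON_GPU}"
```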

hfp avatar Jul 04 '24 07:07 hfp

Looking at CP2K's profile, local GEMMs (cp_fm_gemm) consume ~25% of the total time to solution on this system (just as a note). However, multiply_cannon* and dbcsr_mm_hostdrv_process are interesting. Given that dbcsr_mm_hostdrv_process is relatively high, it seems a reasonable portion of fallbacks is happening. With the previous implementation, the fallbacks may be accompanied by transfers without actually launching a kernel.

hfp avatar Jul 04 '24 07:07 hfp

I have identified that the H2D issue occurs in the dbcsr_mm_accdrv_process module. Is this module dividing the data into small chunks for transfer? Can they be merged into larger chunks? Additionally, I previously did not use ACC to accelerate DBCSR, but the run seems to take longer now, so I am not sure whether DBCSR_RUN_ON_GPU=0 is effective. Could you please provide more optimization suggestions?

Schroedingers0216 avatar Jul 04 '24 07:07 Schroedingers0216

Sorry, I guess DBCSR_RUN_ON_GPU is only supported in the most recent, if not unreleased, version. It was not meant as an optimization suggestion but rather as a way to systematically rule out or implicate DBCSR. Your example input is worth looking at for contributors.

hfp avatar Jul 04 '24 07:07 hfp

How do I contact contributors? @hfp

Schroedingers0216 avatar Jul 04 '24 07:07 Schroedingers0216

Just give it some time; they will see this open issue ;-)

hfp avatar Jul 04 '24 07:07 hfp

thank you :-)

Schroedingers0216 avatar Jul 04 '24 07:07 Schroedingers0216

(Side note: GLOBAL| CPU model name does not show up in the log ;-)

hfp avatar Jul 04 '24 08:07 hfp

Regarding the test input, it's missing the restart file for the SCF initial guess. Commenting it out starts the run from an unreasonable guess, which then fails in the Cholesky decomposition.

hfp avatar Jul 04 '24 08:07 hfp

By the way, using DBCSR_RUN_ON_GPU=0 did not significantly improve performance. The CPU model name has been hidden for other reasons, but I can provide it if needed.

Schroedingers0216 avatar Jul 04 '24 08:07 Schroedingers0216

This restart file is too large to upload. Is there another way to send it to you?

Schroedingers0216 avatar Jul 04 '24 08:07 Schroedingers0216

Hmm, others may have the same request so Dropbox or something like this comes to mind. My e-mail is my . name @ intel . com.

hfp avatar Jul 04 '24 09:07 hfp

I have already sent it to you via email. thank you

Schroedingers0216 avatar Jul 04 '24 09:07 Schroedingers0216

( Let's see, the e-mail did not arrive yet perhaps size restrictions )

hfp avatar Jul 04 '24 15:07 hfp

I have resent it to [email protected]. Please check it. Best regards

Schroedingers0216 avatar Jul 05 '24 01:07 Schroedingers0216

Literally? I envisioned my.name would be my name taken from https://github.com/hfp (hans.pabst). Sorry for the confusion.

hfp avatar Jul 08 '24 14:07 hfp

Sure, I also sent an email to [email protected]; my email address is [[email protected]].

Schroedingers0216 avatar Jul 09 '24 02:07 Schroedingers0216

The important CP2K timers for your execution are the following:

grid_integrate_task_list         340.326
grid_collocate_task_list         377.996
multiply_multrec                 523.278
cp_fm_syevd_base                 637.115
cp_fm_redistribute_end           639.139
dbcsr_mm_hostdrv_process        1229.836
cp_gemm_cosma                   2335.899
CP2K_Total                      8183.616

Now, I would assume you are running COSMA on the GPU, so you cannot gain more there. Then I see cp_fm_syevd_base; I am not sure whether ELPA can give some benefit there, and the same may hold for https://github.com/eth-cscs/DLA-Future. The grid parts are already running on the GPU.

Concerning DBCSR, the important part is the DBCSR kernel output:

 -------------------------------------------------------------------------------
 -                                                                             -
 -                                DBCSR STATISTICS                             -
 -                                                                             -
 -------------------------------------------------------------------------------
 COUNTER                                    TOTAL       BLAS       SMM       ACC
 flops     1 x     1 x     1                 3610       0.0%    100.0%      0.0%
 flops     1 x     1 x     5                19040       0.0%    100.0%      0.0%
...
 flops total                       537.243062E+12       0.0%     96.5%      3.5%
 flops max/rank                     35.731363E+12       0.0%     96.5%      3.5%
 matmuls inhomo. stacks                         0       0.0%      0.0%      0.0%
 matmuls total                        22844500215       0.0%     98.8%      1.2%
 number of processed stacks               3196393       0.0%     92.7%      7.3%
 average stack size                                     0.0    7614.1    1217.7

Basically, 98.8% of the block multiplications are running on the CPU (SMM column); only 1.2% run on the GPU (ACC column). The reason is that your kernel sizes are not present in the GPU tuned-parameters list. There are several ways to improve the situation (in order of preference):

  1. Run the tuning procedure for the parameters you are interested in, and contribute them to the current list.
  2. Try setting export DBCSR_MM_DENSE=1; the list of kernels should change and possibly more kernels will run on the GPU.
  3. Use the latest DBCSR (v2.7.0-rc2), which provides a default GPU kernel when tuned kernels are not available.
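For suggestion 2, a minimal sketch; the launch line (binary, ranks, input file) is a placeholder, not taken from this thread:

```shell
# Ask DBCSR to treat the blocked multiplication as dense, which changes
# the generated kernel list and may map more kernels onto the GPU.
export DBCSR_MM_DENSE=1

# Placeholder launch line; adjust to your own binary and input.
# mpirun -np 8 ./cp2k.psmp -i test.inp -o test.out
echo "DBCSR_MM_DENSE=${DBCSR_MM_DENSE}"
```

Afterwards, the DBCSR STATISTICS block of the new run shows whether the ACC column has grown.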

alazzaro avatar Jul 10 '24 19:07 alazzaro

I will debug based on your suggestions later, but since that process will take a while, I will close the issue for now. Thank you very much.

Schroedingers0216 avatar Jul 11 '24 03:07 Schroedingers0216

I am sure the OpenCL backend can be mixed with HIP as well (just like with CUDA). However, I have not spent any time exercising this. It comes down to support in the build system on CP2K's side. In any case, I will keep HIP in mind when taking on this task (it is still open for me to get DBM/DBT and DBCSR based on OpenCL into CP2K's CMake).

hfp avatar Jul 11 '24 07:07 hfp

Sorry, but I have to reopen this issue.

1. When using the default GPU kernel, the dbcsr_mm_accdrv_process module is called very frequently; the timing report shows:
dbcsr_mm_accdrv_process 148432 18.9 90.392 111.997 148.383 177.363
I have to suspect that this is the main reason for the performance issue. In contrast, when using CUDA, this function does not appear in the final timing report at all.

2. During kernel tuning, there are a large number of VMFAULT errors. I have tried making modifications, but there has been no significant improvement. How should I resolve this? Invalid address access: 0x343b05325000, Error code: 1.

KERNEL VMFault !!!! <<<<<<
PID: 8097, SIGNAL: 0 !!!! <<<<<<
=========> HOSTQUEUE <0x1b59b0f0>: VMFault HSA QUEUE ANALYSIS <=========
HOSTQUEUE <0x1b59b0f0>: get hsa queue W/R ptr: write index: 62961, read index: 62957
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> DUMP KERNEL AQL PACKET <<<<<<<<<
HOSTQUEUE <0x1b59b0f0>: header: 2818
HOSTQUEUE <0x1b59b0f0>: setup: 3
HOSTQUEUE <0x1b59b0f0>: workgroup: x:128, y:1, z:1
HOSTQUEUE <0x1b59b0f0>: grid: x:128128, y:1, z:1
HOSTQUEUE <0x1b59b0f0>: group_segment_size: 2240
HOSTQUEUE <0x1b59b0f0>: private_segment_size: 136
HOSTQUEUE <0x1b59b0f0>: kernel_object: 47532725616576
HOSTQUEUE <0x1b59b0f0>: device id: 0
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> FIND MATCH KERNEL COMMAND <<<<<<<<<
HOSTQUEUE <0x1b59b0f0>: kernel name: _Z20smm_acc_dnt_largeDB2ILi32ELi32ELi32ELi6ELi2ELi4ELi6ELi128ELi16ELi4EEvPKiiPKdS3_Pd
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> DUMP KERNEL ARGS: size: 40 <<<<<<<<<
00 00 c0 05 3b 2b 00 00 85 3e 00 00 00 00 00 00 00 00 20 f5 3a 2b 00 00 00 00 00 00 3b 2b 00 00 00 00 20 05 3b 2b 00 00
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> DUMP KERNEL ARGS PTR INFO <<<<<<<<<
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 0, ptr: 0x2b3b05c00000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3b05c00000, size byte: 192060
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 2, ptr: 0x2b3af5200000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3af5200000, size byte: 81920000
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 3, ptr: 0x2b3b00000000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3b00000000, size byte: 81920000
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 4, ptr: 0x2b3b05200000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3b05200000, size byte: 8192000
=========> HOSTQUEUE <0x1b1dbb60>: VMFault HSA QUEUE ANALYSIS <=========
params 6969 / 9136
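The CUDA-vs-HIP difference in where dbcsr_mm_accdrv_process shows up can be checked mechanically against the two timing reports. A sketch with hypothetical one-line log excerpts standing in for the real, much longer reports:

```shell
# Hypothetical excerpts standing in for the two CP2K timing reports.
printf 'dbcsr_mm_accdrv_process  148432  18.9  90.392\n' > hip.log
printf 'multiply_multrec  52000  10.0  523.278\n' > cuda.log

# The routine appears in the HIP report but not in the CUDA one.
for log in hip.log cuda.log; do
  echo "$log: $(grep -c dbcsr_mm_accdrv_process "$log")"
done
```

Running the same count over the real logs confirms whether the routine is genuinely absent from the CUDA run or merely below the timer-report threshold.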

@alazzaro @hfp

Schroedingers0216 avatar Aug 08 '24 01:08 Schroedingers0216

In my comment I gave some suggestions, especially on DBCSR. Since then, new DBCSR and CP2K releases are out (2024.2); have you tried them?

alazzaro avatar Aug 08 '24 04:08 alazzaro

Yes, I have tried all the suggestions you gave, but the results are not ideal, especially for these two issues.

Schroedingers0216 avatar Aug 08 '24 05:08 Schroedingers0216

Please post the two CP2K logs (CUDA and HIP). There is no reason why the calls of the function should differ between the two.

alazzaro avatar Aug 08 '24 05:08 alazzaro

CP2K.log The call to "dbcsr_mm_accdrv_process" appears only in the HIP run.

Schroedingers0216 avatar Aug 08 '24 05:08 Schroedingers0216

Additionally, the vmfault error is preventing me from training a suitable kernel.

Schroedingers0216 avatar Aug 08 '24 06:08 Schroedingers0216

You don't need the tuning part if you are using the new CP2K 2024.2. Could you add the DBCSR statistics to the log?

alazzaro avatar Aug 08 '24 06:08 alazzaro

Sure, here it is: DBCSR.log

Schroedingers0216 avatar Aug 08 '24 06:08 Schroedingers0216