
CP2K performs poorly on AMD platforms when using the DBCSR HIP backend.

Schroedingers0216 opened this issue 1 year ago • 20 comments

I am writing to seek your assistance. When running CP2K simulations on the AMD MI50 platform with the DBCSR backend set to HIP, the execution time is longer than when running on the CPU alone. Using HIPprof to examine the API calls, I noticed a large number of H2D (Host to Device) transfers but no kernel launches. Normally, the call flow should be H2D -> LaunchKernel -> D2H (Device to Host). I would like to understand why there are so many H2D transfers and where in the code this occurs. Below, I have attached the JSON file, which you can open in chrome://tracing.

Thank you.
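For anyone reproducing this analysis: a chrome://tracing JSON is just a list of named events, so the transfer/kernel imbalance can be counted with a few lines of shell. This is a minimal sketch; the event names (hipMemcpyHtoD, hipLaunchKernel) and the miniature sample trace are illustrative assumptions, not taken from the attached file.

```shell
# Hypothetical miniature Chrome-trace file standing in for the real one.
cat > trace.json <<'EOF'
{"traceEvents":[
 {"name":"hipMemcpyHtoD","ph":"X"},
 {"name":"hipMemcpyHtoD","ph":"X"},
 {"name":"hipLaunchKernel","ph":"X"},
 {"name":"hipMemcpyDtoH","ph":"X"}
]}
EOF

# Count transfer vs. kernel events; a large H2D count with few (or no)
# kernel launches matches the pattern described above.
h2d=$(grep -c 'hipMemcpyHtoD' trace.json)
kern=$(grep -c 'hipLaunchKernel' trace.json)
echo "H2D=$h2d kernels=$kern"
```

On a real trace, comparing these two counts per multiplication phase quickly shows whether transfers are paired with launches.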

Schroedingers0216 avatar Jul 03 '24 09:07 Schroedingers0216

If possible, can you share the input file and perhaps the profile output when running the workload? The profile output is what contains the timings printed by CP2K at the end. What's clear already, this is not only about DBCSR but also CP2K's GRID components (collocate/integrate), perhaps even some PW, etc.

Regarding, "H2D -> LaunchKernel -> D2H" - this is idealized assuming only a single transfer/array is the input of such kernel and in turn for the output/result as well.

hfp avatar Jul 03 '24 09:07 hfp

I tried setting the DBCSR backend to other options and did not see a large number of H2D transfers in HIPprof, so I believe DBCSR is causing the issue. It might also be due to the transpose_d kernel; I could not locate the specific code responsible for the numerous H2D transfers. Below, I have attached the test file and output file. Thank you. @hfp test.tar.gz

Schroedingers0216 avatar Jul 04 '24 01:07 Schroedingers0216

For the record, if there are "unnecessary" data transfers, i.e. transfers that could be combined or avoided, this issue applies to all backends and GPUs/vendors alike. The hint about transposes might be a first step.

@zhl201226 you may try DBCSR_RUN_ON_GPU=0 environment variable and recapture the GPU-profile. This environment variable disables DBCSR on GPUs even if the support is compiled into the application (and leaves the other uses of CP2K on GPUs intact).
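As a sketch, the variable just needs to be exported before starting CP2K. The launch line below (binary name, rank count, input file) is a placeholder, not taken from this thread:

```shell
# Disable DBCSR's GPU path while leaving CP2K's other GPU usage intact.
export DBCSR_RUN_ON_GPU=0

# Placeholder launch line; adjust ranks, binary, and input to your setup.
# mpirun -np 8 ./cp2k.psmp -i test.inp -o test.out
echo "DBCSR_RUN_ON_GPU=${DBCSR_RUN_ON_GPU}"
```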

hfp avatar Jul 04 '24 07:07 hfp

Looking at CP2K's profile, local GEMMs (cp_fm_gemm) consume ~25% of the total time to solution on this system (just as a note). However, multiply_cannon* and dbcsr_mm_hostdrv_process are interesting. Given that dbcsr_mm_hostdrv_process is relatively high, it seems a reasonable portion of fallbacks is happening. With the previous implementation, the fallbacks may be accompanied by transfers without actually launching a kernel.

hfp avatar Jul 04 '24 07:07 hfp

I have identified that the H2D issue occurs in the dbcsr_mm_accdrv_process module. Is this module dividing the data into small chunks for transfer? Can they be merged into larger chunks? Additionally, I previously did not use ACC to accelerate DBCSR, but the run seems to take longer now, so I am not sure whether DBCSR_RUN_ON_GPU=0 is effective. Could you please provide more optimization suggestions?

Schroedingers0216 avatar Jul 04 '24 07:07 Schroedingers0216

Sorry, I guess DBCSR_RUN_ON_GPU is only supported in the most recent, if not unreleased, version. It was not meant as an optimization suggestion but rather as a way to systematically rule out or implicate DBCSR. Your example input is worth looking at for contributors.

hfp avatar Jul 04 '24 07:07 hfp

How do I contact contributors? @hfp

Schroedingers0216 avatar Jul 04 '24 07:07 Schroedingers0216

Just give it some time; they will see this open issue ;-)

hfp avatar Jul 04 '24 07:07 hfp

thank you :-)

Schroedingers0216 avatar Jul 04 '24 07:07 Schroedingers0216

(Side note: GLOBAL| CPU model name does not show up in the log ;-)

hfp avatar Jul 04 '24 08:07 hfp

Regarding the test input, it's missing the restart file for the SCF initial guess. Commenting it out starts the run from an unreasonable guess, which then fails in the Cholesky decomposition.

hfp avatar Jul 04 '24 08:07 hfp

By the way, using DBCSR_RUN_ON_GPU=0 did not significantly improve performance. The CPU model name has been hidden for other reasons, but I can provide it if needed.

Schroedingers0216 avatar Jul 04 '24 08:07 Schroedingers0216

This restart file is too large to upload. Is there another way to send it to you?

Schroedingers0216 avatar Jul 04 '24 08:07 Schroedingers0216

Hmm, others may have the same request so Dropbox or something like this comes to mind. My e-mail is my . name @ intel . com.

hfp avatar Jul 04 '24 09:07 hfp

I have already sent it to you via email. thank you

Schroedingers0216 avatar Jul 04 '24 09:07 Schroedingers0216

( Let's see, the e-mail did not arrive yet perhaps size restrictions )

hfp avatar Jul 04 '24 15:07 hfp

I have resent it to [email protected]. Please check it. Best regards

Schroedingers0216 avatar Jul 05 '24 01:07 Schroedingers0216

Literally? I envisioned my.name would be my name taken from https://github.com/hfp (hans.pabst). Sorry for the confusion.

hfp avatar Jul 08 '24 14:07 hfp

Sure, I also sent an email to [email protected]; my email address is [[email protected]].

Schroedingers0216 avatar Jul 09 '24 02:07 Schroedingers0216

The important CP2K timers for your execution are the following:

grid_integrate_task_list         340.326
grid_collocate_task_list         377.996
multiply_multrec                 523.278
cp_fm_syevd_base                 637.115
cp_fm_redistribute_end           639.139
dbcsr_mm_hostdrv_process        1229.836
cp_gemm_cosma                   2335.899
CP2K_Total                      8183.616

Now, I would assume you are running COSMA on the GPU, so you cannot gain more there. Then I see cp_fm_syevd_base; I am not sure whether ELPA can give some benefit there, and the same may hold for https://github.com/eth-cscs/DLA-Future. The grid parts are already running on the GPU.

Concerning DBCSR, the important part is the DBCSR kernel output:

 -------------------------------------------------------------------------------
 -                                                                             -
 -                                DBCSR STATISTICS                             -
 -                                                                             -
 -------------------------------------------------------------------------------
 COUNTER                                    TOTAL       BLAS       SMM       ACC
 flops     1 x     1 x     1                 3610       0.0%    100.0%      0.0%
 flops     1 x     1 x     5                19040       0.0%    100.0%      0.0%
...
 flops total                       537.243062E+12       0.0%     96.5%      3.5%
 flops max/rank                     35.731363E+12       0.0%     96.5%      3.5%
 matmuls inhomo. stacks                         0       0.0%      0.0%      0.0%
 matmuls total                        22844500215       0.0%     98.8%      1.2%
 number of processed stacks               3196393       0.0%     92.7%      7.3%
 average stack size                                     0.0    7614.1    1217.7

Basically, 98.8% of the block multiplications are running on the CPU (SMM column); only 1.2% run on the GPU (ACC column). The reason is that your kernel sizes are not present in the GPU tuned-parameters list. There are several ways to improve the situation (in order of preference):

  1. Run the tuning procedure for the parameters you are interested in, and contribute them to the current list.
  2. Try setting export DBCSR_MM_DENSE=1; the list of kernels should change and possibly more kernels will run on the GPU.
  3. Use the latest DBCSR (v2.7.0-rc2), which provides a default GPU kernel when tuned kernels are not available.
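For suggestion 2, a minimal sketch; the launch line (binary, ranks, input file) is a placeholder, not taken from this thread:

```shell
# Ask DBCSR to treat the blocked multiplication as dense, which changes
# the generated kernel list and may map more kernels onto the GPU.
export DBCSR_MM_DENSE=1

# Placeholder launch line; adjust to your own binary and input.
# mpirun -np 8 ./cp2k.psmp -i test.inp -o test.out
echo "DBCSR_MM_DENSE=${DBCSR_MM_DENSE}"
```

Afterwards, the DBCSR STATISTICS block of the new run shows whether the ACC column has grown.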

alazzaro avatar Jul 10 '24 19:07 alazzaro

I will debug based on your suggestions later, but since that process will take a while, I will close the issue for now. Thank you very much.

Schroedingers0216 avatar Jul 11 '24 03:07 Schroedingers0216

I am sure the OpenCL backend can be mixed with HIP as well (just like with CUDA). However, I have not spent any time exercising this. It comes down to support in the build system on CP2K's side. In any case, I will keep HIP in mind when taking on this task (it is still open for me to get DBM/DBT and DBCSR based on OpenCL into CP2K's CMake).

hfp avatar Jul 11 '24 07:07 hfp

Sorry, but I have to reopen this issue.

1. When using the default GPU kernel, the dbcsr_mm_accdrv_process module is called very frequently; the timing report shows:
dbcsr_mm_accdrv_process 148432 18.9 90.392 111.997 148.383 177.363
I have to suspect that this is the main reason for the performance issue. In contrast, when using CUDA, this function does not appear in the final timing report at all.

2. During kernel tuning, there are a large number of VMFAULT errors. I have tried making modifications, but there has been no significant improvement. How should I resolve this? Invalid address access: 0x343b05325000, Error code: 1.

KERNEL VMFault !!!! <<<<<<
PID: 8097, SIGNAL: 0 !!!! <<<<<<
=========> HOSTQUEUE <0x1b59b0f0>: VMFault HSA QUEUE ANALYSIS <=========
HOSTQUEUE <0x1b59b0f0>: get hsa queue W/R ptr: write index: 62961, read index: 62957
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> DUMP KERNEL AQL PACKET <<<<<<<<<
HOSTQUEUE <0x1b59b0f0>: header: 2818
HOSTQUEUE <0x1b59b0f0>: setup: 3
HOSTQUEUE <0x1b59b0f0>: workgroup: x:128, y:1, z:1
HOSTQUEUE <0x1b59b0f0>: grid: x:128128, y:1, z:1
HOSTQUEUE <0x1b59b0f0>: group_segment_size: 2240
HOSTQUEUE <0x1b59b0f0>: private_segment_size: 136
HOSTQUEUE <0x1b59b0f0>: kernel_object: 47532725616576
HOSTQUEUE <0x1b59b0f0>: device id: 0
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> FIND MATCH KERNEL COMMAND <<<<<<<<<
HOSTQUEUE <0x1b59b0f0>: kernel name: _Z20smm_acc_dnt_largeDB2ILi32ELi32ELi32ELi6ELi2ELi4ELi6ELi128ELi16ELi4EEvPKiiPKdS3_Pd
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> DUMP KERNEL ARGS: size: 40 <<<<<<<<<
00 00 c0 05 3b 2b 00 00 85 3e 00 00 00 00 00 00 00 00 20 f5 3a 2b 00 00 00 00 00 00 3b 2b 00 00 00 00 20 05 3b 2b 00 00
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> DUMP KERNEL ARGS PTR INFO <<<<<<<<<
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 0, ptr: 0x2b3b05c00000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3b05c00000, size byte: 192060
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 2, ptr: 0x2b3af5200000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3af5200000, size byte: 81920000
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 3, ptr: 0x2b3b00000000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3b00000000, size byte: 81920000
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 4, ptr: 0x2b3b05200000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3b05200000, size byte: 8192000
=========> HOSTQUEUE <0x1b1dbb60>: VMFault HSA QUEUE ANALYSIS <=========
params 6969 / 9136
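The CUDA-vs-HIP difference in where dbcsr_mm_accdrv_process shows up can be checked mechanically against the two timing reports. A sketch with hypothetical one-line log excerpts standing in for the real, much longer reports:

```shell
# Hypothetical excerpts standing in for the two CP2K timing reports.
printf 'dbcsr_mm_accdrv_process  148432  18.9  90.392\n' > hip.log
printf 'multiply_multrec  52000  10.0  523.278\n' > cuda.log

# The routine appears in the HIP report but not in the CUDA one.
for log in hip.log cuda.log; do
  echo "$log: $(grep -c dbcsr_mm_accdrv_process "$log")"
done
```

Running the same count over the real logs confirms whether the routine is genuinely absent from the CUDA run or merely below the timer-report threshold.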

@alazzaro @hfp

Schroedingers0216 avatar Aug 08 '24 01:08 Schroedingers0216

In my comment I gave some suggestions, especially on DBCSR. Since then, new DBCSR and CP2K releases are out (2024.2); have you tried them?

alazzaro avatar Aug 08 '24 04:08 alazzaro

Yes, I have tried all the suggestions you gave, but the results are not ideal, especially for these two issues.

Schroedingers0216 avatar Aug 08 '24 05:08 Schroedingers0216

Please post the two CP2K logs (CUDA and HIP). There is no reason why the calls of the function should differ between the two.

alazzaro avatar Aug 08 '24 05:08 alazzaro

CP2K.log The call to "dbcsr_mm_accdrv_process" appears only in the HIP run.

Schroedingers0216 avatar Aug 08 '24 05:08 Schroedingers0216

Additionally, the vmfault error is preventing me from training a suitable kernel.

Schroedingers0216 avatar Aug 08 '24 06:08 Schroedingers0216

You don't need the tuning part if you are using the new CP2K 2024.2. Could you add the DBCSR statistics to the log?

alazzaro avatar Aug 08 '24 06:08 alazzaro

Sure, here it is: DBCSR.log

Schroedingers0216 avatar Aug 08 '24 06:08 Schroedingers0216