quda
error during autotuning of blocksolver
Zech Gletzer, Carleton, and I have all run into an error when trying to use the block solver with MILC. This has been reproduced on Cori GPU and Summit.
Here is the error message when running a 32^3 x 48 configuration on a single GPU:
ERROR: Failed to clear error state an illegal memory access was encountered
(rank 0, host cgpu03, /global/homes/s/steven/cori/gpu/quda/lib/tune.cpp:770 in tuneLaunch())
last kernel called was (name=N4quda15CopyColorSpinorILi1ENS_18CopyColorSpinorArgIffLi1ELi3ENS_11colorspinor11FloatNOrderIfLi1ELi3ELi2ELb0ELb0EEES4_EEEE,volume=16x32x32x48x1,aux=out_stride=1572864,in_stride=1572864)
Here is the error when running on 8 GPUs of a single Cori node:
ERROR: Failed to clear error state an illegal memory access was encountered
(rank 0, host cgpu07, /global/homes/s/steven/cori/gpu/quda/lib/tune.cpp:770 in tuneLaunch())
last kernel called was (name=N4quda15CopyColorSpinorILi1ENS_18CopyColorSpinorArgIffLi1ELi3ENS_11colorspinor11FloatNOrderIfLi1ELi3ELi2ELb0ELb0EEES4_EEEE,volume=16x16x16x24x1,aux=out_stride=196608,in_stride=196608)
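For context on what that message means (this is a minimal sketch of the CUDA behaviour involved, not QUDA's actual tune.cpp code): an illegal memory access is a "sticky" CUDA error. Once a kernel triggers one, cudaGetLastError() can no longer reset the error state, so when the autotuner tries to clear the error after a trial launch it fails with the message above.

```cpp
// Minimal CUDA sketch (not QUDA's tune.cpp) of the "sticky" error behaviour:
// an illegal memory access poisons the CUDA context, so cudaGetLastError()
// cannot reset the error state the way it can for recoverable launch errors.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bad_kernel(int *p) { p[0] = 42; } // nullptr write -> illegal access

int main() {
  bad_kernel<<<1, 1>>>(nullptr);         // launch with an invalid pointer
  cudaDeviceSynchronize();               // the asynchronous fault surfaces here

  cudaError_t err = cudaGetLastError();  // reports cudaErrorIllegalAddress ...
  printf("first check : %s\n", cudaGetErrorString(err));

  err = cudaGetLastError();              // ... and cannot be cleared: still an error
  printf("second check: %s\n", cudaGetErrorString(err));
  return 0;
}
```

Both checks print "an illegal memory access was encountered", the same wording QUDA relays. In practice the fault comes from the "last kernel called" named in the message (or a kernel launched even earlier), not from the autotuner itself; running the job under cuda-memcheck is a common way to localize it.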
Here are some details from Cori GPU. Modules loaded for compilation:

Currently Loaded Modulefiles:
1) esslurm
2) gcc/8.3.0
3) jdk/1.8.0_202
4) cuda/10.2.89
5) mvapich2/2.3.2
6) cmake/3.14.4
Details of the QUDA version:
QUDA 1.0.0 (git v0.9.0-2577-g40bd6532f-sm_70)
CUDA Driver version = 10020
CUDA Runtime version = 10020
Found device 0: Tesla V100-SXM2-16GB
Found device 1: Tesla V100-SXM2-16GB
Found device 2: Tesla V100-SXM2-16GB
Found device 3: Tesla V100-SXM2-16GB
Found device 4: Tesla V100-SXM2-16GB
Found device 5: Tesla V100-SXM2-16GB
Found device 6: Tesla V100-SXM2-16GB
Found device 7: Tesla V100-SXM2-16GB
Using device 0: Tesla V100-SXM2-16GB
MILC version 7.8.1 on the develop branch. Here are details from git log:

commit 43a239f8cef0d0f5d3af4c6318233cee917e8e07 (HEAD -> develop, origin/develop)
Author: Carleton DeTar [email protected]
Date: Sun Mar 8 22:41:52 2020 -0400

    Move definitions of get_last_fn and set_last_fn to imp_ferm_links.h
I am attaching a tar file that contains the Makefile and Make_template from milc_qcd/ks_spectrum. The executable was ks_spectrum_hisq_multi. The tar file also contains input_348, out.pretune, and tunecache_error.tsv.
Steve
Thanks for the report. The block solver has seen little attention lately. After the Dslash rewrite, the initial implementation of the multiple-RHS Dslash was dropped and the block solver itself was never properly updated. So we still have the old branch `feature/blocksolver`, which was abandoned about 2 years ago. It might or might not still work, but it certainly misses out on all the goodness that has gone into QUDA over the last 2 years. We won't apply any fixes to that branch.
As a quicker fix, we are looking into bringing the algorithm up to the current develop branch. I can't give you an ETA for this yet, as I need to check how much work is required. If someone in USQCD is willing to put some work into it, we are happy to support that.
However, this will not bring back a proper multi-RHS Dslash. That has been on our list for quite some time and actually has high priority as one of the top features for QUDA 2.0. That said, work has not started on it yet, and we need to think carefully about the design.
As always with QUDA, we don't usually prioritize work that has not been asked for, or features we are not aware people are using (as was the case for the block solver), so it never made it to the top of our list. Is there a plan to get a block solver into MILC production runs?
So we'll keep you posted on the immediate fix as well as on the proper re-integration of the multi-RHS Dslash.
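To make the multi-RHS idea concrete, here is a rough sketch of the bandwidth argument. The types and functions below are hypothetical placeholders, not QUDA's API: the point is only that a batched Dslash can read the gauge field once per site for all right-hand sides, whereas looping over single-RHS applications reads it once per site per right-hand side.

```cpp
// Hypothetical sketch only -- these names are not QUDA's API. It illustrates
// the data-reuse argument for a batched (multi-RHS) Dslash by counting how
// often the "gauge field" is read.
#include <cstdio>
#include <vector>

using Spinor = std::vector<float>; // stand-in for a color-spinor field
using Gauge  = std::vector<float>; // stand-in for the gauge field

static long gauge_loads = 0;       // crude counter of gauge-field reads

// Single-RHS "Dslash": touches the gauge field once per site, per call.
void dslash(Spinor &out, const Gauge &U, const Spinor &in) {
  for (size_t x = 0; x < out.size(); x++) {
    gauge_loads++;                 // one gauge load per site, per RHS
    out[x] = U[x] * in[x];         // placeholder for the real stencil
  }
}

// Batched (multi-RHS) "Dslash": loads the gauge field once per site and
// applies it to every RHS, amortizing the gauge bandwidth over n vectors.
void dslash_mrhs(std::vector<Spinor> &out, const Gauge &U, const std::vector<Spinor> &in) {
  for (size_t x = 0; x < U.size(); x++) {
    gauge_loads++;                 // one gauge load per site, for all RHS
    for (size_t i = 0; i < in.size(); i++) out[i][x] = U[x] * in[i][x];
  }
}

int main() {
  const size_t vol = 1024, nrhs = 8;
  Gauge U(vol, 1.0f);
  std::vector<Spinor> in(nrhs, Spinor(vol, 1.0f)), out(nrhs, Spinor(vol, 0.0f));

  gauge_loads = 0;
  for (size_t i = 0; i < nrhs; i++) dslash(out[i], U, in[i]);
  printf("looped single-RHS : %ld gauge loads\n", gauge_loads);

  gauge_loads = 0;
  dslash_mrhs(out, U, in);
  printf("batched multi-RHS : %ld gauge loads\n", gauge_loads);
  return 0;
}
```

With 1024 sites and 8 right-hand sides the loop counts 8192 gauge "loads" against 1024 for the batched version. The real kernel and memory-traffic picture are far more involved, but this reuse is what a proper multi-RHS Dslash (and hence the block solver) is meant to exploit.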
We need to know whether the multiRHS feature will improve our ECP FOM benchmarks. If we see a significant improvement, we will certainly consider incorporating it into production running on Summit. In general we would be grateful for help in carrying out the next set of MILC Summit ECP benchmarks.
Hi Mathias,
Thanks for getting back to us, and sorry for my slow response. I want to add to what Carleton wrote. We are currently using a Grid block solver to study disconnected HVP. With a QUDA solver, we could do these calculations on GPU-powered computers. Also, we are doing lots of analysis projects with multiple sources on each configuration, not just the one for which we are using Summit. IU will soon have Big Red 200, and I am eager to use the GPUs to the fullest extent possible.
Can you give us a hint as to how much work it would take a seasoned developer to bring multi-RHS up to the new standards? Also, which file would be the most relevant?
Thanks, Steve