scalapack icon indicating copy to clipboard operation
scalapack copied to clipboard

Skip update part in pdtrord for current WINDOW if NWIN = 0

Open weslleyspereira opened this issue 2 years ago • 4 comments

Try simple solution to solve #69.

weslleyspereira avatar Jul 26 '22 23:07 weslleyspereira

Is this a proper solution? Isn't the issue that the parallelization triggers NWIN=0 at all?

drhpc avatar Jul 27 '22 08:07 drhpc

Is this a proper solution? Isn't the issue that the parallelization triggers NWIN=0 at all?

Thanks for asking. The answer has 2 parts:

A. This is a proper solution for the tests cases. Some other reasons for this fix:

  1. The update part does not make sense if NWIN=0.
  2. I verified that the tests result in FLOPS = 0 when NWIN=0. Accordingly to the algorithm logic, the update should be neglected when FLOPS = 0. I could have proposed a fix like the following:
               IF( FLOPS.NE.0 ) THEN
                  IF( ( FLOPS*100 ) / ( 2*NWIN*NWIN ) .GE. MMULT ) THEN
                     [...]
                  ELSE
                     [...]
                  END IF
               END IF

But this solution, in my opinion, would also be a problem if for some reason NWIN=0 and FLOPS is non zero. My solution is based on reasoning 1.

B. This is not a proper solution for the possible communication issue we have. I will give a short report for what I obtained.

  • My machine: Ubuntu 20.04, 8 cores (16 threads), using OpenMPI and GNU compilers. I can give more information if needed.
  1. NWIN is always different from zero and the tests pass when running xdhseqr with 1 or more than 8 MPI processes. I obtain NWIN=0 when using 8 MPI processes.

  2. I obtain NWIN=0 when using 2 - 8 MPI processes. I does not matter if I use bind-to, oversubscribe or map-by flags. At least I couldn't find anything else by playing with those flags.

  3. When NWIN=0, there are 2 scenarios: i. NMWIN2 = 2. In this case, LILO = LIHI+1 and I = LILO. ii. NMWIN2 = 1. In this case, LILO < LIHI and I is trash, sometimes even a negative number.

We are still investigating this problem.

weslleyspereira avatar Jul 27 '22 14:07 weslleyspereira

I would like to add a few more information to this bug. Before I see this post, I used multiple compiler-mpi combination to compile and test scalapack-2.2.0. I tried gcc-8.2.0+openmpi-4.0.0, gcc-10.1.0+openmpi-4.0.5, gcc-11.2.0+openmpi-4.1.2 and intel_oneapi-2021. The gcc-8.2.0 and intel-2021 passed all test; while using the compiler gcc-10 and gcc-11 to compile will fail for Test #69 xshseqr and Test #70: xdhseqr. These two tests failed only for mpirun with np = 2-8; the tests failed no matter you are using gcc-10 or gcc-11 or gcc-8 during run-time. If you compile with gcc-8 and test with gcc-10, it goes fine.

The solution given by [weslleyspereira] is effective, but remember you need to edit both "pstrord.f" and "pdtrord.f" in order to pass those two tests, respectively. It is somewhat surprised to me that the failed test results from a program bug instead of settings from the compiler or mpi, and I have spent a lot of time checking and comparing the settings and object files in my system. So I write this post hoping people with the same situation can find this solution earlier and save them some time :)

liyangzhong7 avatar Dec 14 '23 11:12 liyangzhong7