scalapack
Skip update part in pdtrord for current WINDOW if NWIN = 0
Try simple solution to solve #69.
Is this a proper solution? Isn't the issue that the parallelization triggers NWIN=0 at all?
Thanks for asking. The answer has 2 parts:
A. This is a proper solution for the test cases. Some other reasons for this fix:
- The update part does not make sense if NWIN=0.
- I verified that the tests result in FLOPS = 0 when NWIN=0. According to the algorithm logic, the update should be skipped when FLOPS = 0.
I could have proposed a fix like the following:
    IF( FLOPS.NE.0 ) THEN
       IF( ( FLOPS*100 ) / ( 2*NWIN*NWIN ) .GE. MMULT ) THEN
          [...]
       ELSE
          [...]
       END IF
    END IF
But this solution, in my opinion, would still be a problem if, for some reason, NWIN=0 and FLOPS is nonzero. My solution is therefore based on the first reason above.
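To make the difference between the two guards concrete, here is a minimal standalone sketch. It is not code from pdtrord.f: the function name choose_update, the threshold value 50 for MMULT, and the sample FLOPS/NWIN values are all assumptions made for illustration, and integer arithmetic is assumed. It only shows that testing NWIN first keeps 2*NWIN*NWIN from ever being used as a zero divisor, whatever FLOPS happens to be.

```fortran
! Standalone illustration only; choose_update and the sample values are
! hypothetical and are not taken from pdtrord.f.
program nwin_guard_demo
  implicit none
  integer, parameter :: mmult = 50   ! assumed threshold, for illustration only

  print *, choose_update(0, 0, mmult)      ! NWIN = 0: update skipped
  print *, choose_update(7, 0, mmult)      ! NWIN = 0, FLOPS nonzero: still skipped
  print *, choose_update(400, 10, mmult)   ! 40000 / 200 = 200, not below 50
  print *, choose_update(10, 10, mmult)    ! 1000 / 200 = 5, below 50

contains

  ! Returns 0 when the update is skipped, 1 for the first branch and
  ! 2 for the ELSE branch of the test quoted above.
  integer function choose_update(flops, nwin, mmult_in) result(branch)
    integer, intent(in) :: flops, nwin, mmult_in
    if (nwin == 0) then
      branch = 0   ! guard on NWIN, so the division below is never reached
    else if ((flops*100) / (2*nwin*nwin) >= mmult_in) then
      branch = 1
    else
      branch = 2
    end if
  end function choose_update

end program nwin_guard_demo
```

The guard is written as its own branch rather than as `NWIN.NE.0 .AND. ...` because Fortran does not guarantee short-circuit evaluation of logical operands, so the division could still be evaluated in a combined condition.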
B. This is not a proper solution for the possible communication issue we have. Here is a short report of what I observed.
- My machine: Ubuntu 20.04, 8 cores (16 threads), using OpenMPI and GNU compilers. I can give more information if needed.
- NWIN is always different from zero and the tests pass when running xdhseqr with 1 or more than 8 MPI processes. I obtain NWIN=0 when using 8 MPI processes.
- I obtain NWIN=0 when using 2 - 8 MPI processes. It does not matter if I use the bind-to, oversubscribe or map-by flags. At least I couldn't find anything else by playing with those flags.
- When NWIN=0, there are 2 scenarios:
  i. NMWIN2 = 2. In this case, LILO = LIHI+1 and I = LILO.
  ii. NMWIN2 = 1. In this case, LILO < LIHI and I is trash, sometimes even a negative number.
We are still investigating this problem.
I would like to add some more information about this bug. Before I saw this post, I used multiple compiler/MPI combinations to compile and test scalapack-2.2.0. I tried gcc-8.2.0+openmpi-4.0.0, gcc-10.1.0+openmpi-4.0.5, gcc-11.2.0+openmpi-4.1.2 and intel_oneapi-2021. The gcc-8.2.0 and intel-2021 builds passed all tests, while builds compiled with gcc-10 and gcc-11 fail Test #69: xshseqr and Test #70: xdhseqr. These two tests failed only for mpirun with np = 2-8, and they failed no matter whether gcc-10, gcc-11 or gcc-8 was used at run time. If you compile with gcc-8 and test with gcc-10, it works fine.
The solution given by [weslleyspereira] is effective, but remember that you need to edit both "pstrord.f" and "pdtrord.f" in order to pass those two tests, respectively. It was somewhat surprising to me that the failed tests result from a program bug rather than from compiler or MPI settings, and I spent a lot of time checking and comparing the settings and object files on my system. So I am writing this post in the hope that people in the same situation can find this solution earlier and save some time :)