ginkgo
make test fails for tests 128 (MPI) and 195 (numerical discrepancy)
Hi,
When compiling Ginkgo and then running make test, the following two tests fail (all others work fine):
99% tests passed, 2 tests failed out of 205
Total Test time (real) = 75.06 sec
The following tests FAILED:
128 - core/test/mpi/base/bindings (Failed)
195 - test/matrix/dense_kernels_omp (Failed)
Errors while running CTest
The error in test 128 seems to be related to the MPI_Win_flush call, as the output below suggests. I have not seen such an error on my system before, but it is possible that this is related to my EasyBuild installation of OpenMPI/4.1.1.
[icx-00:26873] *** An error occurred in MPI_Win_flush
[icx-00:26873] *** reported by process [633077761,0]
[icx-00:26873] *** on win rdma window 4
[icx-00:26873] *** MPI_ERR_RMA_SYNC: error executing rma sync
[icx-00:26873] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[icx-00:26873] *** and potentially your MPI job)
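For context, MPI_Win_flush is only valid inside a passive-target synchronization epoch, i.e. between MPI_Win_lock(_all) and MPI_Win_unlock(_all). A minimal sketch of that pattern, assuming the bindings test exercises one-sided puts over an RMA window (the buffer layout and target choice here are illustrative, not Ginkgo's actual test):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // each rank exposes one int through an RMA window
    int buf = rank;
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    // open a passive-target epoch on all ranks; MPI_Win_flush called
    // outside such an epoch is exactly what raises MPI_ERR_RMA_SYNC
    MPI_Win_lock_all(0, win);
    int value = 42;
    int target = (rank + 1) % size;
    MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_flush(target, win);  // completes the put at the target
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

If the flush is somehow issued outside the lock/unlock epoch, or the UCX/OpenMPI RMA backend mishandles the epoch, the MPI_ERRORS_ARE_FATAL handler aborts exactly as in the log above.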
The error in test 195 seems to be caused by minor numerical discrepancies, as the output suggests:
[ RUN ] Dense.ComputeDotIsEquivalentToRef
/home/l00568700/tmp/bug_report/ginkgo/test/matrix/dense_kernels.cpp:1013: Failure
Relative error between ddot and dot_expected is 2.7026992280801769e-15
which is larger than r<vtype>::value (which is 2.2204460492503131e-15)
ddot is:
11.173298884336093
dot_expected is:
11.173298884336063
component-wise relative error is:
2.7026992280801765e-15
[ FAILED ] Dense.ComputeDotIsEquivalentToRef (0 ms)
[ RUN ] Dense.ComputeDotWithPreallocatedTmpIsEquivalentToRef
[ OK ] Dense.ComputeDotWithPreallocatedTmpIsEquivalentToRef (17 ms)
[ RUN ] Dense.ComputeDotWithTmpIsEquivalentToRef
[ OK ] Dense.ComputeDotWithTmpIsEquivalentToRef (17 ms)
[ RUN ] Dense.ComputeConjDotIsEquivalentToRef
/home/l00568700/tmp/bug_report/ginkgo/test/matrix/dense_kernels.cpp:1063: Failure
Relative error between ddot and dot_expected is 2.7026992280801769e-15
which is larger than r<vtype>::value (which is 2.2204460492503131e-15)
ddot is:
11.173298884336093
dot_expected is:
11.173298884336063
component-wise relative error is:
2.7026992280801765e-15
[ FAILED ] Dense.ComputeConjDotIsEquivalentToRef (0 ms)
Regarding my setup: I was able to reproduce this issue on both Intel and ARM machines (neither machine has accelerators). I am using Ubuntu 18.04 LTS and GCC 10.3, and I cloned the develop branch (commit 75b2557763) on 2022-06-28. Please let me know if more information about my system is needed.
This error is not critical for my usage of Ginkgo, but I am reporting it here since I thought it might be useful to the developers and the community.
Best, Luka from Huawei Munich Research Center
The test error for 195 is not an issue; we just need to relax the error bounds slightly. It was probably tripped because we compare two values instead of a single value, so the error bound is a factor of sqrt(2) too small. @pratikvn, can you take a look at the MPI issue?
I can reproduce the MPI issue on WSL2 with GCC 12.1 and OpenMPI 4.1.4
Unfortunately, I cannot seem to reproduce this. I also installed openmpi-4.1.1 and openmpi-4.1.4 and tested it with both. All tests pass for me.
@stanisic do I recall correctly that you were running in a virtual machine? Maybe that explains why WSL and your system are failing, but the error can't be reproduced elsewhere?
The numerical issue was fixed by #1083
No, I am not using a virtual machine. I was able to reproduce this issue on both ARM and Intel machines running Ubuntu 18.04 LTS and GCC 10.3. My OpenMPI 4.1.1 comes from EasyBuild.
I see, thanks. If possible, could you share your EasyBuild config so that I can try to reproduce the problem?
We forked from the develop branch of EasyBuild around one year ago. Compared to today's config of the same package, only two CUDA-related patches were added, which should not affect our setup since we do not use accelerators on these machines. Here is the config used on our machines:
name = 'OpenMPI'
version = '4.1.1'
homepage = 'https://www.open-mpi.org/'
description = """The Open MPI Project is an open source MPI-3 implementation."""
toolchain = {'name': 'GCC', 'version': '10.3.0'}
source_urls = ['https://www.open-mpi.org/software/ompi/v%(version_major_minor)s/downloads']
sources = [SOURCELOWER_TAR_BZ2]
patches = [
    'OpenMPI-4.1.1_fix-bufferoverflow-in-common_ofi.patch',
    'OpenMPI-4.0.6_remove-pmix-check-in-pmi-switch.patch',
    'OpenMPI-4.1.0-1-pml-ucx-datatype-memleak.patch',
]
checksums = [
    'e24f7a778bd11a71ad0c14587a7f5b00e68a71aa5623e2157bafee3d44c07cda',  # openmpi-4.1.1.tar.bz2
    # OpenMPI-4.1.1_fix-bufferoverflow-in-common_ofi.patch
    'a189d834506f3d7c31eda6aa184598a3631ea24a94bc551d5ed1f053772ca49e',
    # OpenMPI-4.0.6_remove-pmix-check-in-pmi-switch.patch
    '8acee6c9b2b4bf12873a39b85a58ca669de78e90d26186e52f221bb4853abc4d',
    # OpenMPI-4.1.0-1-pml-ucx-datatype-memleak.patch
    'a94a74b174ce783328abfd3656ff5196b89ef4c819fe4c8b8a0f1277123e76ea',
]
builddependencies = [
    ('pkg-config', '0.29.2'),
]
dependencies = [
    ('zlib', '1.2.11'),
    ('hwloc', '2.4.1'),
    ('libevent', '2.1.12'),
    ('UCX', '1.10.0'),
    ('libfabric', '1.12.1'),
    ('PMIx', '3.2.3'),
]
# disable MPI1 compatibility for now, see what breaks...
# configopts = '--enable-mpi1-compatibility '
# to enable SLURM integration (site-specific)
# configopts += '--with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr'
moduleclass = 'mpi'