
make test fails for tests 128 (MPI) and 195 (numerical discrepancy)


Hi,

When compiling Ginkgo and then running make test, the following two tests fail (all others work fine):

99% tests passed, 2 tests failed out of 205

Total Test time (real) =  75.06 sec

The following tests FAILED:
        128 - core/test/mpi/base/bindings (Failed)
        195 - test/matrix/dense_kernels_omp (Failed)
Errors while running CTest

The error in test 128 seems to be related to the MPI_Win_flush call, as suggested by the output below. I have not seen such an error on my system before, but it is possible that it is related to my EasyBuild installation of OpenMPI/4.1.1.

[icx-00:26873] *** An error occurred in MPI_Win_flush
[icx-00:26873] *** reported by process [633077761,0]
[icx-00:26873] *** on win rdma window 4
[icx-00:26873] *** MPI_ERR_RMA_SYNC: error executing rma sync
[icx-00:26873] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[icx-00:26873] ***    and potentially your MPI job)
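
For reference, here is a minimal standalone sketch (not Ginkgo's test code; the buffer size and values are made up) of the passive-target RMA pattern that would hit MPI_Win_flush as in the output above:

// Minimal illustrative sketch, not the Ginkgo test itself.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Expose a small buffer as an RMA window.
    std::vector<int> buffer(4, rank);
    MPI_Win win;
    MPI_Win_create(buffer.data(), buffer.size() * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // Passive-target epoch: lock the window on the target, put a value,
    // then flush to complete the put at the target before unlocking.
    const int target = 0;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    int value = rank + 1;
    MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_flush(target, win);  // the call that aborts with MPI_ERR_RMA_SYNC above
    MPI_Win_unlock(target, win);

    MPI_Win_free(&win);
    MPI_Finalize();
}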

The error in test 195 seems to be related to minor numerical discrepancies, as suggested by the output:

[ RUN      ] Dense.ComputeDotIsEquivalentToRef
/home/l00568700/tmp/bug_report/ginkgo/test/matrix/dense_kernels.cpp:1013: Failure
Relative error between ddot and dot_expected is 2.7026992280801769e-15
        which is larger than r<vtype>::value (which is 2.2204460492503131e-15)
ddot is:
        11.173298884336093
dot_expected is:
        11.173298884336063
component-wise relative error is:
        2.7026992280801765e-15

[  FAILED  ] Dense.ComputeDotIsEquivalentToRef (0 ms)
[ RUN      ] Dense.ComputeDotWithPreallocatedTmpIsEquivalentToRef
[       OK ] Dense.ComputeDotWithPreallocatedTmpIsEquivalentToRef (17 ms)
[ RUN      ] Dense.ComputeDotWithTmpIsEquivalentToRef
[       OK ] Dense.ComputeDotWithTmpIsEquivalentToRef (17 ms)
[ RUN      ] Dense.ComputeConjDotIsEquivalentToRef
/home/l00568700/tmp/bug_report/ginkgo/test/matrix/dense_kernels.cpp:1063: Failure
Relative error between ddot and dot_expected is 2.7026992280801769e-15
        which is larger than r<vtype>::value (which is 2.2204460492503131e-15)
ddot is:
        11.173298884336093
dot_expected is:
        11.173298884336063
component-wise relative error is:
        2.7026992280801765e-15

[  FAILED  ] Dense.ComputeConjDotIsEquivalentToRef (0 ms)

Regarding my setup, I was able to reproduce this issue on both Intel and ARM machines (neither has accelerators). I am using Ubuntu 18.04 LTS and GCC 10.3, and I cloned the develop branch (commit 75b2557763) on 28.06.2022. Please let me know if more info about my system is necessary.

This error is not critical for my usage of Ginkgo, but I report it here since I thought it might be useful for the developers and the community.

Best, Luka from Huawei Munich Research Center

stanisic avatar Jun 29 '22 09:06 stanisic

The test error for 195 is not an issue; we just need to relax the error bounds slightly. It was probably tripped because we have two values instead of a single one, so the error bounds are a factor of sqrt(2) too small. @pratikvn can you take a look at the MPI issue?
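
As a quick sanity check on the reported numbers (illustrative only, this is not the test code): the observed relative error of ~2.70e-15 is just above the current bound of ~2.22e-15, but below sqrt(2) times that bound (~3.14e-15), so a sqrt(2) relaxation is enough:

// Recomputes the relative error from the values printed in the test output.
#include <cmath>
#include <cstdio>

int main()
{
    const double ddot = 11.173298884336093;
    const double dot_expected = 11.173298884336063;
    const double bound = 2.2204460492503131e-15;  // r<vtype>::value from the log

    const double rel_err =
        std::abs(ddot - dot_expected) / std::abs(dot_expected);
    std::printf("relative error: %.17g\n", rel_err);                 // ~2.70e-15
    std::printf("current bound:  %.17g\n", bound);                   // ~2.22e-15
    std::printf("sqrt(2)*bound:  %.17g\n", std::sqrt(2.0) * bound);  // ~3.14e-15
}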

upsj avatar Jun 29 '22 09:06 upsj

I can reproduce the MPI issue on WSL2 with GCC 12.1 and OpenMPI 4.1.4

upsj avatar Jul 04 '22 15:07 upsj

Unfortunately, I cannot seem to reproduce this. I also installed openmpi-4.1.1 and openmpi-4.1.4 and tested it with both. All tests pass for me.

pratikvn avatar Jul 18 '22 10:07 pratikvn

@stanisic do I recall correctly that you were running in a virtual machine? Maybe that explains why WSL and your system are failing, but the error can't be reproduced elsewhere?

upsj avatar Jul 18 '22 15:07 upsj

The numerical issue was fixed by #1083

upsj avatar Jul 19 '22 07:07 upsj

No, I am not using a virtual machine. I was able to reproduce this issue on both ARM and Intel machines running Ubuntu 18.04 LTS with GCC 10.3. My OpenMPI 4.1.1 comes from EasyBuild.

stanisic avatar Jul 19 '22 07:07 stanisic

I see, thanks. If possible, could you share your EasyBuild config so that I can try to reproduce the problem?

pratikvn avatar Jul 19 '22 08:07 pratikvn

We forked from the develop branch of EasyBuild around one year ago. Compared to today's config of the same package, I can see that only two CUDA-related patches have since been added, which should not have any impact on our setup, as we do not use accelerators on these machines. Here is the config used on our machines:

name = 'OpenMPI'
version = '4.1.1'

homepage = 'https://www.open-mpi.org/'
description = """The Open MPI Project is an open source MPI-3 implementation."""

toolchain = {'name': 'GCC', 'version': '10.3.0'}

source_urls = ['https://www.open-mpi.org/software/ompi/v%(version_major_minor)s/downloads']
sources = [SOURCELOWER_TAR_BZ2]
patches = [
    'OpenMPI-4.1.1_fix-bufferoverflow-in-common_ofi.patch',
    'OpenMPI-4.0.6_remove-pmix-check-in-pmi-switch.patch',
    'OpenMPI-4.1.0-1-pml-ucx-datatype-memleak.patch',
]
checksums = [
    'e24f7a778bd11a71ad0c14587a7f5b00e68a71aa5623e2157bafee3d44c07cda',  # openmpi-4.1.1.tar.bz2
    # OpenMPI-4.1.1_fix-bufferoverflow-in-common_ofi.patch
    'a189d834506f3d7c31eda6aa184598a3631ea24a94bc551d5ed1f053772ca49e',
    # OpenMPI-4.0.6_remove-pmix-check-in-pmi-switch.patch
    '8acee6c9b2b4bf12873a39b85a58ca669de78e90d26186e52f221bb4853abc4d',
    # OpenMPI-4.1.0-1-pml-ucx-datatype-memleak.patch
    'a94a74b174ce783328abfd3656ff5196b89ef4c819fe4c8b8a0f1277123e76ea',
]

builddependencies = [
    ('pkg-config', '0.29.2'),
]

dependencies = [
    ('zlib', '1.2.11'),
    ('hwloc', '2.4.1'),
    ('libevent', '2.1.12'),
    ('UCX', '1.10.0'),
    ('libfabric', '1.12.1'),
    ('PMIx', '3.2.3'),
]

# disable MPI1 compatibility for now, see what breaks...
# configopts = '--enable-mpi1-compatibility '

# to enable SLURM integration (site-specific)
# configopts += '--with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr'

moduleclass = 'mpi'

stanisic avatar Jul 25 '22 07:07 stanisic