ncdump_tst_netcdf4_4 fails on x86

Open hjaekel opened this issue 2 years ago • 9 comments

I'm trying to package netcdf-c 4.9.1 on Alpine Linux Edge. All tests pass on every platform except for one test on x86: ncdump_tst_netcdf4_4.

43/249 Test  #51: ncdump_tst_ncgen4 .....................   Passed   15.28 sec
        Start  52: ncdump_tst_netcdf4_4
 44/249 Test  #52: ncdump_tst_netcdf4_4 ..................***Failed    0.06 sec
*** Running extra netcdf-4 tests.
*** running tst_string_data to create test files...
*** Testing strings.
*** creating strings test file tst_string_data.nc...ok.
*** Tests successful!
*** dumping tst_string_data.nc to tst_string_data.cdl...
*** comparing tst_string_data.cdl with ref_tst_string_data.cdl...
*** testing reference file ref_tst_compounds2.nc...
*** testing reference file ref_tst_compounds3.nc...
*** testing reference file ref_tst_compounds4.nc...
--- tst_ncf213.tmp
+++ ref_tst_ncf213.tmp
@@ -34,7 +34,7 @@
 	obs_t var5(dim1) ;
 		var5:_Storage = "chunked" ;
 		var5:_ChunkSizes = 6 ;
-		var5:_Filter = "3|2,36|1,2" ;
+		var5:_Filter = "3|2,40|1,2" ;
 		var5:_NoFill = "true" ;
 
 // global attributes:
        Start  53: ncdump_tst_nccopy4
 45/249 Test  #53: ncdump_tst_nccopy4 ....................   Passed    1.31 sec

I use the following statements to compile and execute the tests:

local _enable_cdf5=ON
case "$CARCH" in
	x86|armhf|armv7) _enable_cdf5=OFF ;;
esac
cmake -B build -G Ninja \
	-DCMAKE_INSTALL_PREFIX=/usr \
	-DCMAKE_INSTALL_LIBDIR=lib \
	-DCMAKE_BUILD_TYPE=None \
	-DENABLE_CDF5=$_enable_cdf5 \
	-DENABLE_DAP_LONG_TESTS=ON \
	-DENABLE_EXAMPLE_TESTS=ON \
	-DENABLE_EXTRA_TESTS=ON \
	-DENABLE_FAILING_TESTS=ON \
	-DENABLE_FILTER_TESTING=ON \
	-DENABLE_LARGE_FILE_TESTS=ON
cmake --build build
cd build
CTEST_OUTPUT_ON_FAILURE=1 ctest -E "nc_test4_tst_large2"

hjaekel, Feb 12 '23 15:02

This is a known problem. It has to do with running tests in parallel during make check. There is a race condition that we have not yet found. If you re-run make check, the odds are good that it will work.

DennisHeimbigner, Feb 12 '23 19:02

Well, this can be fixed by adding the right line to Makefile.am. See https://stackoverflow.com/questions/17172310/make-disable-parallel-building-in-subdirectory-for-single-target-only.

edwardhartnett, Feb 13 '23 07:02

We use Ninja, so I guess a change to Makefile.am will have no effect. I tried ctest -j 1 and got the same test failure as before. Finally, I switched to

CTEST_OUTPUT_ON_FAILURE=1 ctest -R "ncdump_tst_netcdf4_4"
CTEST_OUTPUT_ON_FAILURE=1 ctest -E "ncdump_tst_netcdf4_4 nc_test4_tst_large2"

This should prevent race conditions from occurring. However, the test still fails on x86. This is reproducible and only happens on x86. On all other platforms (aarch64, armhf, armv7, ppc64le and x86_64) the test runs successfully. You can see the CI pipelines here: https://gitlab.alpinelinux.org/hjaekel/aports/-/pipelines/153046

hjaekel, Feb 13 '23 10:02
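
A side note on the exclusion pattern: `ctest -E` takes a single regular expression rather than a list of names, so a space inside the quotes is matched literally. The sketch below approximates that behaviour with Python's `re` module (CTest has its own regex engine, so this is only an illustration). The helper `selected_for_exclusion` is hypothetical; the test names are taken from this thread.

```python
import re

# Test names that appear in this thread.
TESTS = ["ncdump_tst_netcdf4_4", "nc_test4_tst_large2", "ncdump_tst_nccopy4"]


def selected_for_exclusion(pattern, names=TESTS):
    """Return the names a regex matches, mimicking how `ctest -E` picks tests to skip."""
    return [name for name in names if re.search(pattern, name)]


if __name__ == "__main__":
    # A space is an ordinary character, so no single test name matches.
    print(selected_for_exclusion("ncdump_tst_netcdf4_4 nc_test4_tst_large2"))
    # Alternation with `|` matches either name.
    print(selected_for_exclusion("ncdump_tst_netcdf4_4|nc_test4_tst_large2"))
```

Under this reading, excluding both tests in one run would be written as `ctest -E "ncdump_tst_netcdf4_4|nc_test4_tst_large2"`.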

I have the same FAIL in check (46 PASS and 1 FAIL). I ran the script netcdf-c/ncdump/tst_netcdf4_4.sh on its own, and I think the problem is related to ncgen. The filter applied to variable 5 changes: var5:_Filter = "3|2,40|1,2" --> var5:_Filter = "3|2,36|1,2". Could you help me? Thanks.

mikpos-84, Sep 23 '24 13:09
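
To see exactly where the two `_Filter` values differ, they can be split into a filter pipeline. This is a minimal sketch, assuming the attribute follows the `id,param,...|id,param,...` form shown in the diff; the helpers `parse_filter_attr` and `diff_pipelines` are hypothetical and not part of netcdf-c. The names in `HDF5_FILTER_NAMES` are the standard HDF5 filter IDs 1 to 3.

```python
# Standard HDF5 filter IDs for the three filters that appear in the attribute.
HDF5_FILTER_NAMES = {1: "deflate", 2: "shuffle", 3: "fletcher32"}


def parse_filter_attr(value):
    """Split a _Filter string such as "3|2,36|1,2" into (filter_id, params) pairs."""
    pipeline = []
    for spec in value.split("|"):
        numbers = [int(token) for token in spec.split(",")]
        pipeline.append((numbers[0], numbers[1:]))
    return pipeline


def diff_pipelines(actual, expected):
    """Return (position, actual_entry, expected_entry) for every differing filter."""
    pairs = zip(parse_filter_attr(actual), parse_filter_attr(expected))
    return [(i, a, e) for i, (a, e) in enumerate(pairs) if a != e]


if __name__ == "__main__":
    for pos, got, want in diff_pipelines("3|2,36|1,2", "3|2,40|1,2"):
        name = HDF5_FILTER_NAMES.get(got[0], "unknown")
        print(f"filter #{pos} ({name}): got params {got[1]}, reference has {want[1]}")
```

Only the parameter attached to filter 2 differs between the two strings; the filter IDs and the remaining parameter are identical.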

After a quick look, I think this may be a compound type packing problem. Specifically, the middle filter, 2, refers to the shuffle filter. It technically has no argument, but apparently the size of the compound type is being included as an argument for the filter. So in this case, the baseline file assumes that the compound type size is 40, but on the platform/compiler you are using, it has a size of 36. I will investigate, if I can, whether my speculation is correct. Do you know what compiler and compiler version you are using?

DennisHeimbigner, Sep 23 '24 17:09
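
To make the 40 versus 36 hypothesis concrete, here is a minimal sketch of how one compound type can end up with two sizes. It assumes a hypothetical field list (not confirmed to be the `obs_t` used by the test) and models C struct layout with a cap on member alignment: on x86_64, 8-byte members are aligned to 8 bytes, while the 32-bit x86 System V ABI aligns `double` and `long long` to 4 bytes inside structs. The helper `struct_size` is illustrative and not part of netcdf-c.

```python
# Hypothetical compound type: (field name, size in bytes).
# ASSUMPTION: this field list is for illustration only and may not match obs_t.
FIELDS = [
    ("day", 1),             # signed char
    ("elev", 2),            # short
    ("count", 4),           # int
    ("relhum", 4),          # float
    ("time", 8),            # double
    ("category", 1),        # unsigned char
    ("id", 2),              # unsigned short
    ("particularity", 4),   # unsigned int
    ("attention_span", 8),  # long long
]


def struct_size(fields, max_align):
    """Size of a C struct whose members are aligned to min(member size, max_align)."""
    offset = 0
    struct_align = 1
    for _name, size in fields:
        align = min(size, max_align)
        # Round the offset up to the member's alignment, then place the member.
        offset = (offset + align - 1) // align * align
        offset += size
        struct_align = max(struct_align, align)
    # Tail padding: the total size is a multiple of the strictest member alignment.
    return (offset + struct_align - 1) // struct_align * struct_align


if __name__ == "__main__":
    print("8-byte alignment cap (x86_64-like):", struct_size(FIELDS, 8))
    print("4-byte alignment cap (32-bit x86-like):", struct_size(FIELDS, 4))
```

With this field list the model gives 40 and 36, the same two values that appear in the diff, although that agreement alone does not prove this is the cause.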

Thanks a lot for the support. The compiler version is GCC 4.4.7 20120313 (Red Hat 4.4.7-18).

mikpos-84, Sep 23 '24 18:09

That is a pretty old version of gcc, I think. I am not sure we can fix the problem if it is a struct type packing issue. Any chance you could test against a much more recent version of gcc? Perhaps you have a similar platform with a newer version of gcc?

DennisHeimbigner, Sep 23 '24 18:09

I know that the GCC version is very old, but I can't update it at the moment; it's a constraint. From what I understand, since the failure also occurs when I run only the test netcdf-c/ncdump/tst_netcdf4_4.sh, I can rule out a race condition caused by parallel test execution and attribute the failure to the compiler. Is this correct? This would mean that a library compiled this way is to be considered "corrupted" and could cause problems in use. Thanks.

mikpos-84, Sep 24 '24 08:09

I have not investigated thoroughly, but yes, in my opinion, the failure is due to a change in the gcc compiler. Presumably as more people use that compiler version, we will start to see reports of similar failures.

DennisHeimbigner, Sep 24 '24 14:09