
ctest -j16 failed about [email protected] on centos8

Open Tom-python0121 opened this issue 1 year ago • 20 comments

Describe the bug
Hello, I've run into a problem: ctest -j16 fails for [email protected] on CentOS 8.

Steps to reproduce the issue

OS: centos8_x86_64
Command: ctest -j16
Errors:
          1 - cmake-consumer (Failed)
          3 - igrep (Failed)
          5 - oiiotool (Failed)
          8 - oiiotool-copy (Failed)
         12 - oiiotool-subimage (Failed)
         13 - oiiotool-text (Failed)
         19 - maketx (Failed)
         20 - oiiotool-maketx (Failed)
         39 - texture-levels-stochaniso (Failed)
         40 - texture-levels-stochmip (Failed)
         49 - texture-udim (Failed)
         50 - texture-udim2 (Failed)
         73 - texture-levels-stochaniso.batch (Failed)
         74 - texture-levels-stochmip.batch (Failed)
         83 - texture-udim.batch (Failed)
         84 - texture-udim2.batch (Failed)
         93 - oiiotool-color-broken (Failed)
         94 - heif-broken (Failed)
         95 - openvdb-broken (Failed)
         96 - texture-texture3d-broken (Failed)
         97 - openvdb.batch-broken (Failed)
         98 - texture-texture3d.batch-broken (Failed)
         99 - ptex-broken (Failed)
        113 - unit_timer (Failed)

OS: centos8_aarch64
Command: ctest -j16
Errors:
          1 - cmake-consumer (Failed)
          3 - igrep (Failed)
         13 - oiiotool-text (Failed)
         19 - maketx (Failed)
         20 - oiiotool-maketx (Failed)
         32 - texture-crop (Failed)
         34 - texture-half (Failed)
         35 - texture-uint16 (Failed)
         37 - texture-interp-bilinear (Failed)
         38 - texture-interp-closest (Failed)
         39 - texture-levels-stochaniso (Failed)
         40 - texture-levels-stochmip (Failed)
         42 - texture-mip-onelevel (Failed)
         44 - texture-mip-stochastictrilinear (Failed)
         45 - texture-mip-stochasticaniso (Failed)
         51 - texture-uint8 (Failed)
         55 - texture-skinny (Failed)
         73 - texture-levels-stochaniso.batch (Failed)
         74 - texture-levels-stochmip.batch (Failed)
         93 - oiiotool-color-broken (Failed)
         94 - heif-broken (Failed)
         95 - openvdb-broken (Failed)
         96 - texture-texture3d-broken (Failed)
         97 - openvdb.batch-broken (Failed)
         98 - texture-texture3d.batch-broken (Failed)
         99 - ptex-broken (Failed)
        115 - unit_simd (Failed)

And I found out:

texture-crop|texture-half|texture-uint16|texture-interp-bilinear|texture-interp-closest|texture-mip-onelevel|texture-mip-stochastictrilinear|texture-mip-stochasticaniso|texture-uint8|texture-skinny|unit_simd

These test cases pass on Intel but fail on ARM. Do some switches need to be set per platform when running the tests? Can you help me figure out how to solve this?

Tom-python0121 avatar Jun 19 '23 08:06 Tom-python0121

I don't think "do not support ARM" is quite right, but it may be that we have to update the reference output that it uses to verify the tests. Many of the tests are very challenging to have bit-for-bit identical output across different hardware platforms. The testsuite can accommodate this by having multiple valid reference outputs for each test, any one of which matching will let the test pass.

After running the full testsuite, can you zip up the entire contents of build/testsuite and put it someplace I can download? That will contain the non-passing output of all the failed tests, and I can look at them by hand and see which ones simply need additional valid reference output and which are failing for other reasons.

lgritz avatar Jun 19 '23 18:06 lgritz

Also, another thing to try: can you make sure that this environment variable is set:

PYTHONPATH=$PWD/build/lib/python/site-packages

and rerun the tests? (This assumes that PWD is where the oiio checkout lives.) Then any tests that continue to fail, zip up as I described before.

lgritz avatar Jun 19 '23 18:06 lgritz

@lgritz I have compressed the files in build/testsuite. Because the archive is too large, the upload fails with the error "File size too big: 25 MB are allowed, 316 MB were attempted to upload", so I can only attach the archives for a few of the failed tests.

testsuite_aarch64: cmake-consumer.tar.gz oiiotool-copy.tar.gz

x86-64: cmake-consumer.tar.gz oiiotool-copy.tar.gz

Tom-python0121 avatar Jun 20 '23 09:06 Tom-python0121

@lgritz Can you tell me how to run the tests, what parameters need to be added, and how to set up the Python dependency? The problem is still reported.

Tom-python0121 avatar Jun 20 '23 09:06 Tom-python0121

One thing I can tell from the oiiotool-copy results is that your build of OpenImageIO was compiled without support for Freetype (perhaps that optional dependency was not found at build time, maybe you need to install it before building oiio), so any tests that involve rendering fonts into images are going to fail. That alone may explain most or possibly all of the failures.

Try tackling that first, and see how many additional tests it allows to pass.

lgritz avatar Jun 20 '23 17:06 lgritz

@lgritz Thank you for your help. I have fixed the error reported by oiiotool-copy. Could you tell me the cause of the errors reported by the texture-* test cases? The outputs are attached: texture-udim.tar.gz texture-udim2.tar.gz

Tom-python0121 avatar Jun 21 '23 09:06 Tom-python0121

I'm not sure what's going wrong from those files. Can you try something else?

Please run ctest with these arguments:

cd build
ctest --force-new-ctest-process --output-on-failure  -R igrep >& test.log
tar czvf test.tgz test.log testsuite/igrep/*.*

And send that? igrep seems like a simple test, so if we understand why it fails, it may give us some insight.

lgritz avatar Jun 25 '23 01:06 lgritz

Sorry, I got the third line wrong. It's edited now, but if you're reading this in email, it should have been

tar czvf test.tgz test.log testsuite/igrep/*.*

lgritz avatar Jun 25 '23 01:06 lgritz

@lgritz Following your steps, the output looks like this:

[root@localhost build]#ctest --force-new-ctest-process --output-on-failure  -R igrep
Test project /home/stage/root/spack-stage-openimageio-2.4.12.0-ielcu5qk3c4fvsfwkedokackkejg672j/spack-src/build
    Start 3: igrep
1/1 Test #3: igrep ............................   Passed    0.16 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   0.17 sec

I will use this approach to check the other failing test cases.

Tom-python0121 avatar Jun 25 '23 01:06 Tom-python0121

Thanks. This output:

igrep: ../oiio-images/tahoe-gps.jpg: No such file or directory

makes me think that you do not have the oiio-images project checked out in the build/testsuite area, which contains many files that tests depend on.

I'm not quite sure how this is possible, because https://github.com/OpenImageIO/oiio/blob/master/src/cmake/testing.cmake#L406 should automatically download that when you first do the cmake configure step.

lgritz avatar Jun 25 '23 01:06 lgritz

@lgritz I'll try downloading oiio-images again to test the other errors.

Tom-python0121 avatar Jun 25 '23 01:06 Tom-python0121

@lgritz I'm now testing with the master version. The following tests still report errors; their output is attached: cmake-consumer.tar.gz oiiotool-text.tar.gz

Tom-python0121 avatar Jun 25 '23 08:06 Tom-python0121

I find that the test programs in the bin directory are different from the tests run by ctest; not every ctest test has a corresponding program in bin. Do I only need to run the test programs in the bin directory?

[root@localhost bin]#
argparse_test  compute_test     fmath_test  idiff  imagebuf_test      imageinout_test  maketx          parallel_test   span_test      strongparam_test  thread_test       typedesc_test
atomic_test    filesystem_test  hash_test   igrep  imagebufalgo_test  imagespec_test   oiiotool        paramlist_test  spin_rw_test   strutil_test      timer_test        ustring_test
color_test     filter_test      iconvert    iinfo  imagecache_test    imagespeed_test  optparser_test  simd_test       spinlock_test  testtex           type_traits_test

I found two test cases reporting errors: filesystem_test.log simd_test.log

Tom-python0121 avatar Jun 25 '23 09:06 Tom-python0121

The test programs in build/bin are low-level unit tests, mostly of utility classes, that consist largely of assertion-like tests of individual function calls. An example is typedesc_test.cpp which exercises the TypeDesc class in various ways. There tends to be one of these test programs for each important class. A tiny fraction of the overall testsuite consists of tests that directly call one of those specialized test programs.

The vast majority of the testsuite tests are python programs or invocations of oiiotool and other tools to fully exercise all the functionality of the library and utilities. An example would be testsuite/oiiotool-copy/run.py which invokes oiiotool 28 different times to test all sorts of different kinds of ways to copy files with oiiotool: can it copy and change the file format, can it copy and change the pixel data type, can it crop, can it copy just a subset of channels, can it copy just one subimage out of a file, etc. For most of these tests, it's producing images (sometimes many) or text output, and the tests pass or fail based on whether the outputs match saved outputs that we know are correct.

As far as the two in particular you mentioned:

simd_test doesn't surprise me at all that it fails on aarch64, given that it's an entirely different hardware architecture. It's possible that there are some LSB differences in the floating point math that could be responsible, and the solution is to have some of the assertions allow a little more room for the result to differ than what it expects.

For example, one of the things in that log where it failed was

/home/stage/root/spack-stage-openimageio-2.4.12.0-ielcu5qk3c4fvsfwkedokackkejg672j/spack-src/src/libutil/simd_test.cpp:1592:
FAILED: fast_log(expA) == mkvec<VEC>(fast_log(expA[0]), fast_log(expA[1]), fast_log(expA[2]), fast_log(expA[3]))
	values were '-1 0 1 4.5 -1 0 1 4.5' and '-1 0 1 4.5 -1 0 1 4.5'

And we can see that line here:

    OIIO_CHECK_SIMD_EQUAL_THRESH (fast_log(expA),
                mkvec<VEC>(fast_log(expA[0]), fast_log(expA[1]), fast_log(expA[2]), fast_log(expA[3])), 0.00001f);

It's already testing with some threshold, 0.00001, which maybe is enough on Intel but not ARM. The fact that there is a threshold there at all already (most of the tests do not have thresholds and test for exact results) indicates that this is a test where I already know that it's hard to achieve exact results from different platforms and allow for a tiny difference. Maybe it's just not enough for ARM and the threshold needs to be adjusted.

I'm not sure what's up with the filesystem_test... it might be that you are invoking it differently than it would be in CI, I'm just not sure yet.

lgritz avatar Jun 25 '23 23:06 lgritz

I should point out that the fast_log function it's testing is an approximation, so it's known to not be exactly the same as the IEEE-compliant std::log (but is much faster). It's not the least bit surprising to see it squeak past the threshold on a different hardware platform and need some adjustment of the thresholds.

lgritz avatar Jun 25 '23 23:06 lgritz

FYI: Arm and x86_64 are going to do bit identical math if you're using full precision math. The only places it would differ are the approximate instructions: rsqrt and rcp. The other fail case is if the compiler generates different math, such as with -ffast-math, in which case different compilers can generate slightly different results, and fusing such as by -ffp-contract which generates madd. I'm going to guess the slight differences in fast_log are due to the madd instructions being performed on one arch and not the other.

ThiagoIze avatar Jun 26 '23 03:06 ThiagoIze

Yeah, in the fast/approximate math functions, we use a lot of mad, knowing that it will be a true fused implementation on some platforms and a multiply followed by a separate add on others. I think it's telling that the simd test failures above were only when testing fast_exp, fast_log, and fast_pow_pos, all of which already had a tolerance in the comparison (whereas most of the testing in simd_test is exact), so even among the x86 machines, we clearly had experienced some platform-to-platform LSB differences on those particular ops.

The solution in this case is simple: just bump the tolerance a teeny bit. The trick is that without an ARM machine to test on, it's hard to know how much. (Another possible route is to really dissect these functions and figure out exactly where and why the scalar and simd implementations differ and decide if we should fix them. But considering it only affects these particular functions that are specifically advertised as being approximate, I'm not sure how much trouble it is worth to ensure it's bit identical.)

lgritz avatar Jun 26 '23 03:06 lgritz

Ha, fast_pow_pos is implemented with fast_exp and fast_log. So those are the two functions that probably have one more LSB differing than we accounted for.

lgritz avatar Jun 26 '23 03:06 lgritz

I've found that if I change how the tests are invoked, many more test cases pass.

Command: make -j126 USE_PYTHON=0 test

centos8_x86:
The following tests FAILED:
          1 - cmake-consumer (Failed)
          5 - oiiotool (Failed)
          8 - oiiotool-copy (Failed)
         12 - oiiotool-subimage (Failed)
         13 - oiiotool-text (Failed)
         19 - maketx (Failed)
         20 - oiiotool-maketx (Failed)
         49 - texture-udim (Failed)
         50 - texture-udim2 (Failed)
         83 - texture-udim.batch (Failed)
         84 - texture-udim2.batch (Failed)
Errors while running CTest
make: *** [Makefile:307: test] Error 8

centos8_aarch64:
The following tests FAILED:
          1 - cmake-consumer (Failed)
         13 - oiiotool-text (Failed)
         19 - maketx (Failed)
         20 - oiiotool-maketx (Failed)
         32 - texture-crop (Failed)
         34 - texture-half (Failed)
         35 - texture-uint16 (Failed)
         37 - texture-interp-bilinear (Failed)
         38 - texture-interp-closest (Failed)
         39 - texture-levels-stochaniso (Failed)
         40 - texture-levels-stochmip (Failed)
         42 - texture-mip-onelevel (Failed)
         44 - texture-mip-stochastictrilinear (Failed)
         45 - texture-mip-stochasticaniso (Failed)
         51 - texture-uint8 (Failed)
         55 - texture-skinny (Failed)
        108 - unit_simd (Failed)
Errors while running CTest
make: *** [Makefile:307: test] Error 8

Tom-python0121 avatar Jun 26 '23 09:06 Tom-python0121

 55/120 Test  #55: texture-skinny ..........................***Failed    0.53 sec
Comparing "out.exr" and "ref/out.exr"
512 x 512, 4 channels
  Mean error = 7.26273e-07
  RMS error = 3.6227e-05
  Peak SNR = 88.8194
  Max error  = 0.012207 @ (485, 41, R)  values are 0.8901367, 0.8901367, 0.8901367, 1 vs 0.8779297, 0.8779297, 0.8779297, 1
  2 pixels (0.000763%) over 0.008
  6 pixels (0.00229%) over 0.004
FAILURE
newsymlink /home/stage/root/spack-stage-openimageio-2.4.12.0-ielcu5qk3c4fvsfwkedokackkejg672j/spack-src/testsuite/texture-skinny/ref ./ref
newsymlink /home/stage/root/spack-stage-openimageio-2.4.12.0-ielcu5qk3c4fvsfwkedokackkejg672j/spack-src/testsuite/texture-skinny/src ./src
newsymlink /home/stage/root/spack-stage-openimageio-2.4.12.0-ielcu5qk3c4fvsfwkedokackkejg672j/spack-src/testsuite/texture-skinny ./data
command = ../../bin/testtex  src/vertgrid.tx  --scalest 4 1   >> out.txt  ;

comparisons are ['ref/out.exr']
comparing out.exr to ref/out.exr
NO MATCH for out.exr
FAIL out.exr

Tom-python0121 avatar Jun 27 '23 03:06 Tom-python0121