OpenColorIO Failing tests when FMA instruction is used

Hi,

Some tests are failing on riscv64 architecture because the results on riscv64 is more precise than other architectures.

Operating System: Arch Linux
Architecture: riscv64gc
OpenColorIO Version: 2.2.1

For example, test GammaOpCPU, apply_moncurve_mirror_style_fwd in tests/cpu/ops/gamma/GammaOpCPU_tests.cpp fails:

Detailed output from failed test

rgba[0] = 0.94786727428436279 0.83333331346511841 0.71428573131561279 0.625
rgba[1] = 0.052132699638605118 0.1666666716337204 0.28571429848670959 0.375
rgba[2] = 2.4000000953674316 2.2000000476837158 2 1.7999999523162842
rgba[3] = 0.039285715669393539 0.1666666716337204 0.40000000596046448 0.75
rgba[4] = 0.077380158007144928 0.44192609190940857 0.8163265585899353 0.98202729225158691
Iteration 0
sign = 1 1 1 1
pixel = 0.00050000002374872565 0.004999999888241291 0.05000000074505806 0.75
repro: pixel[2]=1028443341, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.052606634795665741 0.17083333432674408 0.32142859697341919 0.84375
data = 0.00085211289115250111 0.02049555629491806 0.10331634432077408 0.73652046918869019
out = 3.8690079236403108e-05 0.0022096303291618824 0.040816329419612885 0.73652046918869019
Iteration 1
sign = -1 -1 -1 -1
pixel = 0.00050000002374872565 0.004999999888241291 0.05000000074505806 0.75
repro: pixel[2]=1028443341, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.052606634795665741 0.17083333432674408 0.32142859697341919 0.84375
data = 0.00085211289115250111 0.02049555629491806 0.10331634432077408 0.73652046918869019
out = -3.8690079236403108e-05 -0.0022096303291618824 -0.040816329419612885 -0.73652046918869019
Iteration 2
sign = 1 1 1 1
pixel = 0.25 0.5 0.75 1
repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.28909951448440552 0.58333331346511841 0.82142859697341919 1
data = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1
out = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1
Iteration 3
sign = -1 -1 -1 -1
pixel = 0.25 0.5 0.75 1
repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.28909951448440552 0.58333331346511841 0.82142859697341919 1
data = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1
out = -0.050876077264547348 -0.30550399422645569 -0.67474496364593506 -1
Iteration 4
sign = 1 1 1 1
pixel = 0.80000001192092896 0.94999998807907104 1 1.5
repro: pixel[2]=1065353216, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.81042653322219849 0.95833331346511841 1 1.3125
data = 0.60382729768753052 0.91061854362487793 1 1.6314687728881836
out = 0.60382729768753052 0.91061854362487793 1 1.6314687728881836
Iteration 5
sign = -1 -1 -1 -1
pixel = 0.80000001192092896 0.94999998807907104 1 1.5
repro: pixel[2]=1065353216, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.81042653322219849 0.95833331346511841 1 1.3125
data = 0.60382729768753052 0.91061854362487793 1 1.6314687728881836
out = -0.60382729768753052 -0.91061854362487793 -1 -1.6314687728881836
Iteration 6
sign = 1 1 1 1
pixel = 1.0049999952316284 1.0499999523162842 1.5 1
repro: pixel[2]=1069547520, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 1.0047392845153809 1.0416666269302368 1.3571429252624512 1
data = 1.0114120244979858 1.0939645767211914 1.8418369293212891 1
out = 1.0114120244979858 1.0939645767211914 1.8418369293212891 1
Iteration 7
sign = -1 -1 -1 -1
pixel = 1.0049999952316284 1.0499999523162842 1.5 1
repro: pixel[2]=1069547520, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 1.0047392845153809 1.0416666269302368 1.3571429252624512 1
data = 1.0114120244979858 1.0939645767211914 1.8418369293212891 1
out = -1.0114120244979858 -1.0939645767211914 -1.8418369293212891 -1
Iteration 8
sign = -1 1 1 1
pixel = inf inf nan 0
repro: pixel[2]=2143289344, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = inf inf nan 0.375
data = inf inf nan 0.17110247910022736
out = -inf inf nan 0
Result[0] = 3.8689999200869352e-05
Image [0] = 3.8690079236403108e-05
Result[1] = 0.0022096300963312387
Image [1] = 0.0022096303291618824
Result[2] = 0.040816318243741989
Image [2] = 0.040816329419612885
Result[3] = 0.73652046918869019
Image [3] = 0.73652046918869019
Result[4] = -3.8689999200869352e-05
Image [4] = -3.8690079236403108e-05
Result[5] = -0.0022096300963312387
Image [5] = -0.0022096303291618824
Result[6] = -0.040816318243741989
Image [6] = -0.040816329419612885
Result[7] = -0.73652046918869019
Image [7] = -0.73652046918869019
Result[8] = 0.050876069813966751
Image [8] = 0.050876077264547348
Result[9] = 0.30550399422645569
Image [9] = 0.30550399422645569
Result[10] = 0.67474484443664551
Image [10] = 0.67474496364593506
/usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710:
FAILED: Index: 10 - Values: 0.674744964 and: 0.674744844 - Threshold: 1.00000001e-07
Result[11] =                1
Image [11] =                1
Result[12] = -0.050876069813966751
Image [12] = -0.050876077264547348
Result[13] = -0.30550399422645569
Image [13] = -0.30550399422645569
Result[14] = -0.67474484443664551
Image [14] = -0.67474496364593506
/usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710:
FAILED: Index: 14 - Values: -0.674744964 and: -0.674744844 - Threshold: 1.00000001e-07
Result[15] =               -1
Image [15] =               -1
Result[16] = 0.60382729768753052
Image [16] = 0.60382729768753052
Result[17] = 0.91061854362487793
Image [17] = 0.91061854362487793
Result[18] =                1
Image [18] =                1
Result[19] = 1.6314687728881836
Image [19] = 1.6314687728881836
Result[20] = -0.60382729768753052
Image [20] = -0.60382729768753052
Result[21] = -0.91061854362487793
Image [21] = -0.91061854362487793
Result[22] =               -1
Image [22] =               -1
Result[23] = -1.6314687728881836
Image [23] = -1.6314687728881836
Result[24] = 1.0114120244979858
Image [24] = 1.0114120244979858
Result[25] = 1.0939645767211914
Image [25] = 1.0939645767211914
Result[26] = 1.8418365716934204
Image [26] = 1.8418369293212891
/usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710:
FAILED: Index: 26 - Values: 1.84183693 and: 1.84183657 - Threshold: 1.00000001e-07
Result[27] =                1
Image [27] =                1
Result[28] = -1.0114120244979858
Image [28] = -1.0114120244979858
Result[29] = -1.0939645767211914
Image [29] = -1.0939645767211914
Result[30] = -1.8418365716934204
Image [30] = -1.8418369293212891
/usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710:
FAILED: Index: 30 - Values: -1.84183693 and: -1.84183657 - Threshold: 1.00000001e-07
Result[31] =               -1
Image [31] =               -1
Result[32] =             -inf
Image [32] =             -inf
Result[33] =              inf
Image [33] =              inf
Result[34] =              nan
Image [34] =              nan
Result[35] =                0
Image [35] =                0

Note that I modified src/OpenColorIO/ops/gamma/GammaOpCPU.cpp and tests/cpu/ops/gamma/GammaOpCPU.cpp to get more outputs.

Modified source

void GammaMoncurveMirrorOpCPUFwd::apply(const void* inImg, void* outImg, long numPixels) const
{
  const float* in = (const float*)inImg;
  float* out = (float*)outImg;
  const float red[5] = { m_red.scale, m_red.offset, m_red.gamma, m_red.breakPnt, m_red.slope };
  const float grn[5] = { m_green.scale, m_green.offset, m_green.gamma, m_green.breakPnt, m_green.slope };
  const float blu[5] = { m_blue.scale, m_blue.offset, m_blue.gamma, m_blue.breakPnt, m_blue.slope };
  const float alp[5] = { m_alpha.scale, m_alpha.offset, m_alpha.gamma, m_alpha.breakPnt, m_alpha.slope };
  std::cout.precision(17);
  for (int i = 0; i < 5; i++)
  {
    std::cout << "rgba[" << i << "] = " << red[i] << " " << grn[i] << " " << blu[i] << " " << alp[i] << std::endl;
  }
  for (long idx = 0; idx < numPixels; ++idx)
  {
    std::cout << "Iteration " << idx << std::endl;

    const float sign[4] = { std::copysign(1.0f, in[0]), std::copysign(1.0f, in[1]), std::copysign(1.0f, in[2]),
                            std::copysign(1.0f, in[3]) };
    std::cout << "sign = " << sign[0] << " " << sign[1] << " " << sign[2] << " " << sign[3] << std::endl;

    const float pixel[4] = { std::fabs(in[0]), std::fabs(in[1]), std::fabs(in[2]), std::fabs(in[3]) };

    std::cout << "pixel = " << pixel[0] << " " << pixel[1] << " " << pixel[2] << " " << pixel[3] << std::endl;

    std::cout << "repro: pixel[2]=" << std::hex << *(int*)(&pixel[2]) << ", blu[0] = " << *(int*)(&blu[0])
              << ", blu[1] = " << *(int*)(&blu[1]) << std::endl;

    const float intermediate[4] = { pixel[0] * red[0] + red[1], pixel[1] * grn[0] + grn[1], pixel[2] * blu[0] + blu[1],
                                    pixel[3] * alp[0] + alp[1] };

    std::cout << "intermediate = " << intermediate[0] << " " << intermediate[1] << " " << intermediate[2] << " "
              << intermediate[3] << std::endl;

    const float data[4] = { std::pow(intermediate[0], red[2]), std::pow(intermediate[1], grn[2]),
                            std::pow(intermediate[2], blu[2]), std::pow(intermediate[3], alp[2]) };

    std::cout << "data = " << data[0] << " " << data[1] << " " << data[2] << " " << data[3] << std::endl;

    out[0] = sign[0] * (pixel[0] <= red[3] ? pixel[0] * red[4] : data[0]);
    out[1] = sign[1] * (pixel[1] <= grn[3] ? pixel[1] * grn[4] : data[1]);
    out[2] = sign[2] * (pixel[2] <= blu[3] ? pixel[2] * blu[4] : data[2]);
    out[3] = sign[3] * (pixel[3] <= alp[3] ? pixel[3] * alp[4] : data[3]);

    std::cout << "out = " << out[0] << " " << out[1] << " " << out[2] << " " << out[3] << std::endl;

    in += 4;
    out += 4;
}

void ApplyGamma(const OCIO::OpRcPtr & op,
float * image, const float * result,
long numPixels, unsigned line,
float errorThreshold)
{
std::cout.precision(10);
const auto cpu = op->getCPUOp(true);

OCIO_CHECK_NO_THROW_FROM(cpu->apply(image, image, numPixels), line);

for(long idx=0; idx<(numPixels*4); ++idx)
{
std::cout << "Result[" << idx << "] = " << std::setw(16) << result[idx] << std::endl;
std::cout << "Image [" << idx << "] = " << std::setw(16) << image[idx] << std::endl;
if (OCIO::IsNan(result[idx]))
{
OCIO_CHECK_ASSERT_FROM(OCIO::IsNan(image[idx]), line);
continue;
}
// Using rel error with a large minExpected value of 1 will transition
// from absolute error for expected values < 1 and
// relative error for values > 1.
const bool equalRel = OCIO::EqualWithSafeRelError(image[idx], result[idx],
errorThreshold, 1.0f);
if (!equalRel)
{
// As most of the error thresholds are 1e-7f, the output
// value precision should then be bigger than 7 digits
// to highlight small differences.
std::ostringstream message;
message.precision(9);
message << "Index: " << idx
<< " - Values: " << image[idx]
<< " and: " << result[idx]
<< " - Threshold: " << errorThreshold;
OCIO_CHECK_ASSERT_MESSAGE_FROM(0, message.str(), line);
}
}
}

Let's take a detailed look at the Iteration 2 from the log:

Iteration 2
sign = 1 1 1 1
pixel = 0.25 0.5 0.75 1
repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.28909951448440552 0.58333331346511841 0.82142859697341919 1
data = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1
out = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1

On x86_64 with SSE2 disabled, it looks like:

Iteration 2
sign = 1 1 1 1
pixel = 0.25 0.5 0.75 1
repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.28909951448440552 0.58333331346511841 0.82142853736877441 1
data = 0.050876077264547348 0.30550399422645569 0.67474484443664551 1
out = 0.050876077264547348 0.30550399422645569 0.67474484443664551 1

The difference of the 3rd element(at index 2) causes a test failure later.

x86_64 : intermediate[2] = 0.82142853736877441
riscv64: intermediate[2] = 0.82142859697341919

After some investigation, I found that the difference is caused by FMA(Fused Multiply-Add).

The following expression

pixel[2] * blu[0] + blu[1]

can be computed with two steps:

fmul.s <a>, <a>, <b>
fadd.s <result>,<a> , <c>

It is equivalent to round(round(a * b) + c), which is the behavior on x86_64.

However, with the current build config in this repo, gcc uses fmadd.s to compute the expr.

fmadd.s <result_register> <a> <b> <c> computes a * b + c as round(a * b + c), which is of higher precision than round(round(a * b) + c). And this difference finally caused the test failure.

There are several ways to fix this and I would like to ask for your opinions.

Adjust the expected values and error threshold in the test . In my opinion, this is the most correct solution but it will involve a lot of work.
Add the compiler flag: -ffp-contract=off to prevent the compiler from generating FMA instructions. Apparently this is not recommended because we will lose the benefits of the higher precision we get for free from FMA.

Apr 03 '23 11:04 kxxt

This is also reproducible on x86_64 platforms with the following steps

CXXFLAGS='-mfma' cmake -GNinja  -Bbuild  -DOCIO_USE_SSE=OFF
ninja -C build
cd build/tests/cpu
./test_cpu_exec

Failed tests: https://pastebin.pl/view/fd57c16e

I can send a PR to fix the failing tests when FMA is used.

Apr 12 '23 01:04 kxxt

Thank you for investigating this @kxxt. I think we would welcome a PR that fixes the test as long as the tests for our usual compilers don't get significantly worse. So there should be some mechanism for deciding when to use the looser tests. Do you have a suggestion for how to do that?

Thanks for posting the output from the failing tests. It looks like the first one will already be fixed in this pending PR: https://github.com/AcademySoftwareFoundation/OpenColorIO/pull/1687

The following ones in CPUProcessor_tests.cpp look like they could be fixed by allowing a tolerance of +/- 1. However, we should do this in a way that does not remove the exact test for our usual compilers, some sort of "loose mode".

The file format tests could be skipped when in this mode.

The GammaOpCPU_tests.cpp tolerance could be loosened to 2e-7 when in this mode.

Apr 13 '23 03:04 doug-walker

Hi @doug-walker ,

So there should be some mechanism for deciding when to use the looser tests.

I don't think there is any reliable way to detect FMA other than compiling a sample program and compare its output value. For gcc, -ffp-contract is set to fast by default. The bug doesn't occur on x86_64 because gcc by default targets generic x86_64 that doesn't have FMA instructions.

as long as the tests for our usual compilers don't get significantly worse.

Technically speaking, the tests are incorrect and looser tests are correct. The language standard makes no guarantee that we will get the hard-coded results in the tests. Whether and how expressions are contracted is implementation-defined. [^1]

A floating expression may be contracted, that is, evaluated as though it were an atomic operation, thereby omitting rounding errors implied by the source code and the expression evaluation method. The FP_CONTRACT pragma in <math.h> provides a way to disallow contracted expressions. Otherwise, whether and how expressions are contracted is implementation-defined.

[^1]: C99 standard, page 80, https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf

Apr 13 '23 04:04 kxxt

#1950 might have resolved this issue. It fixes a few unit test failures related to FMA on apple silicon.

Apr 29 '24 20:04 markreidvfx