OpenColorIO icon indicating copy to clipboard operation
OpenColorIO copied to clipboard

Failing tests when FMA instruction is used

Open kxxt opened this issue 1 year ago • 4 comments

Hi,

Some tests are failing on riscv64 architecture because the results on riscv64 is more precise than other architectures.

  • Operating System: Arch Linux
  • Architecture: riscv64gc
  • OpenColorIO Version: 2.2.1

For example, test GammaOpCPU, apply_moncurve_mirror_style_fwd in tests/cpu/ops/gamma/GammaOpCPU_tests.cpp fails:

Detailed output from failed test
rgba[0] = 0.94786727428436279 0.83333331346511841 0.71428573131561279 0.625
rgba[1] = 0.052132699638605118 0.1666666716337204 0.28571429848670959 0.375
rgba[2] = 2.4000000953674316 2.2000000476837158 2 1.7999999523162842
rgba[3] = 0.039285715669393539 0.1666666716337204 0.40000000596046448 0.75
rgba[4] = 0.077380158007144928 0.44192609190940857 0.8163265585899353 0.98202729225158691
Iteration 0
sign = 1 1 1 1
pixel = 0.00050000002374872565 0.004999999888241291 0.05000000074505806 0.75
repro: pixel[2]=1028443341, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.052606634795665741 0.17083333432674408 0.32142859697341919 0.84375
data = 0.00085211289115250111 0.02049555629491806 0.10331634432077408 0.73652046918869019
out = 3.8690079236403108e-05 0.0022096303291618824 0.040816329419612885 0.73652046918869019
Iteration 1
sign = -1 -1 -1 -1
pixel = 0.00050000002374872565 0.004999999888241291 0.05000000074505806 0.75
repro: pixel[2]=1028443341, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.052606634795665741 0.17083333432674408 0.32142859697341919 0.84375
data = 0.00085211289115250111 0.02049555629491806 0.10331634432077408 0.73652046918869019
out = -3.8690079236403108e-05 -0.0022096303291618824 -0.040816329419612885 -0.73652046918869019
Iteration 2
sign = 1 1 1 1
pixel = 0.25 0.5 0.75 1
repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.28909951448440552 0.58333331346511841 0.82142859697341919 1
data = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1
out = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1
Iteration 3
sign = -1 -1 -1 -1
pixel = 0.25 0.5 0.75 1
repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.28909951448440552 0.58333331346511841 0.82142859697341919 1
data = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1
out = -0.050876077264547348 -0.30550399422645569 -0.67474496364593506 -1
Iteration 4
sign = 1 1 1 1
pixel = 0.80000001192092896 0.94999998807907104 1 1.5
repro: pixel[2]=1065353216, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.81042653322219849 0.95833331346511841 1 1.3125
data = 0.60382729768753052 0.91061854362487793 1 1.6314687728881836
out = 0.60382729768753052 0.91061854362487793 1 1.6314687728881836
Iteration 5
sign = -1 -1 -1 -1
pixel = 0.80000001192092896 0.94999998807907104 1 1.5
repro: pixel[2]=1065353216, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.81042653322219849 0.95833331346511841 1 1.3125
data = 0.60382729768753052 0.91061854362487793 1 1.6314687728881836
out = -0.60382729768753052 -0.91061854362487793 -1 -1.6314687728881836
Iteration 6
sign = 1 1 1 1
pixel = 1.0049999952316284 1.0499999523162842 1.5 1
repro: pixel[2]=1069547520, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 1.0047392845153809 1.0416666269302368 1.3571429252624512 1
data = 1.0114120244979858 1.0939645767211914 1.8418369293212891 1
out = 1.0114120244979858 1.0939645767211914 1.8418369293212891 1
Iteration 7
sign = -1 -1 -1 -1
pixel = 1.0049999952316284 1.0499999523162842 1.5 1
repro: pixel[2]=1069547520, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 1.0047392845153809 1.0416666269302368 1.3571429252624512 1
data = 1.0114120244979858 1.0939645767211914 1.8418369293212891 1
out = -1.0114120244979858 -1.0939645767211914 -1.8418369293212891 -1
Iteration 8
sign = -1 1 1 1
pixel = inf inf nan 0
repro: pixel[2]=2143289344, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = inf inf nan 0.375
data = inf inf nan 0.17110247910022736
out = -inf inf nan 0
Result[0] = 3.8689999200869352e-05
Image [0] = 3.8690079236403108e-05
Result[1] = 0.0022096300963312387
Image [1] = 0.0022096303291618824
Result[2] = 0.040816318243741989
Image [2] = 0.040816329419612885
Result[3] = 0.73652046918869019
Image [3] = 0.73652046918869019
Result[4] = -3.8689999200869352e-05
Image [4] = -3.8690079236403108e-05
Result[5] = -0.0022096300963312387
Image [5] = -0.0022096303291618824
Result[6] = -0.040816318243741989
Image [6] = -0.040816329419612885
Result[7] = -0.73652046918869019
Image [7] = -0.73652046918869019
Result[8] = 0.050876069813966751
Image [8] = 0.050876077264547348
Result[9] = 0.30550399422645569
Image [9] = 0.30550399422645569
Result[10] = 0.67474484443664551
Image [10] = 0.67474496364593506
/usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710:
FAILED: Index: 10 - Values: 0.674744964 and: 0.674744844 - Threshold: 1.00000001e-07
Result[11] =                1
Image [11] =                1
Result[12] = -0.050876069813966751
Image [12] = -0.050876077264547348
Result[13] = -0.30550399422645569
Image [13] = -0.30550399422645569
Result[14] = -0.67474484443664551
Image [14] = -0.67474496364593506
/usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710:
FAILED: Index: 14 - Values: -0.674744964 and: -0.674744844 - Threshold: 1.00000001e-07
Result[15] =               -1
Image [15] =               -1
Result[16] = 0.60382729768753052
Image [16] = 0.60382729768753052
Result[17] = 0.91061854362487793
Image [17] = 0.91061854362487793
Result[18] =                1
Image [18] =                1
Result[19] = 1.6314687728881836
Image [19] = 1.6314687728881836
Result[20] = -0.60382729768753052
Image [20] = -0.60382729768753052
Result[21] = -0.91061854362487793
Image [21] = -0.91061854362487793
Result[22] =               -1
Image [22] =               -1
Result[23] = -1.6314687728881836
Image [23] = -1.6314687728881836
Result[24] = 1.0114120244979858
Image [24] = 1.0114120244979858
Result[25] = 1.0939645767211914
Image [25] = 1.0939645767211914
Result[26] = 1.8418365716934204
Image [26] = 1.8418369293212891
/usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710:
FAILED: Index: 26 - Values: 1.84183693 and: 1.84183657 - Threshold: 1.00000001e-07
Result[27] =                1
Image [27] =                1
Result[28] = -1.0114120244979858
Image [28] = -1.0114120244979858
Result[29] = -1.0939645767211914
Image [29] = -1.0939645767211914
Result[30] = -1.8418365716934204
Image [30] = -1.8418369293212891
/usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710:
FAILED: Index: 30 - Values: -1.84183693 and: -1.84183657 - Threshold: 1.00000001e-07
Result[31] =               -1
Image [31] =               -1
Result[32] =             -inf
Image [32] =             -inf
Result[33] =              inf
Image [33] =              inf
Result[34] =              nan
Image [34] =              nan
Result[35] =                0
Image [35] =                0

Note that I modified src/OpenColorIO/ops/gamma/GammaOpCPU.cpp and tests/cpu/ops/gamma/GammaOpCPU.cpp to get more outputs.

Modified source
void GammaMoncurveMirrorOpCPUFwd::apply(const void* inImg, void* outImg, long numPixels) const
{
  const float* in = (const float*)inImg;
  float* out = (float*)outImg;
  const float red[5] = { m_red.scale, m_red.offset, m_red.gamma, m_red.breakPnt, m_red.slope };
  const float grn[5] = { m_green.scale, m_green.offset, m_green.gamma, m_green.breakPnt, m_green.slope };
  const float blu[5] = { m_blue.scale, m_blue.offset, m_blue.gamma, m_blue.breakPnt, m_blue.slope };
  const float alp[5] = { m_alpha.scale, m_alpha.offset, m_alpha.gamma, m_alpha.breakPnt, m_alpha.slope };
  std::cout.precision(17);
  for (int i = 0; i < 5; i++)
  {
    std::cout << "rgba[" << i << "] = " << red[i] << " " << grn[i] << " " << blu[i] << " " << alp[i] << std::endl;
  }
  for (long idx = 0; idx < numPixels; ++idx)
  {
    std::cout << "Iteration " << idx << std::endl;

    const float sign[4] = { std::copysign(1.0f, in[0]), std::copysign(1.0f, in[1]), std::copysign(1.0f, in[2]),
                            std::copysign(1.0f, in[3]) };
    std::cout << "sign = " << sign[0] << " " << sign[1] << " " << sign[2] << " " << sign[3] << std::endl;

    const float pixel[4] = { std::fabs(in[0]), std::fabs(in[1]), std::fabs(in[2]), std::fabs(in[3]) };

    std::cout << "pixel = " << pixel[0] << " " << pixel[1] << " " << pixel[2] << " " << pixel[3] << std::endl;

    std::cout << "repro: pixel[2]=" << std::hex << *(int*)(&pixel[2]) << ", blu[0] = " << *(int*)(&blu[0])
              << ", blu[1] = " << *(int*)(&blu[1]) << std::endl;

    const float intermediate[4] = { pixel[0] * red[0] + red[1], pixel[1] * grn[0] + grn[1], pixel[2] * blu[0] + blu[1],
                                    pixel[3] * alp[0] + alp[1] };

    std::cout << "intermediate = " << intermediate[0] << " " << intermediate[1] << " " << intermediate[2] << " "
              << intermediate[3] << std::endl;

    const float data[4] = { std::pow(intermediate[0], red[2]), std::pow(intermediate[1], grn[2]),
                            std::pow(intermediate[2], blu[2]), std::pow(intermediate[3], alp[2]) };

    std::cout << "data = " << data[0] << " " << data[1] << " " << data[2] << " " << data[3] << std::endl;

    out[0] = sign[0] * (pixel[0] <= red[3] ? pixel[0] * red[4] : data[0]);
    out[1] = sign[1] * (pixel[1] <= grn[3] ? pixel[1] * grn[4] : data[1]);
    out[2] = sign[2] * (pixel[2] <= blu[3] ? pixel[2] * blu[4] : data[2]);
    out[3] = sign[3] * (pixel[3] <= alp[3] ? pixel[3] * alp[4] : data[3]);

    std::cout << "out = " << out[0] << " " << out[1] << " " << out[2] << " " << out[3] << std::endl;

    in += 4;
    out += 4;
}
void ApplyGamma(const OCIO::OpRcPtr & op,
float * image, const float * result,
long numPixels, unsigned line,
float errorThreshold)
{
std::cout.precision(10);
const auto cpu = op->getCPUOp(true);

OCIO_CHECK_NO_THROW_FROM(cpu->apply(image, image, numPixels), line);

for(long idx=0; idx<(numPixels*4); ++idx)
{
std::cout << "Result[" << idx << "] = " << std::setw(16) << result[idx] << std::endl;
std::cout << "Image [" << idx << "] = " << std::setw(16) << image[idx] << std::endl;
if (OCIO::IsNan(result[idx]))
{
OCIO_CHECK_ASSERT_FROM(OCIO::IsNan(image[idx]), line);
continue;
}
// Using rel error with a large minExpected value of 1 will transition
// from absolute error for expected values < 1 and
// relative error for values > 1.
const bool equalRel = OCIO::EqualWithSafeRelError(image[idx], result[idx],
errorThreshold, 1.0f);
if (!equalRel)
{
// As most of the error thresholds are 1e-7f, the output
// value precision should then be bigger than 7 digits
// to highlight small differences.
std::ostringstream message;
message.precision(9);
message << "Index: " << idx
<< " - Values: " << image[idx]
<< " and: " << result[idx]
<< " - Threshold: " << errorThreshold;
OCIO_CHECK_ASSERT_MESSAGE_FROM(0, message.str(), line);
}
}
}

Let's take a detailed look at the Iteration 2 from the log:

Iteration 2
sign = 1 1 1 1
pixel = 0.25 0.5 0.75 1
repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.28909951448440552 0.58333331346511841 0.82142859697341919 1
data = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1
out = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1

On x86_64 with SSE2 disabled, it looks like:

Iteration 2
sign = 1 1 1 1
pixel = 0.25 0.5 0.75 1
repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.28909951448440552 0.58333331346511841 0.82142853736877441 1
data = 0.050876077264547348 0.30550399422645569 0.67474484443664551 1
out = 0.050876077264547348 0.30550399422645569 0.67474484443664551 1

The difference of the 3rd element(at index 2) causes a test failure later.

x86_64 : intermediate[2] = 0.82142853736877441
riscv64: intermediate[2] = 0.82142859697341919

After some investigation, I found that the difference is caused by FMA(Fused Multiply-Add).

The following expression

pixel[2] * blu[0] + blu[1]

can be computed with two steps:

fmul.s <a>, <a>, <b>
fadd.s <result>,<a> , <c>

It is equivalent to round(round(a * b) + c), which is the behavior on x86_64.

However, with the current build config in this repo, gcc uses fmadd.s to compute the expr.

fmadd.s <result_register> <a> <b> <c> computes a * b + c as round(a * b + c), which is of higher precision than round(round(a * b) + c). And this difference finally caused the test failure.

There are several ways to fix this and I would like to ask for your opinions.

  1. Adjust the expected values and error threshold in the test . In my opinion, this is the most correct solution but it will involve a lot of work.
  2. Add the compiler flag: -ffp-contract=off to prevent the compiler from generating FMA instructions. Apparently this is not recommended because we will lose the benefits of the higher precision we get for free from FMA.

kxxt avatar Apr 03 '23 11:04 kxxt