Getting NaNs out of CPU processor

Open fannychaleon-fn opened this issue 3 years ago • 4 comments

Hello, I'm getting a really weird issue on one Windows machine that I can't reproduce on other machines or on Linux. The following snippet produces NaNs, which cause crashes further down the line:

    OCIO::ConstConfigRcPtr config = OCIO::GetCurrentConfig();
    OCIO::DisplayViewTransformRcPtr transform = OCIO::DisplayViewTransform::Create();
    transform->setSrc(OCIO::ROLE_SCENE_LINEAR);
    transform->setDisplay("default");
    transform->setView("sRGB");

    OCIO::ConstProcessorRcPtr processor = config->getProcessor(transform);
    int64_t width = 96;
    int64_t height = 96;
    std::vector<float> data(width * height * 4, 0.0f);
    // Initialize image
    initImage(data);

    OCIO::BitDepth bitDepth = OCIO::BIT_DEPTH_F32;
    ptrdiff_t chanStrideBytes = sizeof(float);
    ptrdiff_t xStrideBytes = sizeof(float) * 4;
    ptrdiff_t yStrideBytes = xStrideBytes * width;  // 1536 bytes per row

    OCIO::PackedImageDesc imageDesc(data.data(), width, height, 4, bitDepth,
        chanStrideBytes, xStrideBytes, yStrideBytes);
    OCIO::ConstCPUProcessorRcPtr cpu = processor->getDefaultCPUProcessor();
    cpu->apply(imageDesc);

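    // Scan every float, not every pixel: pxl is a flat index into the RGBA buffer.
    const float* outImg = static_cast<const float*>(imageDesc.getData());
    const size_t numFloats = static_cast<size_t>(width * height * 4);
    for (size_t pxl = 0; pxl < numFloats; ++pxl)
    {
        if (std::isnan(outImg[pxl]))
        {
            std::cerr << "NaN found for pixel " << pxl << std::endl;
        }
    }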

What system info would you need? I'm building on Windows 10 with MSVC 14.29, using the nuke-default config. Enclosed is the complete code, which sets up an image pixel by pixel and then processes it: main.zip

Best

Fanny

fannychaleon-fn • Oct 25 '22 14:10

Which index had the NaN? Did you run your NaN check on your data before calling apply? If it is fine but the output is not, we'd need to see the .ocio and associated LUTs to help further.
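To make the before/after comparison concrete, here is a minimal sketch of a scan that could be run once on the input buffer and once on the output. The helper name findNonFinite is hypothetical (it is not part of OCIO), and it also reports infinities, which the original check does not.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not OCIO API): returns the flat indices of every
// non-finite float (NaN or +/-Inf) in an interleaved image buffer.
std::vector<std::size_t> findNonFinite(const std::vector<float>& data)
{
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < data.size(); ++i)
    {
        if (!std::isfinite(data[i]))
        {
            bad.push_back(i);
        }
    }
    return bad;
}
```

Calling it on the buffer right before cpu->apply(imageDesc) and again right after would show whether the bad values were already present in the input or were introduced by the processor.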

SonyDennisAdams • Oct 25 '22 17:10

Yes, I ran the same NaN test after creating the PackedImageDesc and before the apply, and the data is fine. Here is the output:

Start OCIO test
NaN found for pixel 8241
NaN found for pixel 12478
NaN found for pixel 12862
NaN found for pixel 13246
NaN found for pixel 13282
NaN found for pixel 13630
NaN found for pixel 14013
NaN found for pixel 14021
NaN found for pixel 15144
NaN found for pixel 15202
NaN found for pixel 16672
NaN found for pixel 17442
NaN found for pixel 17825
NaN found for pixel 20508
NaN found for pixel 23954
NaN found for pixel 28566
NaN found for pixel 29724
NaN found for pixel 30102
End OCIO test

The config I've used is nuke-default from the master branch of https://github.com/colour-science/OpenColorIO-Configs. Another note: running set OCIO_OPTIMIZATION_FLAGS=230440899, as advised in https://github.com/AcademySoftwareFoundation/OpenColorIO/pull/1614, fixes it.

Best

fannychaleon-fn • Oct 26 '22 08:10

I'm glad you found #1614 -- based on what I'm reading there, it is most likely what you are seeing. Even though the nuke-default config does not use half-domain LUTs, you are applying an inverse operation: you go from linear to sRGB, and that configuration only provides the sRGB-to-linear LUT. OCIOv2 therefore bakes a half-domain LUT for the inverse, which exposes that bug. Since you already found the fix, I think you're good, right?

Note that all of the failing pixels have similar magnitudes (e-05 and e-06), which aligns with the bug description:

image[8241] = 5.79741426918189972639083862305e-05;
image[12478] = 4.54930668638553470373153686523e-05;
image[12862] = 4.34900321124587208032608032227e-05;
image[13246] = 5.74186597077641636133193969727e-05;
image[13282] = 2.84011161966191139072179794312e-06;
image[13630] = 5.88868424529209733009338378906e-05;
image[14013] = 5.7244138588430359959602355957e-05;
image[14021] = 2.32401589528308250010013580322e-06;
image[15144] = 5.72872231714427471160888671875e-05;
image[15202] = 5.27944721397943794727325439453e-05;
image[16672] = 3.53323139279382303357124328613e-06;
image[17442] = 3.13506425300147384405136108398e-05;
image[17825] = 3.63670224032830446958541870117e-05;
image[20508] = 3.30972652591299265623092651367e-05;
image[23954] = 5.94633311266079545021057128906e-05;
image[28566] = 4.56161797046661376953125e-05;
image[29724] = 3.63016151823103427886962890625e-05;
image[30102] = 3.77574224330601282417774200439e-06;

The min value is 2.32401589528308E-06 and the max value is 5.94633311266079545021057128906e-05.
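A side note on those magnitudes: every listed value is below 2^-14 (about 6.10e-05), the smallest normal value of an IEEE 754 half float, so each one would be subnormal if stored as a half. Whether that is the actual trigger is not confirmed in this thread; the sketch below only makes the range check explicit. The helper name and constants are illustrative and are not OCIO API.

```cpp
#include <cmath>

// IEEE 754 binary16 limits. These are properties of the half format itself,
// not values taken from OCIO or Imath.
constexpr float kHalfMinNormal    = 6.103515625e-05f;        // 2^-14
constexpr float kHalfMinSubnormal = 5.9604644775390625e-08f; // 2^-24

// Hypothetical helper: true if |v| would land in the half-float subnormal
// range, i.e. non-zero but smaller than the smallest normal half.
bool inHalfSubnormalRange(float v)
{
    const float a = std::fabs(v);
    return a >= kHalfMinSubnormal && a < kHalfMinNormal;
}
```

Both the reported minimum (about 2.32e-06) and maximum (about 5.95e-05) satisfy this check, while a value such as 1.0e-04 does not.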

There are 82 values initialized within that range, but your code lists only 18 of them. That difference could be related to which RGBA channel they represent.
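To check the channel question, the flat indices printed by the test can be decoded into a pixel coordinate and a channel. This is a small sketch that assumes the layout from the original snippet (96 pixels wide, 4 interleaved channels, no row padding); the struct and function names are made up for illustration.

```cpp
#include <cstddef>

// Position of one float inside an interleaved image buffer.
struct ChannelLocation
{
    std::size_t x;        // column
    std::size_t y;        // row
    std::size_t channel;  // 0 = R, 1 = G, 2 = B, 3 = A for RGBA
};

// Hypothetical helper: decodes a flat float index, assuming tightly packed
// rows of `width` pixels with `numChannels` floats per pixel.
ChannelLocation locate(std::size_t flatIndex, std::size_t width, std::size_t numChannels)
{
    const std::size_t pixel = flatIndex / numChannels;
    return ChannelLocation{ pixel % width, pixel / width, flatIndex % numChannels };
}
```

For example, locate(8241, 96, 4) gives x = 44, y = 21, channel = 1 (green). Applying index % 4 to the 18 reported indices gives 4 in R, 4 in G, 10 in B, and none in alpha.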

SonyDennisAdams • Oct 26 '22 16:10

Unfortunately, we've pulled the fix from https://github.com/AcademySoftwareFoundation/OpenColorIO/pull/1614 (we're using OCIO 2.1.2) and it doesn't solve this issue. We don't want to change the optimisation flags, as that might affect performance.

fannychaleon-fn • Oct 26 '22 16:10

I followed up with @fannychaleon-fn on this. It's only reproducible on one of our Windows machines and, having looked at our internal ticket, I think it's possibly related to #1764, as it's the same machine in both cases. We're happy for this one to be closed @doug-walker @carolalynn

mjtitchener-fn • Apr 17 '24 11:04

Thanks Mark, will close this one. It would be interesting to know whether #1764 could also be closed, if it was fixed by the bug fix in Imath.

doug-walker • Apr 17 '24 17:04

@doug-walker Yes, I'll update on #1764 when we've been able to test these. We don't have builds with the updated Imath just yet.

mjtitchener-fn • Apr 18 '24 09:04