Large performance difference using the GPU path between OCIO v1 and OCIO v2
We are investigating a performance issue in Nuke Studio when moving from OCIO v1.1.1 to OCIO v2.1.2. The use case is playing back a 4K EXR sequence at 60fps with a number of OCIO effects applied to the video. In this particular test case we have one Display Transform, one ColorSpace Transform and a number of CDL Transform objects applied to the video. Playing back the video at 60fps was not a problem with OCIO v1.1.1, but the same sequence is unable to maintain 60fps with OCIO v2.1.2. We seem to have a performance problem when a number of LUTs are being applied via the GPUProcessor path. The slowdown is more noticeable on Windows and Linux than on macOS.
I've tracked the problem to the way OCIO handles texture caching, which is to say there is none on the user-facing side. After obtaining a GpuShaderDescRcPtr from the GPUProcessor, the new API to set up the textures for the OCIO transform is GPUProcessor::extractGpuShaderInfo(GpuShaderDescRcPtr & shaderDesc). This API cycles through the Op list and calls extractGpuShaderInfo on each Op.
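For reference, our per-frame setup follows the documented GPU path and looks roughly like the sketch below (simplified: the transform argument stands in for whichever Display/ColorSpace/CDL transform is being applied, and error handling is omitted):

#include <OpenColorIO/OpenColorIO.h>

namespace OCIO = OCIO_NAMESPACE;

void setupGpuShader(const OCIO::ConstConfigRcPtr & config,
                    const OCIO::ConstTransformRcPtr & transform)
{
    OCIO::ConstProcessorRcPtr processor = config->getProcessor(transform);
    OCIO::ConstGPUProcessorRcPtr gpuProcessor = processor->getDefaultGPUProcessor();

    OCIO::GpuShaderDescRcPtr shaderDesc = OCIO::GpuShaderDesc::CreateShaderDesc();
    shaderDesc->setLanguage(OCIO::GPU_LANGUAGE_GLSL_4_0);

    // This is the call that walks the Op list and (re)creates the texture data every time.
    gpuProcessor->extractGpuShaderInfo(shaderDesc);
}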
The call chain goes something like:
GPUProcessor::extractGpuShaderInfo -> GPUProcessor::Impl::extractGpuShaderInfo -> Lut3DOp::extractGpuShaderInfo -> GetLut3DGPUShaderProgram
GetLut3DGPUShaderProgram then calls shaderCreator->add3DTexture() to add a texture (the shaderCreator object is derived from the GpuShaderDescRcPtr we created). The texture is created as an instance of struct Texture, and the Texture constructor contains this bit of code to take a copy of the texture data from the Op's lutData:
// An unfortunate copy is mandatory to allow the creation of a GPU shader cache.
// The cache needs a decoupling of the processor and shader instances forbidding
// shared naked pointer usage.
CreateArray(v, m_width, m_height, m_depth, m_type, m_values);
This call is problematic when processing 4K images. For a LUT with an edge length of 65, the buffer is 65 * 65 * 65 * 3 * sizeof(float) bytes, roughly 3.14 MiB. This memcpy can take over 1ms, and the cost adds up the more OCIO Ops we layer onto our effects. When I added simple timing code to OCIO, the call to GetLut3DGPUShaderProgram could take 4ms to execute, and our profiling in Studio showed that a single OCIO::CDLTransform can take as much as 8ms during a frame rendering cycle on our test PC. That means just two of these effects are enough to push the frame time over the budget needed to maintain 60fps. With OCIO v1.1.1 the same CDLTransform took around 0.01ms.
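For reference, the rough numbers we are working against (the 8ms figure is from our own profiling, and the buffer size assumes RGB float LUT data):

65 * 65 * 65 texels * 3 channels * 4 bytes = 3,295,500 bytes, roughly 3.14 MiB copied per 3D LUT
frame budget at 60fps = 1000ms / 60, roughly 16.7ms
2 transforms at ~8ms each = ~16ms, leaving under 1ms of headroom for everything else in the frame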
I've written a caching system to cache the GpuShaderDescRcPtr objects created by the GPUProcessor::extractGpuShaderInfo calls, and with it we can maintain 60fps as long as the OCIO Op parameters aren't changing too often. However, any parameter change causes a cache miss for the GpuShaderDescRcPtr, and the frame rate hiccups. It also means we can't animate effects during playback, since that would cause cache misses every frame. When I looked into what v1.1.1 was doing differently, I noticed that the function getGpuLut3DCacheID appears to implement a simple shaderDesc cache as well.
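For reference, the cache on our side is essentially the sketch below (simplified: no eviction and no locking, it assumes the Processor's getCacheID() is a stable key for the current Op parameters, and getOrCreateShaderDesc is our own helper, not an OCIO API):

#include <OpenColorIO/OpenColorIO.h>
#include <string>
#include <unordered_map>

namespace OCIO = OCIO_NAMESPACE;

// Map from processor cache ID to a shader desc whose textures were already extracted.
static std::unordered_map<std::string, OCIO::GpuShaderDescRcPtr> shaderCache;

OCIO::GpuShaderDescRcPtr getOrCreateShaderDesc(const OCIO::ConstProcessorRcPtr & processor)
{
    const std::string key = processor->getCacheID();

    auto it = shaderCache.find(key);
    if (it != shaderCache.end())
        return it->second;  // cache hit: the expensive texture copies are skipped

    OCIO::GpuShaderDescRcPtr shaderDesc = OCIO::GpuShaderDesc::CreateShaderDesc();
    shaderDesc->setLanguage(OCIO::GPU_LANGUAGE_GLSL_4_0);

    // cache miss: pay the extractGpuShaderInfo / texture copy cost once
    processor->getDefaultGPUProcessor()->extractGpuShaderInfo(shaderDesc);

    shaderCache.emplace(key, shaderDesc);
    return shaderDesc;
}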
My question is: how can we mitigate this, and are we doing the right things? I've followed the "Building a GPU shader" guide on how to use the GPU path.