[BUG] Slow read performance for multichannel EXRs in oiiotool
Describe the bug
I'm getting very slow read performance on multi-channel EXRs where I just need to read the RGBA channels and write out a .png file from them.
It ranges from 19s on my own machine (which seems crazy for a 55 MB EXR?) to 5s on the fastest machine I have around. On my own slow machine, taking the same EXR through Fusion to a PNG takes 1.3s (regardless of cold/hot cache).
The 19 seconds (but 1.3s through Fusion) is on:
- Windows 10
- 2x Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70 GHz
- Dual Xeon in HP workstation
- 128 GB RAM
- 4 GB GPU
The 5 seconds is on:
- Windows 11
- AMD Ryzen 9 9900X 12-Core Processor (4.40 GHz)
- 128 GB RAM
- 12 GB GPU
OpenImageIO version and dependencies
See test runs here.
OIIO 3.1.1.0dev | unknown arch?
Build compiler: MSVS 1943 | C++17/199711
HW features enabled at build: sse2
No CUDA support (disabled / unavailable at build time)
Dependencies: DCMTK NONE, expat 2.6.3, FFmpeg NONE, fmt 10.2.1, Freetype 2.13.2, GIF NONE, Imath 3.2.0, JPEG 80,
JXL NONE, libdeflate 1.23, Libheif NONE, libjpeg-turbo 3.0.4, LibRaw NONE, libuhdr 1.2.0, minizip-ng 4.0.7,
OpenColorIO 2.4.1, OpenCV NONE, OpenEXR 3.4.0, OpenJPEG NONE, OpenVDB NONE, PNG 1.6.47, Ptex NONE, Ptex NONE,
pystring 1.1.4, Robinmap 1.4.0, TBB 2022.2.0, TIFF 4.6.0, WebP 1.4.0, yaml-cpp 0.8.0, ZLIB 1.3.1, ZLIB 1.3.1
To Reproduce
Steps to reproduce the behavior:
oiiotool -v -i:ch=R,G,B,A input.exr -o output.png
I'm explicitly filtering with -i:ch=R,G,B,A to avoid reading the rest of the file; adding :now=1 doesn't make much of a difference either.
Other options, such as also passing the -ch flag, are included in the linked test runs with their timings. Using -i:ch=... gave the fastest results, but at 19s it still seems awful.
Here's the file I'm testing with: https://we.tl/t-Fjm4XVFw8z It'll only remain available for about four weeks.
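As a possible way to narrow things down, here is a minimal standalone timing sketch that reads only the RGBA channels through ImageInput directly, bypassing oiiotool. It assumes the attached file is saved as input.exr and that R,G,B,A end up as channels 0-3 in the spec (check spec.channelnames and adjust chbegin/chend otherwise); it should separate raw EXR decode time from any oiiotool/ImageBuf overhead.

```cpp
// Timing sketch: read only channels [0,4) of a multi-channel EXR as float.
// Assumes "input.exr" is the attached test file.
#include <OpenImageIO/imageio.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    using namespace OIIO;
    auto t0 = std::chrono::steady_clock::now();

    auto in = ImageInput::open("input.exr");
    if (!in) {
        std::fprintf(stderr, "open failed: %s\n", geterror().c_str());
        return 1;
    }
    const ImageSpec& spec = in->spec();
    std::vector<float> pixels((size_t)spec.width * spec.height * 4);
    // Channel-range overload: only channels [0,4) are requested.
    if (!in->read_image(0 /*subimage*/, 0 /*miplevel*/, 0 /*chbegin*/, 4 /*chend*/,
                        TypeDesc::FLOAT, pixels.data()))
        std::fprintf(stderr, "read failed: %s\n", in->geterror().c_str());
    in->close();

    auto t1 = std::chrono::steady_clock::now();
    std::printf("read of 4/%d channels took %.3f s\n", spec.nchannels,
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}
```

Compiling this against the same OIIO build and comparing its wall-clock time with the oiiotool run above would show how much of the 19s is the reader itself.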
Evidence
See test runs here.
Additional Context
This also came up on the ASWF Slack in a lengthy thread here.
Some notable comments from there:
Build options didn't make much of a difference?
So, I've been testing some build variants. Enabling more hardware features like SSSE3, SSE4.1, SSE4.2, AVX, etc. makes the executable stop working on my machine - likely due to AVX (it does work on newer machines, but there was no real measurable performance difference on these simple read/write test cases) - so I opted to go back to the sse2 default but add TBB, which does run. I've added the performance on my slow machine to the gist (see link below). There is no noticeable speed difference; it remains around 18/19 seconds on my machine.
For what it's worth, using
--frames 1001-1006 and --parallel-frames also takes about 20 seconds on my machine (so almost no time increase). Using --frames 1001-1030 with --parallel-frames takes about 53 seconds on my machine. So I can get good gains by allowing it to read many files in parallel. However, it still seems that the bulk of the time is the read performance itself.
I can confirm that things are probably too slow here as well -- Intel i7-8750H laptop 2018, ~8.2 seconds for the above command.
A profile shows that the vast majority of time is spent on conversion and memcpy'ing things around, in particular the ImageInput::read_tile -> convert_image sequence. Debugging shows that read_tile is looping through all 78 channels of this image even though the command line only wants to touch the first 4 [1]. Beyond that, yes, half<->float conversion is pretty bad.
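As a rough way to put a number on the half<->float conversion cost mentioned above, here is an illustrative micro-benchmark using OIIO::convert_image; the buffer dimensions and channel count are assumptions, not taken from the attached file.

```cpp
// Illustrative micro-benchmark (not a patch): time OIIO::convert_image for a
// HALF -> FLOAT conversion of a frame-sized buffer, roughly matching the
// conversion step the profile points at.
#include <OpenImageIO/imageio.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    using namespace OIIO;
    const int width = 4096, height = 2160, nchannels = 4;   // assumed sizes
    const size_t n = (size_t)width * height * nchannels;
    std::vector<uint16_t> src(n, 0x3C00);   // raw half bits for 1.0
    std::vector<float> dst(n);

    auto t0 = std::chrono::steady_clock::now();
    convert_image(nchannels, width, height, 1,
                  src.data(), TypeDesc::HALF, AutoStride, AutoStride, AutoStride,
                  dst.data(), TypeDesc::FLOAT, AutoStride, AutoStride, AutoStride);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("half->float convert_image: %.3f s for %zu samples (dst[0]=%g)\n",
                std::chrono::duration<double>(t1 - t0).count(), n, dst[0]);
    return 0;
}
```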
Additionally, copy_image is pretty slow too because it's issuing hundreds of 4-byte calls to memcpy [2]. No amount of SSE will really help at that copy length.
[1] https://github.com/AcademySoftwareFoundation/OpenImageIO/blob/main/src/libOpenImageIO/imageinput.cpp#L593
[2] https://github.com/AcademySoftwareFoundation/OpenImageIO/blob/main/src/libOpenImageIO/imageio.cpp#L1018
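To illustrate the copy_image observation above, here is a standalone micro-benchmark (illustrative only, not OIIO code; exact numbers will vary by machine and compiler) comparing one 4-byte memcpy per sample against a single contiguous memcpy per scanline.

```cpp
// Illustrative micro-benchmark: per-sample 4-byte memcpy vs. one bulk memcpy.
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    const size_t samples = 4096 * 4;          // one scanline, 4 channels
    const size_t reps    = 50000;             // repeat to get measurable time
    std::vector<float> src(samples, 1.0f), dst(samples);
    // volatile so the compiler can't fold the 4-byte memcpy into a single
    // store, mirroring the runtime-sized copies issued by copy_image.
    volatile size_t chunk = sizeof(float);

    auto t0 = std::chrono::steady_clock::now();
    for (size_t r = 0; r < reps; ++r)
        for (size_t i = 0; i < samples; ++i)  // one tiny memcpy per sample
            std::memcpy(&dst[i], &src[i], chunk);
    auto t1 = std::chrono::steady_clock::now();
    for (size_t r = 0; r < reps; ++r)         // one bulk memcpy per scanline
        std::memcpy(dst.data(), src.data(), samples * sizeof(float));
    auto t2 = std::chrono::steady_clock::now();

    std::printf("per-sample copies: %.3f s   bulk copies: %.3f s   (dst[0]=%g)\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t2 - t1).count(), dst[0]);
    return 0;
}
```

The per-sample pattern is typically dominated by per-call and loop overhead rather than memory bandwidth, which is consistent with the comment that wider SIMD won't help at that copy length.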