[BUG] Slow read performance for multichannel EXRs in oiiotool
Describe the bug
I'm getting very slow read performance on multi-channel EXRs where I just need to read the RGBA channels and write out a .png file from them.
It ranges from 19s on my own machine (which seems crazy for a 55 MB EXR?) to 5s on the fastest machine I have around. On my own slow machine, taking the same EXR through Fusion to a PNG takes 1.3s (regardless of cold/hot cache).
The 19 seconds (but 1.3s through Fusion) is on:
- Windows 10
- 2x Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70 GHz
- Dual Xeon in HP workstation
- 128 GB RAM
- 4 GB GPU
The 5 seconds is on:
- Windows 11
- AMD Ryzen 9 9900X 12-Core Processor (4.40 GHz)
- 128 GB RAM
- 12 GB GPU
OpenImageIO version and dependencies
See test runs here.
OIIO 3.1.1.0dev | unknown arch?
Build compiler: MSVS 1943 | C++17/199711
HW features enabled at build: sse2
No CUDA support (disabled / unavailable at build time)
Dependencies: DCMTK NONE, expat 2.6.3, FFmpeg NONE, fmt 10.2.1, Freetype 2.13.2, GIF NONE, Imath 3.2.0, JPEG 80,
JXL NONE, libdeflate 1.23, Libheif NONE, libjpeg-turbo 3.0.4, LibRaw NONE, libuhdr 1.2.0, minizip-ng 4.0.7,
OpenColorIO 2.4.1, OpenCV NONE, OpenEXR 3.4.0, OpenJPEG NONE, OpenVDB NONE, PNG 1.6.47, Ptex NONE, Ptex NONE,
pystring 1.1.4, Robinmap 1.4.0, TBB 2022.2.0, TIFF 4.6.0, WebP 1.4.0, yaml-cpp 0.8.0, ZLIB 1.3.1, ZLIB 1.3.1
To Reproduce
Steps to reproduce the behavior:
oiiotool -v -i:ch=R,G,B,A input.exr -o output.png
I'm explicitly filtering with -i:ch=R,G,B,A to avoid reading the rest of the file; adding :now=1 doesn't make much of a difference either.
Other options, such as also passing the -ch flag, are included in the linked test runs with their timings. Using -i:ch=... gave the fastest results, but at 19s it still seems awful.
Here's the file I'm testing with: https://we.tl/t-Fjm4XVFw8z It'll only remain available for about four weeks.
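As a possible way to narrow things down, here is a minimal standalone timing sketch that reads only the RGBA channels through ImageInput directly, bypassing oiiotool. It assumes the attached file is saved as input.exr and that R,G,B,A end up as channels 0-3 in the spec (check spec.channelnames and adjust chbegin/chend otherwise); it should separate raw EXR decode time from any oiiotool/ImageBuf overhead.

```cpp
// Timing sketch: read only channels [0,4) of a multi-channel EXR as float.
// Assumes "input.exr" is the attached test file.
#include <OpenImageIO/imageio.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    using namespace OIIO;
    auto t0 = std::chrono::steady_clock::now();

    auto in = ImageInput::open("input.exr");
    if (!in) {
        std::fprintf(stderr, "open failed: %s\n", geterror().c_str());
        return 1;
    }
    const ImageSpec& spec = in->spec();
    std::vector<float> pixels((size_t)spec.width * spec.height * 4);
    // Channel-range overload: only channels [0,4) are requested.
    if (!in->read_image(0 /*subimage*/, 0 /*miplevel*/, 0 /*chbegin*/, 4 /*chend*/,
                        TypeDesc::FLOAT, pixels.data()))
        std::fprintf(stderr, "read failed: %s\n", in->geterror().c_str());
    in->close();

    auto t1 = std::chrono::steady_clock::now();
    std::printf("read of 4/%d channels took %.3f s\n", spec.nchannels,
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}
```

Compiling this against the same OIIO build and comparing its wall-clock time with the oiiotool run above would show how much of the 19s is the reader itself.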
Evidence
See test runs here.
Additional Context
This also came up on the ASWF Slack in a lengthy thread here.
Some notable comments from there:
Build options didn't make much of a difference?
So, I've been testing some build variants. Enabling more hardware features like SSSE3, SSE4.1, SSE4.2, AVX, etc. makes the executable stop working on my machine - likely due to AVX (it does work on newer machines, but there was no real measurable performance difference on these simple read/write test cases) - so I opted to go back to the sse2 default but add TBB, which does run. I've added the performance on my slow machine to the gist (see link below). There is no noticeable speed difference; it remains around 18/19 seconds on my machine.
For what it's worth, using
--frames 1001-1006 and --parallel-frames also takes about 20 seconds on my machine (so almost no time increase). Using --frames 1001-1030 with --parallel-frames takes about 53 seconds on my machine. So I can get good gains by allowing it to read many files in parallel. However, it still seems that the bulk of the time is the read performance itself.
I can confirm that things are probably too slow here as well -- Intel i7-8750H laptop 2018, ~8.2 seconds for the above command.
A profile shows that the vast majority of time is spent on conversion and memcpy'ing things around, in particular the ImageInput::read_tile -> convert_image sequence. Debugging shows that read_tile is looping through all 78 channels of this image even though the command line only wants to touch the first 4 [1]. Beyond that, yes, half<->float conversion is pretty bad.
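As a rough way to put a number on the half<->float conversion cost mentioned above, here is an illustrative micro-benchmark using OIIO::convert_image; the buffer dimensions and channel count are assumptions, not taken from the attached file.

```cpp
// Illustrative micro-benchmark (not a patch): time OIIO::convert_image for a
// HALF -> FLOAT conversion of a frame-sized buffer, roughly matching the
// conversion step the profile points at.
#include <OpenImageIO/imageio.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    using namespace OIIO;
    const int width = 4096, height = 2160, nchannels = 4;   // assumed sizes
    const size_t n = (size_t)width * height * nchannels;
    std::vector<uint16_t> src(n, 0x3C00);   // raw half bits for 1.0
    std::vector<float> dst(n);

    auto t0 = std::chrono::steady_clock::now();
    convert_image(nchannels, width, height, 1,
                  src.data(), TypeDesc::HALF, AutoStride, AutoStride, AutoStride,
                  dst.data(), TypeDesc::FLOAT, AutoStride, AutoStride, AutoStride);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("half->float convert_image: %.3f s for %zu samples (dst[0]=%g)\n",
                std::chrono::duration<double>(t1 - t0).count(), n, dst[0]);
    return 0;
}
```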
Additionally, copy_image is pretty slow too because it's issuing hundreds of 4-byte calls to memcpy [2]. No amount of SSE will really help at that copy length.
[1] https://github.com/AcademySoftwareFoundation/OpenImageIO/blob/main/src/libOpenImageIO/imageinput.cpp#L593
[2] https://github.com/AcademySoftwareFoundation/OpenImageIO/blob/main/src/libOpenImageIO/imageio.cpp#L1018
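To illustrate the copy_image observation above, here is a standalone micro-benchmark (illustrative only, not OIIO code; exact numbers will vary by machine and compiler) comparing one 4-byte memcpy per sample against a single contiguous memcpy per scanline.

```cpp
// Illustrative micro-benchmark: per-sample 4-byte memcpy vs. one bulk memcpy.
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    const size_t samples = 4096 * 4;          // one scanline, 4 channels
    const size_t reps    = 50000;             // repeat to get measurable time
    std::vector<float> src(samples, 1.0f), dst(samples);
    // volatile so the compiler can't fold the 4-byte memcpy into a single
    // store, mirroring the runtime-sized copies issued by copy_image.
    volatile size_t chunk = sizeof(float);

    auto t0 = std::chrono::steady_clock::now();
    for (size_t r = 0; r < reps; ++r)
        for (size_t i = 0; i < samples; ++i)  // one tiny memcpy per sample
            std::memcpy(&dst[i], &src[i], chunk);
    auto t1 = std::chrono::steady_clock::now();
    for (size_t r = 0; r < reps; ++r)         // one bulk memcpy per scanline
        std::memcpy(dst.data(), src.data(), samples * sizeof(float));
    auto t2 = std::chrono::steady_clock::now();

    std::printf("per-sample copies: %.3f s   bulk copies: %.3f s   (dst[0]=%g)\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t2 - t1).count(), dst[0]);
    return 0;
}
```

The per-sample pattern is typically dominated by per-call and loop overhead rather than memory bandwidth, which is consistent with the comment that wider SIMD won't help at that copy length.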