OpenImageIO icon indicating copy to clipboard operation
OpenImageIO copied to clipboard

[BUG] Internal "contiguize" operation is incorrect

Open jessey-git opened this issue 10 months ago • 7 comments

Describe the bug The internal "contiguize" operation used when shifting between different TypeDescs is broken if the formats also differ in number of channels. E.g. Taking data from a TypeDesc::FLOAT,1-channel format to a TypeDesc::UINT16,3-channel format.

To Reproduce Steps to reproduce the behavior:

  /* A simple 8x4 FLOAT, 1-channel image */
  const int width = 8;
  const int height = 4;
  float data[] = {
    0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0,
    0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0,
    0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0,
    0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0,
  };

  /* In memory format */
  const int mem_channels = 1;
  stride_t mem_xstride = sizeof(float) * mem_channels;
  stride_t mem_ystride = width * mem_xstride;
  uchar * mem_start = reinterpret_cast<uchar*>(data);
  ImageSpec mem_spec(width, height, mem_channels, TypeDesc::FLOAT);

  /* We always write using a negative y-stride so ensure we start at the end of the data. */
  mem_start = mem_start + ((stride_t(height) - 1) * mem_ystride);

  ImageBuf orig_buf(mem_spec, mem_start, mem_xstride, -mem_ystride, AutoStride);

  /* Write out as a 8x4 UINT16, 3-channel image... */
  ImageSpec file_spec(width, height, 3, TypeDesc::UINT16);

  unique_ptr<ImageOutput> out = ImageOutput::create("tif");
  if (!out) {
    return false;
  }
  if (out->open("t:/oiio_16bit_debug.tiff", file_spec)) {
    orig_buf.write(out.get());
    out->close();
  }

Expected behavior Per the above, there's only 2 values in the initial data set: 0.0f and 0.5f. Saving the file out should still be 2 color tones.

Evidence Attached is the resulting .tiff but I'll embed an enlarged image here that shows the color-shift that results from incorrectly assigning the pixel data in the contiguize operation: image

oiio_16bit_debug.zip

jessey-git avatar Aug 24 '23 01:08 jessey-git

It's possible that I'm misunderstand what you're doing, but I'm not sure anything is broken except possibly your example. Let me tell you what I think is happening, and one of us (50/50 odds) will have an "aha" moment.

  1. You have a float[32] array containing { 0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0 } repeated 4 times.
  2. You declare an ImageBuf that "wraps" this buffer, describe it as 8 x 4 x 1 channel x float (makes sense) and give it strides and offset that basically should reverse it from top to bottom.
  3. You open an ImageOutput and tell it you want to write a 8 x 4 x 3 channel x uint16 image.
  4. You call ImageBuf::wrote(ImageOutput*) hoping to write your ImageBuf.

Now, as it turns out, underneath, this calls:

out->write_image(bufformat, m_impl->m_localpixels, pixel_stride(),
                 scanline_stride(), z_stride());

So here is the problem: write_image will expect to be pointed to enough data to write the image you've asked it to write (8 x 4 x 3chans). It will do data type conversion for you (it knows it wants to write uint16, and you can see that the bufformat parameter tells it that you've passed a buffer of floats, and it will handle that). But there is nothing passed here -- and no support implied by write_image() -- for somehow deducing that you've given it the wrong shaped buffer.

The write_image call gets an xstride (space in bytes from one pixel to the next), and ystride (space in bytes from one scanline to the next), but there is no parameter that controls the channel stride or number of channels. While it will step from pixel to pixel using the strides you supply, it assumes that each pixel contains data that looks like datatype[nchannels]. So ultimately, you're just seeing your 1-channel data array being interpreted 3-channel pixels. Given the strides, that means you see [0,.5,.5] (cyan), [.5,.5,.5] (grey) three times, then [.5, .5, 0] (yellow), then [.5, 0, 0] (red, because the second 0 is from the NEXT scanline), then [0, 0, .5] (the last 0 then the 0 and 0.5 from the NEXT scanline). The "upper right" pixel is the weird one, because it's grabbing values that are past the end of your original array (you see it at the top, not bottom, because of your y reversal).


TL;DR : ImageOutput::write_image can convert data types (float -> uint16) and can handle arbitrary pixel and scanline strides, but it cannot turn a 1-channel buffer into a 3-channel image.

For that matter, it has no way to tell from the strides how many channels you have in the buffer, it just assumes it's enough for the shape of the image you are writing. Which is just like how you didn't tell it the width or height of the buffer; it also assumes that you have enough pixels to match the size of the image you are writing.

lgritz avatar Aug 26 '23 23:08 lgritz

Ah, yes, I think you're right. I was blinded by the magic that OIIO typically provides and thought it would somehow know what to do here too... but it really cannot.

Would the following be the "most efficient" way to splat the 1-channel original into a 3 or 4 channel final buffer (setting alpha to 1 for the 4-channel)?

  orig_nchannels = 1;
  final_nchannels = ...; // Will be 3 or 4

  if (orig_nchannels == 1 && final_nchannels != 1) {
    ImageBufAlgo::channels(
        final_buf, orig_buf, final_nchannels, {0, 0, 0, -1}, {0, 0, 0, 1.0f});
  }

  final_buf.write(...);

jessey-git avatar Aug 27 '23 00:08 jessey-git

I suspect that the most efficient way would be to allocate a second buffer of the right H x W x 3chan dimensions, then call OIIO::convert_image() (which is just a general strided data shuffler) once for each channel, to copy the grey image into the right slot for each channel in the combined buffer, and also can handle data conversions for you. There's also a related OIIO::parallel_convert_image() that does the same thing multithreaded using the thread pool. Then call ImageOutput::write_image using the new buffer.

What you suggest, using ImageBufAlgo::channels(), would also work.

Also you could consider allocating an ImageBuf of the final data type and dimensions, and use 3 calls to ImageBufAlgo::paste() to copy the one channel of your original ImageBuf-wrapped buffer into the new one:

ImageBuf buf2 (ImageSpec(w, h, 3, TypeUint16));
for (int c = 0; c < 3; ++c)
    ImageBufAlgo::paste(buf2, 0, 0, 0, i, orig_buf);

I'm not sure what the relative efficiencies of these three methods (raw buffer work with parallel_convert image, IBA::chanels(), IBA::paste()) would be. I'd be curious to hear if anybody wants to run the experiment. (As well as compare the cost of doing any of these versus the cost of writing the result to disk.)

lgritz avatar Aug 27 '23 01:08 lgritz

Looks like paste wins (some numbers below). I'll be going with that I think. Much appreciated for bringing those 2 other solutions to my attention.

Using a 8192x8192, 1-channel, float image to start with

Only operation; no Image output
convert_image : 390.4 ms
paste         : 171.9 ms
channels      : 137.0 ms

Channels is competitive, but allocates 2x the memory and hasn't paid the price for datatype conversion yet...

Operation + Image output
convert_image :  869.0 ms
paste         :  643.8 ms
channels      : 1192.9 ms

jessey-git avatar Aug 27 '23 08:08 jessey-git

Did you try parallel_convert_image? It's possible that convert_image is the only one of the three that isn't automatically using multithreading.

If you are doing the data conversion with paste and convert_image, then channels without data conversion will certainly be a lot faster.

For the "only operation" timings, I would recommend timing each of these choices both with and without data conversion, and both with and without multithreading (just pass nthreads=1 to the functions that take it to force single thread).

The reason channels looks so fast to shuffle and so bad with the write is probably because if you didn't do data conversion in channels, that cost is shifted to the write itself. That's also why it's 2x the memory -- it has to allocate a float buffer for the result, instead of a uint16 buffer. You can fix this by preparing the buffer ahead of time:

// Easy but no data conversion:
ImageBuf newbuf = ImageBufAlgo::channels(oldbuf, ...);

// More steps but get control over the allocation and data conversion:
// Preallocate a buffer with the desired format and do the channel shuffling
// into that buffer.
ImageSpec newspec = oldbuf.spec();
newspec.nchannels = 3;
newspec.set_format(TypeUint16);
ImageBuf newbuf(newspec, InitializePixels::No);
ImageBufAlgo::channels(newbuf, oldbuf, ...);

But I think another high point of these timings is to notice that no matter which choice you make, the actual writing of the file is even more expensive. We try really hard not to be wasteful in OIIO, but a lot of the times that we decline the chance to implement more involved optimizations (such as adding GPU support to image operations) are because by the typical run is going to be dominated by the file write a the end.

lgritz avatar Aug 27 '23 16:08 lgritz

Those timings above were using parallel_convert_image actually already.

Using non-parallel convert_image is 2000ms with just the operation and 2500ms with the image write.

As for pre-allocating the buf for channels I thought that might help too. But the following inside channels wipes it right away: https://github.com/OpenImageIO/oiio/blob/master/src/libOpenImageIO/imagebufalgo_channels.cpp#L139 Confirmed with debug output too. Is that a bug?

// When letting channels just do the operations with the original float format
OIIO DEBUG: IB allocated 768 MB, global IB memory now 768 MB
perf_test_channels: 141.5842
OIIO DEBUG: IB freed 768 MB, global IB memory now 768 MB

// When pre-allocating the 3-chan UINT16 ImageBuf and giving channels that instead
OIIO DEBUG: IB allocated 384 MB, global IB memory now 384 MB
OIIO DEBUG: IB freed 384 MB, global IB memory now 384 MB
OIIO DEBUG: IB allocated 768 MB, global IB memory now 768 MB
perf_test_channels: 144.0540
OIIO DEBUG: IB freed 768 MB, global IB memory now 768 MB

jessey-git avatar Aug 27 '23 18:08 jessey-git

Yes, I think there are several important optimizations that we can make to channels, some of which have occurred to me before but I've never gotten around to looking at in detail. In order of less work / less gain to more work / potentially more gain:

  1. The reset() call that you point out should probably take a second parameter of InitializePixels::No to at least save a needless pass over the full buffer to zero it out (when you're just going to copy into it anyway, fully filling the data).
  2. We should only do the reset() if the size/shape as specified by the channels parameter is not the same as the old one. That is, there are cases where we only need to copy the new channel count, formats, and names, but not to reallocate/refill the memory.
  3. We should detect a specification of channels that does not change the order or names or types of channels at all (essentially a no-op) and just take an early out or make it a simple copy. This happens a lot, and oiiotool catches this case and takes a shortcut, but maybe it should also happen for the IBA calls as well.

lgritz avatar Aug 27 '23 22:08 lgritz