Aborting OutOfMemoryError caused by TwelveMonkeys TIF reader plugin
Reported by a user; enableExternalParsing = true didn't work around the issue. After taking a look at the heap dump she sent, a single MemoryCache object used ~23GB of heap, and it was created by the TwelveMonkeys TIFFImageReader:
Thread stack:
```
pool-3-thread-189
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Ljava/nio/ByteBuffer;)I (Unknown Source)
at com.twelvemonkeys.imageio.stream.MemoryCache.readBlock([B)I (MemoryCache.java:77)
at com.twelvemonkeys.imageio.stream.MemoryCache.fetchBlock()[B (MemoryCache.java:60)
at com.twelvemonkeys.imageio.stream.MemoryCache.read(Ljava/nio/ByteBuffer;)I (MemoryCache.java:99)
at com.twelvemonkeys.imageio.stream.BufferedChannelImageInputStream.fillBuffer()Z (BufferedChannelImageInputStream.java:156)
at com.twelvemonkeys.imageio.stream.BufferedChannelImageInputStream.read()I (BufferedChannelImageInputStream.java:177)
at com.twelvemonkeys.imageio.metadata.tiff.TIFFReader.isValidOffset(Ljavax/imageio/stream/ImageInputStream;J)Z (TIFFReader.java:368)
at com.twelvemonkeys.imageio.metadata.tiff.TIFFReader.readLinkedIFDs(Ljavax/imageio/stream/ImageInputStream;)Lcom/twelvemonkeys/imageio/metadata/tiff/TIFFDirectory; (TIFFReader.java:143)
at com.twelvemonkeys.imageio.metadata.tiff.TIFFReader.read(Ljavax/imageio/stream/ImageInputStream;)Lcom/twelvemonkeys/imageio/metadata/Directory; (TIFFReader.java:132)
at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageReader.readMetadata()V (TIFFImageReader.java:197)
at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageReader.readIFD(I)V (TIFFImageReader.java:390)
at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageReader.getWidth(I)I (TIFFImageReader.java:439)
at iped.utils.ImageUtil.doGetSubSampledImage(Ljava/lang/Object;IILiped/utils/ImageUtil$BooleanWrapper;Ljava/lang/String;)Ljava/awt/image/BufferedImage; (ImageUtil.java:185)
at iped.utils.ImageUtil.getSubSampledImage(Ljava/io/InputStream;IILiped/utils/ImageUtil$BooleanWrapper;Ljava/lang/String;)Ljava/awt/image/BufferedImage; (ImageUtil.java:148)
at iped.engine.task.ImageThumbTask.createImageThumb(Liped/data/IItem;Ljava/io/File;)V (ImageThumbTask.java:295)
at iped.engine.task.ImageThumbTask$ThumbCreator.run()V (ImageThumbTask.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object; (Unknown Source)
at java.util.concurrent.FutureTask.run()V (Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run()V (Unknown Source)
at java.lang.Thread.run()V (Unknown Source)
```
The user was using IPED 4.1.x, right? If that was the case, processing with the master branch would be a valid attempt, as the library was upgraded in #1966. Not sure if it is possible, but it would be great to have the triggering TIF.
Yes, she used 4.1.5. I'll try to get the triggering TIF, reproduce and test with master. Thanks @wladimirleite!
PS: another user got an OOME with an 80GB heap. I suspect the issue may be the same; still investigating with him...
By the log, a 37GB!!! TIF file was being processed when the OOME was thrown. I just asked her for the file.
No reference to TIF files was found in the log sent by the second user who reported an OOME.
I got the file and reproduced the OOME with both 4.1.4 and master using a 10GB heap. Excluding the 12 monkeys TIF ImageIO plugin fixed it.
@lfcnassif, you can send me the TIF if you like. I have the TwelveMonkeys code/environment set up for testing, so I can try to narrow down where exactly the problem is, and maybe find a workaround or report an issue to them.
Thanks @wladimirleite! I can send you the file when you get back to work after New Year. Obviously it's a corrupted TIF. Just one question: we added the 12 monkeys TIF plug-in because it gives better results on well-formed TIFs than the Java 11 default decoder, right?
Sure! I will be back to work on Wednesday.
In #1530 there is a long discussion and test results about the usage of this plug-in.
Thanks! My memory always tricks me :)
Mine too! I had to look it up :-)
I think I found a workaround (in IPED). I will finish the investigation and post an update here in the next few days.
I remember we enabled the ImageIO in-memory cache (a static setting) by default; maybe it's related to this issue.
It is related! I will post an overview of what is causing the issue once I fully understand it. I think I am almost there :-)
Well, let me try to explain what I found out after a lot of tests and code analysis...
I believe the issue is not directly related to the TwelveMonkeys TIFFImageReader.
The default plugin (com.sun.imageio.plugins.tiff.TIFFImageReader) "rejects" (does not support) this large file, so the TwelveMonkeys plugin ends up being used even in TiffPageParser (where the default plugin is preferred).
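For context, this fallback is just the standard ImageIO service lookup: each registered reader is asked whether it can decode the input, and a reader that rejects the file is skipped. Below is a minimal sketch of preferring the default JDK reader over other registered ones; the package-name check is an illustration only, not the actual TiffPageParser code.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class TiffReaderSelector {

    // Illustration: prefer the JDK TIFF reader when its SPI accepts the
    // input; otherwise fall back to whatever else is registered (e.g. the
    // TwelveMonkeys TIFFImageReader). A reader whose SPI "rejects" the file
    // is simply never returned by getImageReaders().
    public static ImageReader pickReader(ImageInputStream iis) {
        ImageReader fallback = null;
        Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
        while (readers.hasNext()) {
            ImageReader reader = readers.next();
            if (reader.getClass().getName().startsWith("com.sun.imageio")) {
                return reader; // default JDK plugin accepted the file
            }
            fallback = reader;
        }
        return fallback; // e.g. TwelveMonkeys, if the JDK reader rejected it
    }
}
```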
When we use ImageIO.setUseCache(false), no disk-based cache is used (only memory). That provided a major speed-up on image-related tasks.
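For reference, this is the standard javax.imageio switch in question. A minimal demo of its effect on streams created from a plain InputStream follows; the exact stream class returned depends on which stream providers are registered.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.imageio.ImageIO;
import javax.imageio.stream.ImageInputStream;

public class UseCacheDemo {
    public static void main(String[] args) throws Exception {
        // Global static switch: with the disk cache disabled, streams created
        // from a plain InputStream must cache consumed bytes on the heap.
        ImageIO.setUseCache(false);

        InputStream in = new ByteArrayInputStream(new byte[16]);
        try (ImageInputStream iis = ImageIO.createImageInputStream(in)) {
            // On a plain JDK this prints MemoryCacheImageInputStream; with the
            // TwelveMonkeys stream plugins installed, a MemoryCache-backed
            // stream (as in the stack trace above) may be used instead.
            System.out.println(iis.getClass().getName());
        }
    }
}
```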
On the other hand, when reading from an InputStream, the TIFF image reader needs to keep the already-read content somewhere (in this case memory, as the disk cache is disabled), because it may need to go backwards in the stream and reread bytes it has already consumed.
In the case of this large file, the offset of the metadata required to decode it is close to its end, so the reader tries to read nearly the whole file and store it in memory. When the cache reached ~24GB, an OOME was thrown.
```java
ImageInputStream iis = ImageIO.createImageInputStream(source);
```
In the line above, if source is a File, the reader will seek directly to the metadata offset and read it, without having to cache anything in memory. If source is a FileInputStream, it will have the same behavior, as the ImageInputStream created is seekable. However, if source is a TikaInputStream, the ImageIO classes do not know how to deal with it and treat it as a generic InputStream. In such a case, a cache will be used: with ImageIO.setUseCache(false) it uses memory, while with ImageIO.setUseCache(true) a temporary file is created.
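To make the distinction concrete, here is a small standalone demo (plain ImageIO, not IPED code; again, the exact classes printed depend on the registered providers, which TwelveMonkeys partly replaces):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.imageio.ImageIO;
import javax.imageio.stream.ImageInputStream;

public class SourceTypeDemo {
    public static void main(String[] args) throws Exception {
        File file = new File(args[0]);
        ImageIO.setUseCache(false);

        // File source: typically a seekable FileImageInputStream; the reader
        // can jump straight to the metadata offset, nothing is cached.
        try (ImageInputStream iis = ImageIO.createImageInputStream(file)) {
            System.out.println("File        -> " + iis.getClass().getName());
        }

        // Generic InputStream source (what a TikaInputStream looks like to
        // ImageIO): with the disk cache disabled, every consumed byte is kept
        // in memory so the reader can seek backwards -- the MemoryCache that
        // grew to ~23GB in the heap dump above.
        InputStream generic = new BufferedInputStream(new FileInputStream(file));
        try (ImageInputStream iis = ImageIO.createImageInputStream(generic)) {
            System.out.println("InputStream -> " + iis.getClass().getName());
        }
    }
}
```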
I think we can improve the current behavior (which always passes a generic InputStream as source). Whenever we already have a file, use the file. If the Item data is loaded in a byte[] (which I guess is the case for smaller files), we can use a ByteArrayImageInputStream. If we don't have a file, for image files with more than N bytes (e.g. 64M), it is better to create a temporary file, to avoid excessive memory consumption.
My idea is to implement a method like item.createImageInputStream(), so this logic does not need to be replicated in the different places that may need to create an ImageInputStream for an item.
Great explanation @wladimirleite, thank you!
I agree with this general approach! But I'm not sure about the best place in the API to put it.
Yes, this is very tricky because there are different usages across the application. I am also unsure about that, but I already made some code changes and still need to test them. Let me finish the tests, then I will create a PR and you can decide the best way to organize the changes. Moving the code shouldn't be too hard.
@lfcnassif, after starting and stopping work on this issue several times, I finally finished the implementation I had in mind. I am running a set of tests and will analyze the results carefully to avoid any regression before submitting a PR, but so far the tests seem fine and the code changes were limited to specific points.
The implemented idea was more or less described before: the basic idea is to get an ImageInputStream from an IItemReader with the following priorities (a rough sketch follows below):
- ByteArrayImageInputStream, if the item is cached (data is available);
- FileImageInputStream, if the item has a temporary file or has a file;
- MemoryCacheImageInputStream, if the item's length is less than 128 MB;
- otherwise (very large files), a temporary file is created to avoid memory issues, and then FileImageInputStream is used.
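A rough sketch of that dispatch logic, assuming hypothetical accessors: the itemData/itemFile/itemStream/itemLength parameters stand in for the real IItemReader API, whose method names may differ. ByteArrayImageInputStream is the TwelveMonkeys class; the others are standard javax.imageio.stream classes. Temporary-file cleanup is omitted for brevity.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import javax.imageio.stream.FileImageInputStream;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;
import com.twelvemonkeys.imageio.stream.ByteArrayImageInputStream;

public class ItemImageInputStreams {
    private static final long MAX_MEMORY_CACHED = 128L * 1024 * 1024; // 128 MB

    // Hypothetical sketch: the four parameters stand in for the real
    // IItemReader accessors, whose names may differ in IPED.
    public static ImageInputStream create(byte[] itemData, File itemFile,
            InputStream itemStream, long itemLength) throws IOException {
        if (itemData != null) {
            // 1. Data already cached in memory: wrap the byte[] directly,
            //    with no extra copy (TwelveMonkeys class).
            return new ByteArrayImageInputStream(itemData);
        }
        if (itemFile != null) {
            // 2. Backed by a file (or temporary file): seekable, no caching.
            return new FileImageInputStream(itemFile);
        }
        if (itemLength < MAX_MEMORY_CACHED) {
            // 3. Small item with no file: a memory cache is acceptable.
            return new MemoryCacheImageInputStream(itemStream);
        }
        // 4. Very large item with no file: spool to a temporary file to avoid
        //    unbounded heap growth, then read it seekably.
        Path tmp = Files.createTempFile("iped-image-", ".tmp");
        Files.copy(itemStream, tmp, StandardCopyOption.REPLACE_EXISTING);
        return new FileImageInputStream(tmp.toFile());
    }
}
```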
Currently, in most situations we use ImageIO.createImageInputStream(InputStream), which will create a memory cache for the InputStream even if its content is already in memory.
The new approach should save some memory during evidence processing, and avoid OOME for very large files (like the TIF that motivated this issue).
It is also more deterministic, as ImageIO.createImageInputStream() sometimes creates TwelveMonkeys classes, while other times it uses Java classes, depending on the input.
Great, thank you very much @wladimirleite!