Slow performance and memory errors caused by calls to fixOrientationFromUrlOrBlobOrFile

Open Balearica opened this issue 2 years ago • 0 comments

Overview

Note: If you are a user experiencing a memory error and just want a fix, skip to the bottom.

In version 2.1.2, a feature was added to auto-rotate images from files based on exif orientation data, and this was expanded to run on base64-encoded images in 2.1.3.

  • Original pull request: https://github.com/naptha/tesseract.js/pull/440
  • Current code: https://github.com/naptha/tesseract.js/blob/master/src/worker/browser/loadImage.js#L61

I have not tested the Node implementation, but the browser implementation caused several issues. Notably, the change caused the function fixOrientationFromUrlOrBlobOrFile (which calls blueimpLoadImage) to be run on every image, regardless of whether exif data is present. This function is quite slow, so the change significantly increases runtime. It also causes a memory leak that several users have reported (#556, #476, and #446 are likely related).

Performance

The table below shows the performance impact of the addition of fixOrientationFromUrlOrBlobOrFile. I used 7 base64-encoded .png images (without any exif orientation data), and ran with a single worker on Chrome on Windows 10.

| Engine | Without fixOrientation (s) | With fixOrientation (s) | Change (%) |
| --- | --- | --- | --- |
| Legacy | 6.1 | 15.5 | 154% |
| LSTM | 30 | 40.2 | 34% |

In both cases, the fixOrientationFromUrlOrBlobOrFile calls added about 10 seconds to the runtime (9.4 for Legacy, 10.2 for LSTM). This is significant for both engines, and it more than doubles the runtime of the faster Legacy Tesseract engine.
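The Change column and the roughly 10-second overhead follow directly from the two runtime columns. A quick check of the arithmetic, assuming the runtimes are in seconds (the helper name `overhead` is illustrative only):

```javascript
// Runtimes copied from the table above (assumed to be in seconds).
const results = [
  { engine: "Legacy", without: 6.1, withFix: 15.5 },
  { engine: "LSTM", without: 30, withFix: 40.2 },
];

// Absolute and relative cost of the extra fixOrientation calls.
function overhead({ without, withFix }) {
  const added = withFix - without;
  return {
    addedSeconds: Number(added.toFixed(1)),
    changePercent: Math.round((added / without) * 100),
  };
}

for (const row of results) {
  const { addedSeconds, changePercent } = overhead(row);
  console.log(`${row.engine}: +${addedSeconds} s (${changePercent}%)`);
}
```

Because the added cost is roughly constant per run, the relative impact is largest on the engine that was fastest to begin with.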

Memory Leak

The fixOrientationFromUrlOrBlobOrFile function also causes a memory leak. In the version without fixOrientationFromUrlOrBlobOrFile, the page's memory footprint does not rise significantly between pages when running a multi-page job. In the version with fixOrientationFromUrlOrBlobOrFile, memory rises with every page recognized until either (1) the worker is killed or (2) the page crashes.
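One way to make the "memory rises with every page" observation concrete is to note a memory reading after each page and summarize the trend. The sketch below is only an illustration: `summarizeGrowth` is a hypothetical helper, how the readings are collected (for example, from the browser's task manager) is left open, and the sample numbers are made up for the example rather than taken from the tests in this issue.

```javascript
// Summarize per-page memory readings (any consistent unit, e.g. MB).
// A leak shows up as growth on every page rather than a plateau.
function summarizeGrowth(readings) {
  if (readings.length < 2) {
    return { total: 0, perPage: 0, risesEveryPage: false };
  }
  const total = readings[readings.length - 1] - readings[0];
  const risesEveryPage = readings.every(
    (value, i) => i === 0 || value > readings[i - 1]
  );
  return { total, perPage: total / (readings.length - 1), risesEveryPage };
}

// Made-up readings, for illustration only.
const leaking = [200, 320, 440, 560, 680];
const stable = [200, 215, 210, 214, 212];
console.log(summarizeGrowth(leaking)); // risesEveryPage is true
console.log(summarizeGrowth(stable)); // risesEveryPage is false
```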

Conclusion

I have not investigated why fixOrientationFromUrlOrBlobOrFile causes these issues, and I cannot speak to a better way to implement exif orientation detection and auto-rotation; perhaps others know of one. However, given the issues described above, and the fact that this feature is not included in Tesseract (so it is by no means obligatory in a JS port of Tesseract), I believe simply reverting to the code before this feature was added would be preferable to leaving things as they are. I can create a merge request if others simply want to revert to the earlier implementation.

Quick Fix

A branch that is identical to the current master except for reverting to the old version of loadImage.js can be found here. A replacement worker.min.js file is attached below. This should not experience the issues described above.

worker.min.zip

Balearica, Mar 27 '22 07:03

Closing as I removed the offending code from the master branch. While I stated above that I believed simply reverting this update would be an improvement, I did add a line to recognize .jpeg exif orientation tags (see below), which keeps the exif-based rotation working (at least for the test images I used). Feel free to open an issue if you encounter a jpeg image where it doesn't work.

https://github.com/naptha/tesseract.js/blob/8b567609e38e728a225852632731dd19483342db/src/worker-script/utils/setImage.js#L20
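For readers curious what checking the orientation tag involves, here is a minimal sketch of reading it straight from the JPEG bytes. It is not the code at the link above and is not part of tesseract.js; `readExifOrientation` and `readTiffOrientation` are hypothetical names, and the parser assumes a well-formed file with no fill bytes between segments.

```javascript
// Read the orientation tag (0x0112) from the first IFD of a TIFF block.
function readTiffOrientation(view, tiff, end) {
  if (tiff + 8 > end) return null;
  const byteOrder = view.getUint16(tiff);
  if (byteOrder !== 0x4949 && byteOrder !== 0x4d4d) return null;
  const little = byteOrder === 0x4949; // "II" = little-endian, "MM" = big-endian
  const ifd = tiff + view.getUint32(tiff + 4, little);
  if (ifd + 2 > end) return null;
  const count = view.getUint16(ifd, little);
  for (let i = 0; i < count; i++) {
    const entry = ifd + 2 + i * 12; // each IFD entry is 12 bytes
    if (entry + 12 > end) return null;
    if (view.getUint16(entry, little) === 0x0112) {
      const value = view.getUint16(entry + 8, little);
      return value >= 1 && value <= 8 ? value : null;
    }
  }
  return null;
}

// Return the EXIF orientation (1-8) of a JPEG given as a Uint8Array,
// or null if the data is not a JPEG or carries no orientation tag.
function readExifOrientation(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Every JPEG starts with the SOI marker 0xFFD8; PNG and others exit here.
  if (view.byteLength < 4 || view.getUint16(0) !== 0xffd8) return null;

  let offset = 2;
  while (offset + 4 <= view.byteLength) {
    const marker = view.getUint16(offset);
    const size = view.getUint16(offset + 2); // includes the 2 length bytes
    // Stop on malformed data or at start-of-scan (image data follows).
    if ((marker & 0xff00) !== 0xff00 || marker === 0xffda) return null;
    const isExif =
      marker === 0xffe1 &&
      offset + 10 <= view.byteLength &&
      view.getUint32(offset + 4) === 0x45786966; // ASCII "Exif"
    if (isExif) {
      const end = Math.min(offset + 2 + size, view.byteLength);
      return readTiffOrientation(view, offset + 10, end);
    }
    offset += 2 + size;
  }
  return null;
}

// A PNG signature is rejected immediately, with no further parsing.
console.log(readExifOrientation(new Uint8Array([0x89, 0x50, 0x4e, 0x47]))); // null
```

A result of null or 1 means no rotation is needed, so non-JPEG input such as the .png images used in the benchmark can skip orientation handling entirely.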

Balearica, Aug 20 '22 03:08