tesseract.js
Slow performance and memory errors caused by calls to fixOrientationFromUrlOrBlobOrFile
Overview
Note: If you are a user experiencing a memory error and just want a fix, skip to the bottom.
In version 2.1.2 a feature was added to auto-rotate images from files based on exif orientation data, and this was expanded to run on base64 encoded images in 2.1.3.
- Original pull request: https://github.com/naptha/tesseract.js/pull/440
- Current code: https://github.com/naptha/tesseract.js/blob/master/src/worker/browser/loadImage.js#L61
I have not tested the node implementation, but the browser implementation caused several issues. Notably, the change caused the function fixOrientationFromUrlOrBlobOrFile (which calls blueimpLoadImage) to be run on all images, regardless of whether exif data is even present. Unfortunately, this function is quite slow, so this change significantly increases runtime. Additionally, it causes a memory leak that several users have reported (#556, #476, #446 likely related).
Performance
The table below shows the performance impact of the addition of fixOrientationFromUrlOrBlobOrFile. I used 7 base64-encoded .png images (without any exif orientation data), and ran with a single worker on Chrome on Windows 10.
Engine | Without fixOrientation (s) | With fixOrientation (s) | Change (%) |
---|---|---|---|
Legacy | 6.1 | 15.5 | 154% |
LSTM | 30 | 40.2 | 34% |
In both cases, the fixOrientationFromUrlOrBlobOrFile calls added about 10 seconds to runtime. This is significant for both engines, but more than doubles runtime for the faster Legacy Tesseract engine.
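As a quick check of the arithmetic behind the table, the sketch below recomputes the added time and the percentage change from the four measured runtimes. It assumes the runtimes are in seconds (consistent with the "about 10 seconds" figure above); the names timings, percentChange, and summary are illustrative only.

```javascript
// Runtimes copied from the table above (assumed to be seconds).
const timings = [
  { engine: "Legacy", without: 6.1, withFix: 15.5 },
  { engine: "LSTM", without: 30, withFix: 40.2 },
];

// Relative change in percent, rounded to the nearest whole number.
function percentChange(before, after) {
  return Math.round(((after - before) / before) * 100);
}

// Absolute and relative cost of the orientation fix per engine.
const summary = timings.map(({ engine, without, withFix }) => ({
  engine,
  addedSeconds: Number((withFix - without).toFixed(1)),
  changePercent: percentChange(without, withFix),
}));

console.log(summary);
```

The absolute cost is nearly the same for both engines (9.4 s and 10.2 s), which is why the relative impact is so much larger for the faster Legacy engine.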
Memory Leak
The fixOrientationFromUrlOrBlobOrFile
function also causes a memory leak. In the version without fixOrientationFromUrlOrBlobOrFile
the page's memory footprint does not rise significantly between pages when running a multi-page job. In the version with fixOrientationFromUrlOrBlobOrFile
memory rises with every page recognized until either (1) the worker is killed or (2) the page crashes.
Conclusion
I have not investigated the specifics of why fixOrientationFromUrlOrBlobOrFile causes this issue, and cannot speak to a better way to implement exif orientation detection and auto-rotation. Others may know of a better way to implement this functionality. However, given the issues described above with this implementation, and the fact that this feature is not included in Tesseract (so is by no means obligatory in a JS port of Tesseract), I believe simply reverting to before this was added would be preferable to leaving things as they are. I can create a merge request if others simply want to revert to the earlier implementation.
Quick Fix
A branch that is identical to the current master except for reverting to the old version of loadImage.js can be found here. A replacement worker.min.js file is attached below. This should not experience the issues described above.
Closing as I removed the offending code from the master branch. While I stated above that I believed simply reverting this update would be an improvement, I did add a line to recognize .jpeg exif orientation tags (see below), which keeps the exif-based rotation working (at least for the test images I used). Feel free to open an issue if you encounter a jpeg image where it doesn't work.
https://github.com/naptha/tesseract.js/blob/8b567609e38e728a225852632731dd19483342db/src/worker-script/utils/setImage.js#L20
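For readers unfamiliar with where the orientation tag lives: in a JPEG it is stored in the APP1 (Exif) segment as TIFF tag 0x0112, with values 1 through 8. The sketch below shows one way to read it directly from the image bytes. It is not the code in the commit linked above; readJpegExifOrientation and its helpers are hypothetical names, and the parser only handles the common case of the tag sitting in the first IFD.

```javascript
// True if the bytes at `pos` are the Exif header "Exif\0\0".
function isExifHeader(view, pos) {
  return view.getUint32(pos) === 0x45786966 && view.getUint16(pos + 4) === 0;
}

// Reads the orientation tag from the TIFF block that starts at `tiff`.
function readOrientationFromTiff(view, tiff) {
  const order = view.getUint16(tiff);
  if (order !== 0x4949 && order !== 0x4d4d) return null; // "II" or "MM"
  const little = order === 0x4949;
  if (view.getUint16(tiff + 2, little) !== 0x002a) return null;
  const ifd = tiff + view.getUint32(tiff + 4, little);
  const entries = view.getUint16(ifd, little);
  for (let i = 0; i < entries; i++) {
    const entry = ifd + 2 + i * 12; // each IFD entry is 12 bytes
    if (view.getUint16(entry, little) === 0x0112) {
      const value = view.getUint16(entry + 8, little);
      return value >= 1 && value <= 8 ? value : null;
    }
  }
  return null;
}

// Hypothetical helper: returns the EXIF orientation (1-8) of a JPEG given
// as a Uint8Array, or null if the input is not a JPEG or has no such tag.
function readJpegExifOrientation(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  try {
    if (view.getUint16(0) !== 0xffd8) return null; // missing JPEG SOI marker
    let offset = 2;
    while (offset + 4 <= view.byteLength) {
      const marker = view.getUint16(offset);
      // Stop on a malformed marker or at start-of-scan (image data follows).
      if (marker >> 8 !== 0xff || marker === 0xffda) return null;
      const size = view.getUint16(offset + 2);
      if (marker === 0xffe1 && isExifHeader(view, offset + 4)) {
        return readOrientationFromTiff(view, offset + 10);
      }
      offset += 2 + size; // skip the marker plus the segment payload
    }
  } catch (err) {
    // Truncated or malformed data reads out of bounds; treat as "no tag".
    if (!(err instanceof RangeError)) throw err;
  }
  return null;
}
```

A check like this only inspects the file header, so non-JPEG inputs such as the .png images used in the benchmark return null after reading two bytes.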