tesseract.js
Fix asynchronous caching bug
Many open issues appear to stem from two problems in how caching currently works.
- We assume that workers are created synchronously, and violating this assumption creates invalid cache files
- We assume that all cache files are valid
The former appears to be the most common cause of invalid cache data, since the requirement is not obvious to users. However, the cache may be invalid for other reasons. For example, until the last version the cache was often invalid because langData responses were cached (see #585). It is therefore possible that not all of the bugs listed below were directly caused by creating workers asynchronously, but solving the async issue will hopefully resolve most of them.
Related issues:
- #414
- #439
- #462
- #536
- #576
- #579
- #602
Upon further investigation, this appears to already be fixed (at least for Node.js). The following code snippet consistently throws an error in Version 2 but does not throw an error in Version 3.
const { createWorker, createScheduler } = require('../../');

const scheduler = createScheduler();

// Creates a worker, loads English, and adds the worker to the scheduler
const workerGen = async () => {
  const worker = createWorker({ cachePath: '.' });
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  scheduler.addWorker(worker);
};

const workerN = 10;
(async () => {
  // Start all workers at once, without awaiting each one in turn,
  // so that their loadLanguage calls overlap
  const resArr = Array(workerN);
  for (let i = 0; i < workerN; i++) {
    resArr[i] = workerGen();
  }
  await Promise.all(resArr);
  /** Add 10 recognition jobs */
  const results = await Promise.all(Array(10).fill(0).map(() => (
    scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png').then((x) => console.log(x.data.text))
  )));
  await scheduler.terminate(); // It also terminates all workers.
})();
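For contrast, here is a minimal sketch of the pattern the caching code assumes: workers are set up one at a time, so no two setup steps overlap. The helper name `createSequentially` is hypothetical and not part of tesseract.js. The factory is passed in as a parameter so the sketch does not depend on a particular library version.

```javascript
// Hypothetical helper (not part of tesseract.js): runs an async factory
// `count` times, strictly one call at a time.
async function createSequentially(factory, count) {
  const created = [];
  for (let i = 0; i < count; i++) {
    // Await each call before starting the next, so setup steps never overlap.
    created.push(await factory(i));
  }
  return created;
}
```

In the snippet above, this would mean calling `await createSequentially(workerGen, workerN)` in place of the loop and `Promise.all(resArr)`. Startup becomes slower, but this should keep concurrent `loadLanguage` calls from touching the same cache file.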
While this issue seems to be largely resolved in version 3 (as stated above), one contributing factor appears to be that when cacheMethod=='write' (the default option) the cache file is overwritten on every call to loadLanguage, even if the data was sourced from the cache file. In other words, the cache file is frequently overwritten with identical contents.
https://github.com/naptha/tesseract.js/blob/dd6c40b6818468f45cf006e844501e6afdb377b1/src/worker-script/index.js#L134-L136
I changed this in the dev/v4 branch so that the cache file is no longer rewritten in this case. This should reduce the number of times the cache is overwritten, and therefore the potential for the file to become corrupted.
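The idea can be sketched roughly as follows. This is not the actual code from the dev/v4 branch. The function `loadWithCache` and its `readCache`, `writeCache`, and `fetchData` parameters are hypothetical stand-ins, and only the 'write' cache method described above is modeled.

```javascript
// Illustrative sketch only: none of these names come from the tesseract.js source.
// readCache, writeCache, and fetchData are hypothetical async callbacks.
async function loadWithCache({ key, cacheMethod, readCache, writeCache, fetchData }) {
  // Try the cache first; a missing entry is assumed to come back as undefined or null.
  let data = await readCache(key);
  const fromCache = data !== undefined && data !== null;

  if (!fromCache) {
    // Cache miss: get the data from its original source.
    data = await fetchData(key);
  }

  // Write only when the data did NOT come from the cache, so an existing
  // cache file is never rewritten with identical contents.
  if (cacheMethod === 'write' && !fromCache) {
    await writeCache(key, data);
  }
  return data;
}
```

With this structure, repeated loads of the same language write the cache once, on the first miss. Every later call only reads it.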