Fix asynchronous caching bug

Open Balearica opened this issue 1 year ago • 2 comments

Many open issues appear to stem from two problems in how caching currently works.

  1. We assume that workers are created synchronously, and violating this assumption creates invalid cache files
  2. We assume that all cache files are valid

The former appears to be the most common cause of invalid cache data, since the requirement is not obvious to users. However, a cache may be invalid for other reasons. For example, until the last version the cache was often invalid because langData responses were cached (see #585). It is therefore possible that not all of the bugs listed below were caused directly by creating workers asynchronously, but solving the async issue should hopefully resolve most of them.
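
To illustrate the first assumption, here is a minimal sketch of a user-side workaround that creates workers one at a time, reading "synchronously" as "each worker finishes initializing before the next one starts". The helper name `createWorkersSequentially` and the `makeWorker` factory are hypothetical and are not part of tesseract.js.

```javascript
// Hypothetical helper, not part of tesseract.js.
// `makeWorker` is any async factory that creates and fully initializes
// one worker (load, loadLanguage, initialize) before resolving.
async function createWorkersSequentially(makeWorker, count) {
  const workers = [];
  for (let i = 0; i < count; i++) {
    // Awaiting inside the loop means the next worker is not started until
    // this one has resolved, so their setup steps (including any cache
    // write performed during setup) do not overlap.
    workers.push(await makeWorker(i));
  }
  return workers;
}
```

Compared with collecting pending promises and passing them to `Promise.all`, this trades a slower startup for non-overlapping initialization.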

Related issues:

  1. #414
  2. #439
  3. #462
  4. #536
  5. #576
  6. #579
  7. #602

Balearica • Sep 18 '22 05:09

Upon further investigation, this appears to be fixed already (at least for Node.js). The following code snippet consistently throws an error in Version 2 but does not throw an error in Version 3.

const { createWorker, createScheduler } = require('../../');

const scheduler = createScheduler();

// Creates worker and adds to scheduler
const workerGen = async () => {
  const worker = createWorker({cachePath: "."});
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  scheduler.addWorker(worker);
}

const workerN = 10;
(async () => {
  const resArr = Array(workerN);
  for (let i=0; i<workerN; i++) {
    resArr[i] = workerGen();
  }
  await Promise.all(resArr);
  /** Add 10 recognition jobs */
  const results = await Promise.all(Array(10).fill(0).map(() => (
    scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png').then((x) => console.log(x.data.text))
  )));
  await scheduler.terminate(); // It also terminates all workers.
})();

Balearica • Sep 18 '22 06:09

Although this issue seems to be largely resolved in version 3 (as stated above), one contributing factor appears to be that when cacheMethod=='write' (the default), the cache file is overwritten on every call to loadLanguage, even if the data was read from that cache file. In other words, the cache file is frequently overwritten with identical contents.

https://github.com/naptha/tesseract.js/blob/dd6c40b6818468f45cf006e844501e6afdb377b1/src/worker-script/index.js#L134-L136

I made a change in the dev/v4 branch so that this no longer happens, which should reduce how often the cache is overwritten and therefore the chance of the file being corrupted.
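
The idea amounts to writing the cache only when the data was freshly fetched rather than read back from the cache. Below is a minimal sketch of that logic, not the actual dev/v4 code: the `loadLangData` name, the Map-like `cache` object, and the `fetchData` callback are all hypothetical stand-ins; only `cacheMethod` and its default value `'write'` come from the discussion above.

```javascript
// Hypothetical sketch; function and parameter names are illustrative and
// are not taken from the tesseract.js source.
// `cache` is any object with get(key) and set(key, value);
// `fetchData` is an async function that retrieves the language data.
async function loadLangData(lang, { cache, fetchData, cacheMethod = 'write' }) {
  const cached = cache.get(lang);
  const fromCache = cached !== undefined;
  const data = fromCache ? cached : await fetchData(lang);

  // Write only when the data was freshly fetched. Re-writing data that was
  // just read from the cache would replace the entry with identical
  // contents and add one more opportunity for a corrupted write.
  if (cacheMethod === 'write' && !fromCache) {
    cache.set(lang, data);
  }
  return data;
}
```

With this shape, repeated loads of the same language write to the cache at most once, on the first load that misses.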

Balearica • Sep 20 '22 02:09