
Top property in rectangles

Open MiguelBragaGarcia opened this issue 3 years ago • 5 comments

The worker does not work properly when the top property of the rectangle is set. I'm trying to split the image vertically (a top part and a bottom part) instead of cutting it in the traditional way (left and right). Every worker except the last one fails to convert its part of the image to text.

To Reproduce: run the code below.

const { createWorker, createScheduler } = require('tesseract.js');

const scheduler = createScheduler();
const worker1 = createWorker();
const worker2 = createWorker();

const rectangles = [
  {
    left: 0,
    top: 0,
    width: 1486,
    height: 334,
  },
  {
    left: 0,
    top: 334,
    width: 1486,
    height: 334,
  },
];

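(async () => {
  // Load and initialize both workers with the English language data.
  await worker1.load();
  await worker1.loadLanguage('eng');
  await worker1.initialize('eng');

  await worker2.load();
  await worker2.loadLanguage('eng');
  await worker2.initialize('eng');

  // Register both workers so the scheduler can distribute jobs between them.
  scheduler.addWorker(worker1);
  scheduler.addWorker(worker2);

  // Queue one recognize job per rectangle; each job should OCR only its own region.
  const results = await Promise.all(rectangles.map((rectangle) => (
    scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png', { rectangle })
  )));
  console.log(results.map((r) => r.data.text));

  // Terminating the scheduler also terminates its workers, so the process can exit.
  await scheduler.terminate();
})();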

Expected behavior: An array containing both halves of the extracted text. Instead, only the last half was extracted. This is not limited to 2 workers: I tested with 4 to try to speed up the process, and 3 of the 4 workers did not work. Only the last one worked properly.
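For anyone reproducing this with more than two workers, the rectangles can be generated rather than written by hand. The snippet below is only a sketch: `splitIntoBands` is a hypothetical helper (not part of tesseract.js), and the 1486x668 image size is an assumption taken from the two rectangles in the reproduction code.

```javascript
// Hypothetical helper: split an image of the given size into `count`
// horizontal bands, stacked top to bottom, using the same
// { left, top, width, height } shape as the `rectangle` option.
function splitIntoBands(imageWidth, imageHeight, count) {
  if (!Number.isInteger(count) || count < 1) {
    throw new RangeError('count must be a positive integer');
  }
  const bandHeight = Math.floor(imageHeight / count);
  const bands = [];
  for (let i = 0; i < count; i += 1) {
    const top = i * bandHeight;
    // The last band absorbs any remainder so the bands cover the full height.
    const height = i === count - 1 ? imageHeight - top : bandHeight;
    bands.push({ left: 0, top, width: imageWidth, height });
  }
  return bands;
}

// Assumed image size: 1486x668 (two bands of 334 rows, as in the reproduction).
const bands = splitIntoBands(1486, 668, 2);
console.log(bands);
```

Each element of `bands` could then be passed as the `rectangle` option of one `recognize` job, one band per job.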

Screenshots: In this image I removed the empty strings ("") and the line breaks (\n) to improve readability. (screenshot: bug)

Desktop (please complete the following information):

  • OS: Linux Ubuntu 18.04.5 LTS

MiguelBragaGarcia, Sep 24 '20 18:09

Hi,

I'm having some issues with the position of the rectangles, so I don't know if it is related.

What I did was:
1. Loop through an array of the rectangles that I want to OCR, passing each of them to the function below.
2. A function that creates a canvas element, which receives the rectangle from the original canvas, and calls the worker with the new canvas element.
3. The worker function, more or less like the one you have.

Cheers

profabioalvespinto, Oct 06 '20 17:10

I also had some weirdness with the rectangle option, and ended up just slicing the image myself with ctx.getImageData() and passing that slice to Tesseract.
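A minimal sketch of that kind of manual slicing, for illustration only. In a browser, `ctx.getImageData(left, top, width, height)` returns the region directly; the hypothetical `cropImageData` helper below does the equivalent on a plain RGBA buffer shaped like an ImageData (`{ width, height, data }`), so it can run without a canvas. The helper name and the tiny test image are assumptions, not code from this thread.

```javascript
// Hypothetical helper: copy a rectangle out of an RGBA pixel buffer.
// `image` is any object shaped like a canvas ImageData:
// { width, height, data } with 4 bytes (R, G, B, A) per pixel.
function cropImageData(image, { left, top, width, height }) {
  if (left < 0 || top < 0
      || left + width > image.width || top + height > image.height) {
    throw new RangeError('rectangle falls outside the image');
  }
  const bytesPerPixel = 4;
  const out = new Uint8ClampedArray(width * height * bytesPerPixel);
  for (let row = 0; row < height; row += 1) {
    // Copy one row of the rectangle from the source into the output buffer.
    const srcStart = ((top + row) * image.width + left) * bytesPerPixel;
    const srcEnd = srcStart + width * bytesPerPixel;
    out.set(image.data.subarray(srcStart, srcEnd), row * width * bytesPerPixel);
  }
  return { width, height, data: out };
}

// Tiny 2x2 image where every channel of pixel (x, y) holds the value y * 2 + x.
const source = {
  width: 2,
  height: 2,
  data: Uint8ClampedArray.from([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]),
};

// Take the bottom half: the row containing pixels 2 and 3.
const bottomHalf = cropImageData(source, { left: 0, top: 1, width: 2, height: 1 });
console.log(Array.from(bottomHalf.data));
```

How the slice is then handed to the worker depends on which input types the installed tesseract.js version accepts, so that step is left out here.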

cxcorp, Dec 30 '20 10:12

Which version are you using?

squalvj, Feb 04 '21 10:02

> Which version are you using?

"tesseract.js": "^2.1.3"

MiguelBragaGarcia, Feb 04 '21 11:02

Interesting. Since the rectangle is merely an argument we pass to Tesseract (nothing in this codebase crops the image), it seems likely that this is an issue with Tesseract itself. Looking at the issues over there, there are indeed people who report that this feature is broken:

https://github.com/tesseract-ocr/tesseract/issues/845

Balearica, Sep 04 '22 03:09