tesseract.js
Top property in rectangles
The worker does not work properly when the top property is set. I'm trying to split the image vertically (top and bottom) instead of cutting it in the traditional way (left and right). With the exception of the last worker, none of the workers convert their part of the image to text.
To Reproduce
Run the code below:

const { createWorker, createScheduler } = require('tesseract.js');
const path = require('path');

const scheduler = createScheduler();
const worker1 = createWorker();
const worker2 = createWorker();
const rectangles = [
  {
    left: 0,
    top: 0,
    width: 1486,
    height: 334,
  },
  {
    left: 0,
    top: 334,
    width: 1486,
    height: 334,
  },
];

(async () => {
  await worker1.load();
  await worker1.loadLanguage('eng');
  await worker1.initialize('eng');
  await worker2.load();
  await worker2.loadLanguage('eng');
  await worker2.initialize('eng');
  scheduler.addWorker(worker1);
  scheduler.addWorker(worker2);
  const results = await Promise.all(rectangles.map((rectangle) => (
    scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png', { rectangle })
  )));
  console.log(results.map(r => r.data.text));
})();
Expected behavior
An array containing both halves of the extracted text was expected, but only the last half was extracted. This is not limited to 2 workers: I tested with 4 to try to speed up the process, and 3 of the 4 workers did not work. Only the last one worked properly.
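For testing with other worker counts, the top/bottom rectangles can be generated rather than typed by hand. This is only a sketch: `horizontalBands` is a hypothetical helper, not part of tesseract.js, and the total height of 668 is an assumption inferred from the two 334-pixel rectangles in the report.

```javascript
// Hypothetical helper (not part of tesseract.js): split an image into
// `count` horizontal bands stacked from top to bottom.
function horizontalBands(width, height, count) {
  const bandHeight = Math.floor(height / count);
  const rectangles = [];
  for (let i = 0; i < count; i += 1) {
    const top = i * bandHeight;
    // The last band absorbs any remainder so the bands cover the full height.
    const h = i === count - 1 ? height - top : bandHeight;
    rectangles.push({ left: 0, top, width, height: h });
  }
  return rectangles;
}

// Assumed image size: 1486 x 668 (inferred from the rectangles in the report).
console.log(horizontalBands(1486, 668, 2));
```

With a count of 2 this yields the same two rectangles as in the reproduction code, so it can be dropped in for `rectangles` when trying 4 workers.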
Screenshots
In this image I removed the empty strings ("") and the line breaks (\n) to improve readability.
Desktop (please complete the following information):
- OS: Linux Ubuntu 18.04.5 LTS
Hi,
I'm having some issues with the position of the rectangles, so I don't know if it is related.
What I did was:
1st - Loop through an array of the rectangles that I want to OCR, passing each of them to the function below;
2nd - A function that creates a canvas element, draws the rectangle from the original canvas into it, and calls the worker with the new canvas element;
3rd - The worker function, more or less like the one you have.
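The second step above could look roughly like the following. This is a minimal sketch, not the commenter's actual code: `cropToCanvas` and `defaultMakeCanvas` are hypothetical names, and the `makeCanvas` parameter is an assumption added only so the helper can also run where `document` is unavailable.

```javascript
// Hypothetical helper: copy one rectangle of `source` (a canvas or image)
// onto a fresh canvas that is exactly the size of that rectangle.
function cropToCanvas(source, rectangle, makeCanvas = defaultMakeCanvas) {
  const { left, top, width, height } = rectangle;
  const canvas = makeCanvas(width, height);
  const ctx = canvas.getContext('2d');
  // drawImage(image, sx, sy, sWidth, sHeight, dx, dy, dWidth, dHeight)
  ctx.drawImage(source, left, top, width, height, 0, 0, width, height);
  return canvas;
}

// Browser default: create a real <canvas> element of the requested size.
function defaultMakeCanvas(width, height) {
  const canvas = document.createElement('canvas');
  canvas.width = width;
  canvas.height = height;
  return canvas;
}
```

In the browser, each cropped canvas would then be passed to the worker as the image, without the rectangle option, so Tesseract never has to crop anything itself.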
Cheers
I also had some weirdness with the rectangle option and also went with just slicing the image myself with ctx.getImageData() and passing that slice to Tesseract.
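In a browser, ctx.getImageData(left, top, width, height) returns the pixels of the requested rectangle directly. To show what that slice contains, here is a sketch of the equivalent indexing on a raw RGBA buffer laid out like ImageData.data. `sliceRgba` is a hypothetical helper written for illustration, not something tesseract.js or the canvas API provides.

```javascript
// Hypothetical helper: copy a rectangle out of an RGBA pixel buffer that is
// laid out like ImageData.data (row-major, 4 bytes per pixel).
function sliceRgba(data, imageWidth, rectangle) {
  const { left, top, width, height } = rectangle;
  const out = new Uint8ClampedArray(width * height * 4);
  for (let row = 0; row < height; row += 1) {
    // Byte offset of the first pixel of this row inside the source buffer.
    const start = ((top + row) * imageWidth + left) * 4;
    out.set(data.subarray(start, start + width * 4), row * width * 4);
  }
  return out;
}

// Tiny 2x4 test image: every channel of pixel i holds the value i.
const pixels = new Uint8ClampedArray(2 * 4 * 4);
for (let i = 0; i < 8; i += 1) pixels.fill(i, i * 4, i * 4 + 4);

// Bottom half of the image: pixels 4 to 7.
const bottomHalf = sliceRgba(pixels, 2, { left: 0, top: 2, width: 2, height: 2 });
console.log(bottomHalf.length);
```

Because the slice carries its own width and height, the top offset no longer needs to be communicated to Tesseract at all.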
Which version are you using?
"tesseract.js": "^2.1.3"
Interesting. As this is merely an argument we pass to Tesseract (nothing in this codebase crops the image), it seems likely that this is an issue with Tesseract itself. Looking at the issues over there, there are indeed people who report that this feature is broken.
https://github.com/tesseract-ocr/tesseract/issues/845