Please call SetImage before attempting recognition.

Open konsumer opened this issue 4 years ago • 13 comments

Describe the bug

I am trying to use tesseract.js in nodejs, and can't seem to get it to work.

To Reproduce

I make code like this:

const tesseract = require('tesseract.js')

const extractImageText = async filename => {
  const worker = tesseract.createWorker()
  await worker.load()
  await worker.loadLanguage('eng')
  await worker.initialize('eng')
  const { data: { text } } = await worker.recognize(filename)
  await worker.terminate()
  return text
}

extractImageText('test.pdf').then(console.log)

I get this error:

Error in pixReadMemGif: function not present
Error in pixReadMem: gif: no pix returned
Error in pixGetSpp: pix not defined
Error in pixGetDimensions: pix not defined
Error in pixGetColormap: pix not defined
Error in pixCopy: pixs not defined
Error in pixGetDepth: pix not defined
Error in pixGetWpl: pix not defined
Error in pixGetYRes: pix not defined
Error in pixClone: pixs not defined
Please call SetImage before attempting recognition.
PROJECT/node_modules/tesseract.js/src/createWorker.js:173
        throw Error(data);
        ^

Error: RuntimeError: function signature mismatch
    at ChildProcess.<anonymous> (PROJECT/node_modules/tesseract.js/src/createWorker.js:173:15)
    at ChildProcess.emit (events.js:209:13)
    at emit (internal/child_process.js:876:12)
    at processTicksAndRejections (internal/process/task_queues.js:77:11)

Desktop (please complete the following information):

  • OS: OSX Catalina 10.15.5 (19F101)
  • Node: v12.9.1
  • Version: 2.1.1

konsumer avatar Jul 17 '20 11:07 konsumer

Per the project's documentation, I don't think they support pdf files. We converted the pdf file into an image (jpg or png) and drew it on a canvas, but we are getting the same error message.

KaileD avatar Aug 05 '20 13:08 KaileD

I get a similar error even with jpg files:

Error in pixReadMem: size < 12
Error in pixGetSpp: pix not defined
Error in pixGetDimensions: pix not defined
Error in pixGetColormap: pix not defined
Error in pixCopy: pixs not defined
Error in pixGetDepth: pix not defined
Error in pixGetWpl: pix not defined
Error in pixGetYRes: pix not defined
Error in pixClone: pixs not defined
Please call SetImage before attempting recognition.
[2020-09-10 22:51:37]: ERROR Uncaught Exception: Error: RuntimeError: function signature mismatch
    at ChildProcess.<anonymous> (PROJECT/node_modules/tesseract.js/src/createWorker.js:173:15)
    at ChildProcess.emit (events.js:314:20)
    at emit (internal/child_process.js:906:12)
    at processTicksAndRejections (internal/process/task_queues.js:81:21)
Error: RuntimeError: function signature mismatch
    at ChildProcess.<anonymous> (PROJECT/node_modules/tesseract.js/src/createWorker.js:173:15)
    at ChildProcess.emit (events.js:314:20)
    at emit (internal/child_process.js:906:12)
    at processTicksAndRejections (internal/process/task_queues.js:81:21)

budsta95 avatar Sep 11 '20 02:09 budsta95

This is not something to do with the file, can +1 this. Node support is broken.

Mooshua avatar Sep 14 '20 23:09 Mooshua

Since the Node.js test case is still passing, I think the file itself is causing the issue. Could you provide the image file you used so I can check? And yes, right now tesseract.js doesn't support pdf files; you need to do the conversion first.
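
The `pixReadMemGif: function not present` and `no pix returned` lines in the original report are consistent with a non-image input reaching the decoder. A minimal sketch of a pre-check, assuming the input is already loaded as a Buffer; `sniffFormat` is a hypothetical helper, not part of tesseract.js, and it only knows a handful of well-known file signatures:

```javascript
// Hypothetical helper (not part of tesseract.js): guess a file's format
// from its leading "magic" bytes so a PDF can be rejected before OCR.
const SIGNATURES = [
  { format: 'pdf',  bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { format: 'png',  bytes: [0x89, 0x50, 0x4e, 0x47] }, // "\x89PNG"
  { format: 'jpeg', bytes: [0xff, 0xd8, 0xff] },
  { format: 'gif',  bytes: [0x47, 0x49, 0x46, 0x38] }, // "GIF8"
  { format: 'bmp',  bytes: [0x42, 0x4d] },             // "BM"
  { format: 'tiff', bytes: [0x49, 0x49, 0x2a, 0x00] }, // little-endian TIFF
  { format: 'tiff', bytes: [0x4d, 0x4d, 0x00, 0x2a] }, // big-endian TIFF
];

function sniffFormat(buffer) {
  for (const { format, bytes } of SIGNATURES) {
    // Compare the first bytes of the buffer against each known signature.
    if (buffer.length >= bytes.length && bytes.every((b, i) => buffer[i] === b)) {
      return format;
    }
  }
  return 'unknown';
}
```

Reading the file into a Buffer first and refusing anything that comes back as `'pdf'` or `'unknown'` would turn the opaque crash into an explicit error before recognition starts.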

jeromewu avatar Sep 15 '20 01:09 jeromewu

@jeromewu you were right! Turns out the problem was that the code was trying to perform recognition before the file was ready (I download, resize, and crop it before calling recognition). I'm not too familiar with async JS code (I'm a C dev), so it wasn't obvious to me how to write it so that each step finishes before the next one starts.
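
The ordering problem described here can be sketched as follows. The step functions (`download`, `resize`, `crop`, `recognize`) are hypothetical stand-ins that only record when they finish; in real code the last step would be the `worker.recognize` call from the original report.

```javascript
// Hypothetical stand-ins for the real steps; each resolves after a delay
// and records its name so the completion order can be inspected.
const finished = [];
const step = (name, ms) => () =>
  new Promise(resolve => setTimeout(() => {
    finished.push(name);
    resolve(name);
  }, ms));

const download = step('download', 30);
const resize = step('resize', 20);
const crop = step('crop', 10);
const recognize = step('recognize', 0);

// Each `await` pauses this function until the step's promise settles,
// so recognition cannot start before the file has been prepared.
async function prepareAndRecognize() {
  await download();
  await resize();
  await crop();
  await recognize();
  return finished;
}
```

If the four calls were made without `await`, they would all start at once and the step with the shortest delay (recognition, in this sketch) would finish first, which matches the failure described in the comment.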

budsta95 avatar Sep 15 '20 14:09 budsta95

Ok, I also have the same problem, but the image I am using is in jpg format. Can you help me fix this? Here is my code:

const express = require('express');
const app = express();
const fs = require('fs');
const multer = require('multer');
const { createWorker } = require('tesseract.js');

const worker = createWorker({
  logger: m => console.log(m), // Add logger here
});

const storage = multer.diskStorage({
  destination: (req, file, cb) => {
    cb(null, "./uploads");
  },
  filename: (req, file, cb) => {
    cb(null, file.originalname);
  }
});

const upload = multer({storage: storage}).single("avatar");
app.set("view engine", "ejs");

app.get('/', (req, res) => {
  res.render('index');
});

app.post('/upload', upload, (req, res, next) => {
  file = req.file
  if (!file) {
    const error = new Error('Please upload a file')
    error.httpStatusCode = 400
    return next(error)
  }
  if (file) {
    (async () => {
      await worker.load();
      await worker.loadLanguage('eng+fra');
      await worker.initialize('fra');
      const { data: { text } } = await worker.recognize(file);
      console.log(text);
      await worker.terminate();
    })();
  }
})

const PORT = 5000 || process.env.PORT;
app.listen(5000, () => console.log("Server is running on port 5000"));

I would also like to know whether Tesseract can convert the image to black and white, and whether we can define a whitelist of images to analyze (for example, if the image contains "Hello" we do not perform the recognition?). Thank you for your answers.
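
One thing worth checking in the code above, although nobody in this thread confirmed it as the cause: `worker.recognize(file)` receives the whole multer file object rather than image data. With multer's disk storage the saved file's location is in `file.path`, and with memory storage the bytes are in `file.buffer`. A minimal sketch, where `toRecognizeInput` is a hypothetical helper:

```javascript
// Hypothetical helper: pick something an OCR call can read (a path string
// or a Buffer) out of a multer file object.
function toRecognizeInput(file) {
  if (!file) {
    throw new Error('Please upload a file');
  }
  if (Buffer.isBuffer(file.buffer)) {
    return file.buffer; // populated by multer.memoryStorage()
  }
  if (typeof file.path === 'string') {
    return file.path; // populated by multer.diskStorage()
  }
  throw new TypeError('Unsupported upload object');
}
```

Under that assumption, the handler would call `worker.recognize(toRecognizeInput(req.file))` instead of passing `file` directly.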

deiss98 avatar Sep 25 '20 17:09 deiss98

Have the same problem with .tiff files. I am loading the files locally so it shouldn't be an issue. It happens intermittently.

Is there something else I should be doing?

@jeromewu This is happening with many files I am attempting but here is one where it is happening commonly.

I can't attach .tiff files but this link will download it directly. https://recordsearch.kingcounty.gov/Landmarkweb/Document/GetTifDocumentByCFN/?cfn=3093907

aarmora avatar Jan 29 '21 15:01 aarmora

This just happened to me.

In my case, I did this:

  1. Had Tesseract start processing an image
  2. Made a change in my .js file that's calling Tesseract.
  3. Saved the file, so that Webpack applied my changes to the browser via HMR (hot module reloading)
  4. The error appeared

At least in my case, this seems like pretty reasonable behavior, since Webpack is changing the page's JavaScript while the Tesseract worker is running 🙂

neoncube2 avatar Apr 05 '21 07:04 neoncube2

can you share sample code or test case? @neoncube2

irpankusuma avatar Sep 22 '21 17:09 irpankusuma

Same issue here! I am using a blob of an image and sending it to my nodejs server. This is the code server side:

app.post('/imgocr', (req, res) => {
  console.dir(req.body);

  Tesseract.recognize(req.body, 'eng', {
    logger: m => console.log("PROGRESS: " + m)
  }).then(({ data: { text } }) => {
    console.log("tESS RES: " + text);
  })
});

and this is the code in my chrome extension:

var saveData = (function () {
  return function (data, fileName) {
    console.log("fromsave " + data);
    // var json = JSON.stringify(data),
    // blob = new Blob([json], { type: "octet/stream" });
    // console.log("fromsave " + json);
    var xj = new XMLHttpRequest();
    xj.open("POST", "http://localhost:3000/imgocr", true);
    xj.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');
    xj.send(data);
    xj.onreadystatechange = function () { console.log(xj.responseText); }
  };
}());

const blob = await fetch(imgOcr).then(res => res.blob());
console.log(typeof(blob));
console.log(blob)
let fileName = titleChannel + "download.json";
saveData(blob, fileName);

console.log(blob) gives this:

Blob {size: 12687, type: 'image/jpeg'}
size: 12687
type: "image/jpeg"
[[Prototype]]: Blob

but the error is this:

[Object: null prototype] { (unprintable binary key/value data omitted) }
PROGRESS: [object Object]
PROGRESS: [object Object]
PROGRESS: [object Object]
PROGRESS: [object Object]
PROGRESS: [object Object]
PROGRESS: [object Object]
PROGRESS: [object Object]
PROGRESS: [object Object]
PROGRESS: [object Object]
Error in pixReadMem: size < 12
Error in pixGetSpp: pix not defined
Error in pixGetDimensions: pix not defined
Error in pixGetColormap: pix not defined
Error in pixCopy: pixs not defined
Error in pixGetDepth: pix not defined
Error in pixGetWpl: pix not defined
Error in pixGetYRes: pix not defined
Error in pixClone: pixs not defined
Please call SetImage before attempting recognition.
/Users/tgfc-7/node_modules/tesseract.js/src/createWorker.js:173
        throw Error(data);
        ^

Error: RuntimeError: null function or function signature mismatch
    at ChildProcess.<anonymous> (/Users/tgfc-7/node_modules/tesseract.js/src/createWorker.js:173:15)
    at ChildProcess.emit (node:events:520:28)
    at emit (node:internal/child_process:938:14)
    at processTicksAndRejections (node:internal/process/task_queues:84:21)
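
In the log above, `req.body` prints as an object (`[Object: null prototype] {...}`) rather than as raw bytes, and the first error is `pixReadMem: size < 12`, which suggests the decoder did not receive usable image data. A hedged sketch of a guard that could run before recognition: `assertImageBuffer` is a hypothetical helper, and the 12-byte minimum is taken from that error message. Getting a Buffer into `req.body` in the first place would need a raw body parser (for example `express.raw` in recent Express 4 releases) and a matching `Content-type` on the client; that is an assumption about the fix, not something confirmed in this thread.

```javascript
// Hypothetical guard: fail early with a readable error instead of letting
// the OCR worker crash on a body that is not image bytes.
const MIN_IMAGE_BYTES = 12; // threshold taken from "pixReadMem: size < 12"

function assertImageBuffer(body) {
  if (!Buffer.isBuffer(body)) {
    throw new TypeError('Expected the request body to be a Buffer of image bytes');
  }
  if (body.length < MIN_IMAGE_BYTES) {
    throw new RangeError(`Image data too short: ${body.length} bytes`);
  }
  return body;
}
```

With the request shown above, the parsed-object body would be rejected by the first check, so the handler could answer with a 400 status rather than bringing down the server.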

taouichaimaa avatar Feb 18 '22 16:02 taouichaimaa

Any updates ?

taouichaimaa avatar Feb 18 '22 16:02 taouichaimaa

Tesseract.js runs as a child process. If it can't recognize the given image, it throws SIGINT and takes down the parent process too.

I found a simple workaround that doesn't solve the recognition problem but prevents the main application from crashing.

Just catch uncaughtException:

process.on('uncaughtException', err => {
    logger.error(err.stack);
});

mrShturman avatar May 24 '22 07:05 mrShturman

Started new issue...

tgraupmann avatar Jun 29 '22 00:06 tgraupmann

I'm closing this issue as the OP's problem was caused by trying to recognize a .pdf (which we do not support). As the error message appears to be fairly generic, if you encounter an error with a similar message, please create a new issue including a reproducible example.

Balearica avatar Aug 20 '22 03:08 Balearica

Hello, could you please share the link to the issue created for this same problem with image files (JPG, TIFF, etc., as @budsta95 and @deiss98 reported)? I think it is clear that this is happening independently of the file format (for PDF, it fails in packages such as pdf-extract that use tesseract), but it also fails for images that are supposedly supported by Tesseract.

Can you please provide us with the link to the issue for images so we can track the progress? 🙏

DigitalLeaves avatar Dec 21 '22 16:12 DigitalLeaves

@DigitalLeaves There is no open issue because no user has provided a reproducible example of this happening with a jpeg or tiff. If you can provide a reproducible example using the latest version of Tesseract.js you should open a new issue.

Balearica avatar Dec 21 '22 19:12 Balearica

Good point @Balearica. This problem is affecting me with PDFs, but it occurs in the interaction with the tesseract library. I will try to put together an example of the failing PDF, but I'd like to ask @deiss98 and @budsta95 whether they still have the images they reported. That would be faster and easier for everybody, I guess.

DigitalLeaves avatar Dec 22 '22 15:12 DigitalLeaves

@DigitalLeaves To confirm it is an issue with this library, we would need an image file that produces the error (as Tesseract.js only accepts image files). Projects that use Tesseract.js with PDFs work by adding additional libraries/steps to convert from PDF -> image and then run Tesseract.js on that image. Therefore, if there was a bug with the part of that pipeline that produces the images, that would produce the error message described in Tesseract.js, however that would not be a bug in Tesseract.js.

Balearica avatar Dec 22 '22 19:12 Balearica