node-tesseract

Is it possible?

Open SPlatten opened this issue 7 years ago • 7 comments

For PDFs that contain text I am using pdf2json, which gives me all the text nodes and their PDF coordinates. For PDFs that do not contain text I am using node-tesseract, but this extracts just the text. Is it possible to get the coordinates of the text to go along with the output?

SPlatten avatar May 19 '17 08:05 SPlatten

I think what I am asking can be achieved by getting tesseract to use the "hocr" option, which causes it to output HTML that includes box coordinates for each text item. Now the question is, can the module pass this option through?
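For reference, here is a minimal sketch of how the box coordinates could be read back out of hOCR markup. It assumes Tesseract-style output in which each word is a `span` with class `ocrx_word` and a `title` attribute of the form `bbox x0 y0 x1 y1`, in image pixels measured from the top-left corner. The function name `parseHocrWords` is made up for illustration and is not part of node-tesseract.

```javascript
// Sketch only: extract word text and bounding boxes from hOCR markup.
// Assumes each word looks like
//   <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>
// parseHocrWords is a hypothetical helper, not a node-tesseract API.
function parseHocrWords(hocr) {
  // Match only opening tags that mention ocrx_word, so enclosing
  // line/paragraph spans are skipped rather than swallowing the words.
  const wordPattern = /<span\b([^>]*\bocrx_word\b[^>]*)>([\s\S]*?)<\/span>/g;
  const words = [];
  let match;
  while ((match = wordPattern.exec(hocr)) !== null) {
    const bbox = /bbox (\d+) (\d+) (\d+) (\d+)/.exec(match[1]);
    if (!bbox) continue; // no coordinates on this element
    words.push({
      // Strip inline markup such as <strong>; HTML entities are left as-is.
      text: match[2].replace(/<[^>]+>/g, '').trim(),
      x0: Number(bbox[1]),
      y0: Number(bbox[2]),
      x1: Number(bbox[3]),
      y1: Number(bbox[4])
    });
  }
  return words;
}
```

A regular expression is enough for a quick check like this; for anything more robust an HTML parser would be the safer choice.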

SPlatten avatar May 20 '17 18:05 SPlatten

OK, I've modified tesseract.js by inserting:

    command.push("hocr");

at line 70. This results in the output being HTML with box coordinates for every text item. Is there another way of doing this without modifying tesseract.js?

SPlatten avatar May 20 '17 19:05 SPlatten

After searching around, it seems the built-in, supported way to do this is to add a 'format' option to the options object, specifying 'hocr' as the value.

[edit]...unfortunately it didn't help...back to using the solution in the previous post.

SPlatten avatar May 22 '17 13:05 SPlatten

Does anyone maintain this module anymore?

SPlatten avatar May 22 '17 18:05 SPlatten

You are honestly better off using a library that has native bindings to tesseract.

Or just replicate what this does. This library doesn't do anything special; in fact, you could rewrite it a lot more cleanly with ES6 syntax.
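As a rough illustration of what such a rewrite might look like, here is a sketch that calls the tesseract binary directly through Node's `child_process.execFile`. The names `buildTesseractArgs` and `recognize` are hypothetical. It assumes a Tesseract build that accepts `stdout` as the output base and the `--psm` flag (older 3.x releases spell it `-psm`), so adjust for your installed version.

```javascript
const { execFile } = require('child_process');

// Sketch only. CLI shape: tesseract <image> <outputbase> [options] [configfile]
// Assumptions: 'stdout' as output base prints the result instead of writing
// a file, and '--psm' is accepted (Tesseract 3.x uses '-psm' instead).
function buildTesseractArgs(imagePath, { lang = 'eng', psm = 3, hocr = false } = {}) {
  const args = [imagePath, 'stdout', '-l', lang, '--psm', String(psm)];
  if (hocr) {
    args.push('hocr'); // config name that switches the output to hOCR
  }
  return args;
}

// Runs the binary and resolves with whatever it printed (plain text or hOCR).
function recognize(imagePath, options = {}) {
  const binary = options.binary || 'tesseract';
  return new Promise((resolve, reject) => {
    execFile(
      binary,
      buildTesseractArgs(imagePath, options),
      { maxBuffer: 16 * 1024 * 1024 }, // hOCR output can be large
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

Usage would be along the lines of `recognize('/test.png', { lang: 'eng', hocr: true }).then(console.log)`. Passing arguments as an array to `execFile` avoids building a shell command string by hand.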

reecefenwick avatar May 22 '17 20:05 reecefenwick

@reecefenwick, thank you. I searched around today, and from what I was able to find, node-tesseract seems to be the best module for Node.js.

I will modify the code tonight and implement "hocr" via the options. I've also ordered a book on ES6, as I'm not yet familiar with it or what it can do.

SPlatten avatar May 23 '17 10:05 SPlatten

I think you can first modify the default options at line 22 of tesseract.js:

    options: {
      'l': 'eng',
      'psm': 3,
      'config': null,
      'binary': 'tesseract',
      'hocr': null
    },

Then, at line 70, add:

    if (options.hocr !== null) {
      command.push('hocr');
    }

In your own code, if you want hOCR output, do something like this:

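    var tesseract = require('node-tesseract');

    var options = {
      l: 'chi_sim+eng',
      psm: 4,
      hocr: 'hocr'  // any non-null value enables hOCR output
    };

    tesseract.process('/test.png', options, function (err, text) {
      if (err) {
        console.error(err);
      } else {
        console.log('----------------------------');
        console.log(text);  // hOCR markup rather than plain text
      }
    });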

gforcelong avatar Nov 20 '17 01:11 gforcelong