tesseract.js
Errors when using Buffer from Jimp
When I try to recognize a Buffer object created with Jimp, I get the following errors:
Error in pixReadMem: Unknown format: no pix returned
Error in pixGetSpp: pix not defined
Error in pixGetDimensions: pix not defined
Error in pixGetColormap: pix not defined
Error in pixCopy: pixs not defined
Error in pixGetDepth: pix not defined
Error in pixGetWpl: pix not defined
Error in pixGetYRes: pix not defined
Error in pixClone: pixs not defined
Warning: Invalid resolution 0 dpi. Using 70 instead.
Error in pixClone: pixs not defined
Error in pixCopy: pixs not defined
tess->pix_binary() != nullptr:Error:Assert failed:in file /src/src/ccmain/osdetect.cpp, line 201
trap!
I guess this is because Tesseract expects the Buffer to contain the full image file with metadata, while the Buffer from Jimp is just a raw bitmap. In that case, it would be cool if Tesseract could operate on plain bitmaps. If not, please add a warning about this to the docs ;)
I should also note that this worked fine in the previous version of tesseract.js (1.0).
To Reproduce
1. Create a project with tesseract.js 2, Jimp, and an image file image.png.
2. Create a file with the following content:
const { createWorker } = require('tesseract.js')
const Jimp = require('jimp')
const filename = 'image.png'
;(async () => {
  const image = await Jimp.read(filename)
  const worker = createWorker()
  await worker.load()
  await worker.loadLanguage('eng')
  await worker.initialize('eng')
  // image.bitmap.data is raw pixel data; passing it here triggers the errors above
  const result = await worker.recognize(image.bitmap.data)
  console.log(JSON.stringify(result.data))
  return worker.terminate()
})()
3. Run the file
- OS: [Windows 10]
- Env: [Node.js 12]
- Version [2.0.0-beta.2]
A workaround for now is to let Jimp encode the image into a buffer for a given MIME type with getBufferAsync:
const buffer = await image.getBufferAsync('image/png')
const result = await worker.recognize(buffer)
This is not ideal though, because it requires creating another buffer needlessly.
Now that I think about it, it can't work that way, because a bitmap Buffer doesn't contain any information about the image dimensions. Now I don't understand how I managed to make it work before in 1.0.
Anyway, if I could pass a buffer with width/height info to Tesseract, that would be awesome ;)
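To illustrate why a raw bitmap Buffer alone is ambiguous, here is a minimal sketch. It assumes the bitmap stores 4 bytes per pixel (RGBA), which is how Jimp's image.bitmap.data is laid out as far as I know. possibleDimensions is a hypothetical helper written for this illustration; it is not part of Jimp or Tesseract.js.

```javascript
// Hypothetical helper (not part of Jimp or Tesseract.js): list every
// width/height pair that a raw buffer of the given size could represent.
// Assumes 4 bytes per pixel (RGBA) unless told otherwise.
function possibleDimensions(byteLength, bytesPerPixel = 4) {
  // A length that is not a whole number of pixels cannot be a valid bitmap.
  if (byteLength % bytesPerPixel !== 0) return [];
  const pixelCount = byteLength / bytesPerPixel;
  const candidates = [];
  for (let width = 1; width <= pixelCount; width++) {
    if (pixelCount % width === 0) {
      candidates.push({ width, height: pixelCount / width });
    }
  }
  return candidates;
}

// A 24-byte RGBA buffer holds 6 pixels, but nothing in the bytes
// says how those pixels are arranged into rows.
const raw = Buffer.alloc(6 * 4);
console.log(possibleDimensions(raw.length));
// Four candidates: 1x6, 2x3, 3x2 and 6x1
```

Every candidate is equally valid for the same bytes, so the width and height have to travel separately (as image.bitmap.width and image.bitmap.height do in Jimp) or be embedded in an encoded file format such as PNG.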
One quick fix is to put your buffer into a canvas and pass the canvas to the recognize() function. We will add this feature in the next release.
Is it already fixed?
@jeromewu Maybe it's pretty easy, but would you please give example code for converting the buffer to a canvas? I did not find any proper solution. Thanks in advance.
Can you show us how to convert buffer data into a canvas?
Please disregard the comment above about converting to canvas. Canvas is an API native to browsers (not Node.js), and only the browser version of Tesseract.js seamlessly supports canvas inputs. It looks like jimp is a Node.js library.
Regarding what images are supported: Tesseract.js does not support raw pixel data (which is returned by image.bitmap.data) as an input type, for either browser or Node. After reviewing the documentation, I agree that it is unclear on this point and should be clarified.
Regarding using Tesseract.js with Jimp: it looks like Jimp has multiple methods that transform the data into formats that Tesseract.js does accept. In fact, despite the follow-up comment, @panstromek's original suggested fix (using getBufferAsync) works perfectly well.
const result = await worker.recognize(await image.getBufferAsync(Jimp.MIME_PNG));
I think the cause of the confusion above is conflation of image formats with data types. Buffers that contain supported image formats (e.g. png) can be used, while buffers that do not contain supported image formats (e.g. the raw data in image.bitmap.data) are not supported.
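As a rough illustration of that distinction, the sketch below guesses whether a Buffer holds an encoded image file by looking at its leading bytes. detectEncodedFormat and the SIGNATURES table are hypothetical names made up for this example, not part of Tesseract.js, and the table covers only three well-known file signatures (PNG, JPEG, BMP). It is a heuristic: raw pixel data could start with the same bytes by coincidence.

```javascript
// Hypothetical helper, not part of Tesseract.js: guess whether a Buffer
// holds an encoded image file by checking well-known file signatures.
const SIGNATURES = {
  png: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a],
  jpeg: [0xff, 0xd8, 0xff],
  bmp: [0x42, 0x4d], // ASCII "BM"
};

function detectEncodedFormat(buffer) {
  for (const [format, signature] of Object.entries(SIGNATURES)) {
    const matches =
      buffer.length >= signature.length &&
      signature.every((byte, i) => buffer[i] === byte);
    if (matches) return format;
  }
  // No known signature: possibly raw pixel data such as image.bitmap.data.
  return null;
}

// The first 8 bytes of any PNG file, followed by two filler bytes.
const pngHeader = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a, 0, 0]);
// Four opaque white RGBA pixels, i.e. raw pixel data with no file header.
const rawPixels = Buffer.alloc(16, 0xff);

console.log(detectEncodedFormat(pngHeader)); // png
console.log(detectEncodedFormat(rawPixels)); // null
```

A check like this could run before calling worker.recognize() to fail early with a readable message instead of the pixReadMem errors shown at the top of this issue.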
I updated the documentation to clarify which image formats and data types are supported. This includes adding the following note to prevent this misunderstanding from occurring again.
Note: images must be a supported image format and a supported data type. For example, a buffer containing a png image is supported. A buffer containing raw pixel data is not supported.