tesseract.js

OCR of the number under a barcode does not work

Open allancmello opened this issue 3 years ago • 4 comments

Describe the bug: Tesseract does not read the number printed below the barcode. It returns only the letter T, which does not appear in the image. Is it possible to read the barcode number?

To Reproduce:

    import { Component } from "@angular/core";
    import { createWorker } from "tesseract.js";

    @Component({
      selector: "app-root",
      templateUrl: "./app.component.html",
      styleUrls: ["./app.component.css"]
    })
    export class AppComponent {
      title = "tesseract.js-angular-app";
      ocrResult = "Recognizing...";

      constructor() {
        this.doOCR();
      }

      async doOCR() {
        const worker = createWorker({ logger: m => console.log(m) });
        await worker.load();
        await worker.loadLanguage("por");
        await worker.initialize("por");
        const { data: { text } } = await worker.recognize(
          "https://clubb2b.com.br/images/med_barcode_1.png"
        );
        this.ocrResult = text;
        console.log(text);
        await worker.terminate();
      }
    }

Expected behavior: The number under the barcode is read: 7898096577840

Screenshots image

Desktop:

  • OS: Windows 64-bit
  • Browser: Chrome
  • Version: 89.0.4389.114

Additional context: Sample at https://stackblitz.com/edit/github-3vemxs?file=src%2Fapp%2Fapp.component.ts

allancmello, Apr 07 '21 16:04

Hi. Thank you for the report. I am guessing that the bars are so close to the text that it is difficult for Tesseract to detect where the actual text is. The "T" might be a misread "7" from the bottom left of the image. I do not see an easy way to make this simpler for Tesseract. It might help to crop away the bars, leaving only the number, but that is in itself about as complicated as reading the barcode in the first place.
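For what it is worth, here is a rough sketch of the cropping idea. It assumes the digits sit in the bottom strip of the photo and that the image width and height are already known; the helper name `bottomStrip` and the default fraction of 0.25 are made up for illustration and would need tuning per image. The returned object has the shape of the `rectangle` option from the tesseract.js "only part of the image" example.

```typescript
// Shape of the `rectangle` option that worker.recognize() accepts in tesseract.js.
interface Rectangle {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Hypothetical helper: returns the bottom `fraction` of an image as a rectangle.
// The default of 0.25 is a guess, not a value taken from the issue.
function bottomStrip(imageWidth: number, imageHeight: number, fraction = 0.25): Rectangle {
  if (fraction <= 0 || fraction > 1) {
    throw new RangeError("fraction must be in (0, 1]");
  }
  // Keep at least one pixel row so the rectangle is never empty.
  const height = Math.max(1, Math.round(imageHeight * fraction));
  return { left: 0, top: imageHeight - height, width: imageWidth, height };
}
```

It would then be passed along the lines of `worker.recognize(image, { rectangle: bottomStrip(width, height) })`. On a real photo the digits will not always be in the same place, which is the hard part mentioned above.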

Not sure what your use case is, but the easiest solution might be to use the barcode itself. It is designed to be easily machine-readable, and it should encode the same number as the one written below it. I am sure there are barcode readers for JavaScript out there that can be used instead.
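Whichever way the number is obtained, it can be sanity-checked afterwards. The sketch below assumes the code is an EAN-13 (an assumption; the 13-digit expected value 7898096577840 is consistent with it), where the last digit is a check digit computed from the first twelve with alternating weights 1 and 3. The function name `isValidEan13` is made up for illustration.

```typescript
// Returns true if `code` is 13 digits and its last digit matches the EAN-13 check digit.
function isValidEan13(code: string): boolean {
  if (!/^\d{13}$/.test(code)) {
    return false;
  }
  const digits = code.split("").map((c) => Number(c));
  // Weights alternate 1, 3, 1, 3, ... over the first twelve digits (left to right).
  const sum = digits
    .slice(0, 12)
    .reduce((acc, d, i) => acc + d * (i % 2 === 0 ? 1 : 3), 0);
  const check = (10 - (sum % 10)) % 10;
  return check === digits[12];
}
```

A misread such as "T" or a number with one wrong digit fails this check, so a bad recognition can be rejected instead of being passed on silently.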

falktan, Apr 16 '21 15:04

One additional thing that might help is to limit the allowed characters to digits. See the example here: https://github.com/naptha/tesseract.js/blob/master/docs/examples.md#with-whitelist-char-200-beta1
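A minimal sketch of what that could look like, assuming a worker that has already been loaded and initialized as in the original post. `OcrWorker` is a hypothetical interface declared here only so the snippet stands on its own; it covers just the two worker methods used. `tessedit_char_whitelist` is the parameter from the linked example. Stripping leftover non-digits at the end is an extra step that is not part of that example.

```typescript
// Hypothetical structural type: only the two worker methods this sketch needs.
interface OcrWorker {
  setParameters(params: { tessedit_char_whitelist: string }): Promise<unknown>;
  recognize(image: string): Promise<{ data: { text: string } }>;
}

// Restricts recognition to digits, runs OCR, and returns only the digits found.
async function recognizeDigits(worker: OcrWorker, image: string): Promise<string> {
  await worker.setParameters({ tessedit_char_whitelist: "0123456789" });
  const { data: { text } } = await worker.recognize(image);
  // Drop whitespace and any other non-digit characters left in the output.
  return text.replace(/\D/g, "");
}
```

The whitelist only limits which characters Tesseract may output. It does not help Tesseract find the text next to the bars, so the result may still be incomplete.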

falktan, Apr 16 '21 15:04

Hi falktan,

Thanks for answering. I'm setting up a Node server to integrate with WhatsApp: the user takes a photo of the barcode and sends it to the server, which reads the image. WhatsApp does not have a barcode reader. So the barcode image that is sent will be read by Tesseract, extracting the numbers rather than reading the bars. If you know of another way to do it, please let me know. Thank you.

allancmello, Apr 16 '21 21:04

> Then, the sending of the barcode image will be read by the tesseract by extracting the numbers and not reading the bars.

Not sure why you prefer to read the numbers rather than the bars. Using the bars should make it much easier.

I just googled "Barcode Scanner open source". Maybe you would like to have a look at this: https://serratus.github.io/quaggaJS/ (I did not test this, but it looks like what you need). This should allow you to get the number from scanning the bars.

falktan, Apr 18 '21 10:04

Closing as answered. falktan correctly notes that setting a whitelist to only include digits will help with recognizing digits, and that scanning barcodes directly is more reliable than recognizing text. Beyond that, generic issues with Tesseract recognition quality are outside the scope of this repo (this project is a wasm port of the Tesseract recognition engine, which we do not edit).

Balearica, Aug 28 '22 01:08