
OCR support to extract text from image content

Open aarmora opened this issue 3 years ago • 7 comments

Describe the bug

When running the library, I end up with a page that has no text on it when the PDF does have text.

Example pdf page - https://ecorp.azcc.gov/CommonHelper/GetFilingDocuments?barcode=20073012585383

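    import axios from 'axios';
    import { PdfData } from 'pdfdataextract';

    // Download the PDF as raw bytes
    const axiosResponse = await axios.get('https://www.chministries.org/media/5043/needsprocessingpacketweb.pdf', {
        responseType: 'arraybuffer',
        headers: {
            'Accept': 'application/pdf'
        }
    });

    // Extract the embedded text and metadata
    const data = await PdfData.extract(axiosResponse.data as any);

    console.log('data text', data);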

Using this kind of code works fine with other PDF pages.

Are there certain types of PDF formats that need to be converted differently?

Describe the expected behavior

I would expect there to be text displayed in the text attributes.

What is your Node.js version?

14.x.x

What operating system are you seeing the problem on?

Windows

Operating system version (or if other, then please fill in complete name and version)

10

Relevant log output

PdfData {
  pages: 2,
  text: [ '', '' ],
  fingerprint: '6477b4bbd76f64e67cf1d9c14a5d27c6',
  info: [Object: null prototype] {
    PDFFormatVersion: '1.2',
    IsLinearized: false,
    IsAcroFormPresent: false,
    IsXFAPresent: false,
    IsCollectionPresent: false,
    IsSignaturesPresent: false,
    Producer: ''
  }
}

Results -



### PDF File

[GetFilingDocuments.pdf](https://github.com/lublak/pdfdataextract/files/7340719/GetFilingDocuments.pdf)

aarmora avatar Oct 13 '21 19:10 aarmora

Hey @aarmora

Thanks for the bug report. Currently my library only extracts embedded text from a PDF file, but in your PDF the text is embedded as an image, so these pages need OCR, which recognizes text in image files. The library https://github.com/naptha/tesseract.js would be useful for this. However, Tesseract needs an image file as input, so the pages would have to be converted to images first. I am interested in this functionality myself and would like to add it to pdfdataextract, but unfortunately I can't say when it will be ready. The first step, currently in planning, is to render PDF pages to images with https://github.com/Automattic/node-canvas. Once that is done, the next step will be the OCR functionality, which recognizes the text in those images.

lublak avatar Oct 14 '21 06:10 lublak

@lublak That makes perfect sense. I appreciate your time.

How can you tell the difference between a pdf with an image and a pdf with text?

aarmora avatar Oct 14 '21 11:10 aarmora

@lublak I see. When I try to highlight the text with my cursor, it can't be selected in these PDFs.

You were very kind and provided a ton of great information. Thanks so much!

aarmora avatar Oct 14 '21 11:10 aarmora
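
The same check can be approximated in code. As a rough sketch: pages whose extracted text is empty or whitespace-only, as in the log output above (`text: [ '', '' ]`), probably contain only scanned images. The helper name `findPagesWithoutText` is hypothetical and not part of pdfdataextract; it only assumes that the result's `text` field is an array with one string per page, which is what the log shows.

```javascript
// Hypothetical helper (not part of pdfdataextract): given the per-page
// `text` array from an extraction result, return the 1-based numbers of
// pages that have no embedded text. Such pages are likely scanned images.
function findPagesWithoutText(pageTexts) {
    const pages = [];
    pageTexts.forEach((text, index) => {
        // Treat missing values and whitespace-only strings as "no text"
        if (typeof text !== 'string' || text.trim().length === 0) {
            pages.push(index + 1);
        }
    });
    return pages;
}

// The log output in this issue shows text: [ '', '' ]
console.log(findPagesWithoutText(['', '']));
```

This is only a heuristic: a page can be genuinely blank, and a scanned page can carry an invisible text layer from an earlier OCR pass, so an empty string suggests an image-only page rather than proving it.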

@aarmora just for your information. I started to implement it directly into my library: https://github.com/lublak/pdfdataextract/commit/1ed5a44e151d3b8d8dfb20b988b22a2fd37f7572

lublak avatar Oct 19 '21 09:10 lublak

Actually, you don't need much to do it yourself. Here is a working piece of code that uses tesseract.js and pdf-to-img:

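    const { createWorker } = require('tesseract.js');
    const { pdf } = require('pdf-to-img');

    // Create an OCR worker that logs progress messages
    const worker = createWorker({
        logger: m => console.log(m)
    });

    (async () => {
        // Render each PDF page to an image; scale 2.0 gives the OCR more pixels to work with
        const doc = await pdf('YOUR FILE PATH HERE', { scale: 2.0 });

        // Load the Tesseract core and the English language data
        await worker.load();
        await worker.loadLanguage('eng');
        await worker.initialize('eng');

        // Run OCR on every rendered page image and print the recognized text
        for await (const page of doc) {
            const { data: { text } } = await worker.recognize(page);
            console.log(text);
        }

        // Shut the worker down to free its resources
        await worker.terminate();
    })();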

busybox11 avatar Nov 22 '21 15:11 busybox11

@busybox11 pdf-to-img currently does the same thing I described: it also uses the canvas library to render an image. However, I would like to provide this functionality in this library myself. I would like to support pure-image in addition to canvas, because native image libraries cannot be used in certain environments. For the OCR functionality, the image that comes from this canvas is then "scanned" with tesseract.js. But I currently have another idea there, too. PDF files consist not only of image information but also of text information, so I want to make my OCR function a bit more intelligent: instead of scanning the complete page, I want to scan only the images. This has the advantage that pages containing both images and text can be read more efficiently.

(Is also currently in progress.)

lublak avatar Nov 22 '21 16:11 lublak
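
Until per-image OCR exists, a coarser per-page fallback gets part of the way there: keep the embedded text where a page has some, and run OCR only on pages that have none. The sketch below is not the per-image approach described above and is not part of pdfdataextract. The names `fillMissingTextWithOcr` and `ocrPage` are hypothetical, and `ocrPage` stands in for whatever rendering and recognition step is used (for example the tesseract.js code earlier in this thread).

```javascript
// Hypothetical helper: combine embedded text with OCR output page by page.
// `pageTexts` is an array with one string per page (empty when the page has
// no embedded text). `ocrPage` is an assumed async callback that takes a
// 0-based page index and resolves to the recognized text for that page.
async function fillMissingTextWithOcr(pageTexts, ocrPage) {
    const result = [];
    // Process pages one after another so a single OCR worker is never
    // asked to recognize two pages at the same time
    for (let i = 0; i < pageTexts.length; i++) {
        const embedded = pageTexts[i] || '';
        if (embedded.trim().length > 0) {
            // Embedded text is available, so OCR is skipped for this page
            result.push(embedded);
        } else {
            result.push(await ocrPage(i));
        }
    }
    return result;
}

// Usage with a stub in place of a real OCR call
fillMissingTextWithOcr(['Embedded text', ''], async (i) => `OCR text for page ${i + 1}`)
    .then((texts) => console.log(texts));
```

Pages that mix embedded text and images still lose the text inside their images with this fallback, which is the gap the per-image idea is meant to close.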

For the time being, I have pushed the current development publicly to a separate branch, which can be found here: https://github.com/lublak/pdfdataextract/tree/contentinfoextractor

lublak avatar Dec 02 '21 14:12 lublak