Bad parsing of PDF file

Open gladykov opened this issue 1 year ago • 1 comments

html.pdf

Title of this PDF:

Sodalitas delectus ipsum aperio facere.

is extracted as

4PEBMJUBTEFMFDUVTJQTVNBQFSJPGBDFSF

This PDF was exported from Confluence by Atlassian.

Output of pdfinfo:

Title:           Sodalitas delectus ipsum aperio facere. - test-automation - Confluence
Creator:         Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/131.0.0.0 Safari/537.36
Producer:        Skia/PDF m131
CreationDate:    Fri Dec 27 07:20:30 2024 -03
ModDate:         Fri Dec 27 07:20:30 2024 -03
Custom Metadata: no
Metadata Stream: no
Tagged:          yes
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           1
Encrypted:       no
Page size:       612 x 792 pts (letter)
Page rot:        0
File size:       15981 bytes
Optimized:       no
PDF version:     1.4

Method used:

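import PDFParser from 'pdf2json';

// `sleep` and `unixifyLineEndings` are helpers defined elsewhere in the
// reporter's project (not shown here); they are not part of pdf2json.
export async function parsePDF(filepath: string) {
    // https://github.com/modesty/pdf2json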
    let parsed = false;

    /* eslint-disable-next-line */
    const pdfParser = new PDFParser(this, true);

    /* eslint-disable-next-line */
    pdfParser.on('pdfParser_dataError', (errData) => console.error(errData.parserError));
    /* eslint-disable-next-line */
    pdfParser.on('pdfParser_dataReady', (_) => {
        parsed = true;
    });

    /* eslint-disable-next-line */
    await pdfParser.loadPDF(filepath);

    let i = 0;
    const max = 5;
    while (!parsed && i < max) {
        await sleep(1, 'Waiting for parsed PDF');
        i += 1;
    }

    if (i === max && !parsed) {
        throw new Error('Timeout while waiting for parsed PDF');
    }

    /* eslint-disable-next-line */
    return unixifyLineEndings(pdfParser.getRawTextContent());
}
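
The polling loop above could also be expressed by waiting on the parser's events directly. Below is a minimal sketch, assuming the parser behaves as a standard Node.js EventEmitter; `waitForEvent` is a hypothetical helper, not part of pdf2json.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical helper (not a pdf2json API): resolves with the payload of
// `readyEvent`, rejects on `errorEvent` or after `timeoutMs` milliseconds.
function waitForEvent<T>(
    emitter: EventEmitter,
    readyEvent: string,
    errorEvent: string,
    timeoutMs: number,
): Promise<T> {
    return new Promise<T>((resolve, reject) => {
        // Remove both listeners and the timer once any outcome is reached.
        const cleanup = () => {
            clearTimeout(timer);
            emitter.off(readyEvent, onReady);
            emitter.off(errorEvent, onError);
        };
        const onReady = (data: T) => {
            cleanup();
            resolve(data);
        };
        const onError = (err: unknown) => {
            cleanup();
            reject(err);
        };
        const timer = setTimeout(() => {
            cleanup();
            reject(new Error(`Timeout while waiting for '${readyEvent}'`));
        }, timeoutMs);
        emitter.once(readyEvent, onReady);
        emitter.once(errorEvent, onError);
    });
}
```

With the code above, it might be used as `const ready = waitForEvent(pdfParser, 'pdfParser_dataReady', 'pdfParser_dataError', 5000);`, created before `loadPDF` is called and awaited afterwards. This only changes how completion is detected; it does not affect the garbled text.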

gladykov, Dec 27 '24 10:12

The first line of text in the sample PDF uses a Type 3 font with a custom encoding, which is not supported at this point. This is the same problem as issue #363. There are two options to move forward:

  1. Submit a PR to support Type 3 font rendering in canvas.js.
  2. Recreate the PDF with a standard TrueType font and standard encoding.
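
One way to see the effect of the unsupported encoding in this particular sample: every character in the garbled output sits exactly 31 code points below the matching letter of the title, and the spaces and the final period are missing. The sketch below checks that observation. The helper name `shiftCodes` and the constant 31 are assumptions derived only from the two strings quoted in this issue; this is not how pdf2json decodes Type 3 fonts, and the offset should not be expected to hold for other PDFs.

```typescript
// Offset inferred from the strings quoted in this issue only (assumption).
const OFFSET = 31;

// Hypothetical helper: shift every UTF-16 code unit of `text` by `offset`.
function shiftCodes(text: string, offset: number): string {
    return Array.from(text, (ch) =>
        String.fromCharCode(ch.charCodeAt(0) + offset),
    ).join("");
}

const garbled = "4PEBMJUBTEFMFDUVTJQTVNBQFSJPGBDFSF";
// The garbled output has no counterpart for spaces or the period,
// so drop them from the title before comparing.
const expected = "Sodalitas delectus ipsum aperio facere.".replace(/[ .]/g, "");

console.log(shiftCodes(garbled, OFFSET) === expected); // true
```

The match suggests the text was emitted without the font's own encoding being applied, which is consistent with the explanation above, but it is a diagnostic for this file rather than a workaround.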

modesty, Dec 30 '24 01:12

PR https://github.com/modesty/pdf2json/pull/401 adds support for Type 3 glyph fonts, so this should be fixed.

modesty, Aug 24 '25 18:08

Thank you!

gladykov, Sep 08 '25 10:09