pdf2json

Blank text file when parsing a PDF to create a .txt file, but it works from the command line

Open jkomaragiri opened this issue 8 years ago • 10 comments

The code provided for creating a text file produces a blank text file. Running the same conversion from the command line works, however: it produces both the JSON file and the text file.

let fs = require('fs'), PDFParser = require("./pdf2json/PDFParser");

let pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError) );
pdfParser.on("pdfParser_dataReady", pdfData => {
    fs.writeFile("./pdf2json/test/F1040EZ.content.txt", pdfParser.getRawTextContent());
});

pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

My PDF file contains only text.

This has also been raised on Stack Overflow, but no one has been able to resolve it.

Hope you can help.

Regards, Jai

jkomaragiri • Aug 01 '16 08:08

Having the same issue... getRawTextContent() outputs a blank string. Hope this is solved soon...

var PDFParser = require("pdf2json/pdfParser");

var pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError) );
pdfParser.on("pdfParser_dataReady", pdfData => {

    console.log("Raw - ");
    console.log(pdfParser.getRawTextContent());
});
pdfParser.loadPDF("./test3.pdf");

AshishGogna • Aug 01 '16 09:08

Having same issue here too.

var fs = require("fs");

// https://github.com/modesty/pdf2json
var PDFParser = require("./node_modules/pdf2json/PDFParser");
var pdfParser = new PDFParser();


pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
pdfParser.on("pdfParser_dataReady", pdfData => {
    //console.log(JSON.stringify(pdfData))
    // console.log(pdfParser)
    // console.log(pdfParser.pages)
    console.log(pdfParser.getRawTextContent())
    fs.writeFile("./content.txt", pdfParser.getRawTextContent());
    // fs.writeFile("./content.json", JSON.stringify(pdfData));
});

pdfParser.loadPDF("./asdf.pdf");

xdvarpunen • Aug 01 '16 12:08

http://stackoverflow.com/questions/37757670/pdf2json-gives-me-a-blank-output-txt-file

xdvarpunen • Aug 01 '16 12:08

For getRawTextContent() to return any text, pdf.js requires the needRawText attribute to be truthy: https://github.com/modesty/pdf2json/blob/master/lib/pdf.js#L223

xdvarpunen • Aug 01 '16 12:08

Okay guys, the front-page documentation is a bit wrong! To make this work, simply pass null and 1 as the PDFParser constructor parameters.

xdvarpunen • Aug 01 '16 12:08

This one works:

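var fs = require("fs");

// https://github.com/modesty/pdf2json
var PDFParser = require("./node_modules/pdf2json/PDFParser");
// The second argument (needRawText) must be truthy, otherwise getRawTextContent() returns a blank string
var pdfParser = new PDFParser(this, 1);

pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
pdfParser.on("pdfParser_dataReady", pdfData => {
    console.log(pdfParser)
    // Recent Node versions require a callback for fs.writeFile
    fs.writeFile("./content.txt", pdfParser.getRawTextContent(), err => {
        if (err) console.error(err);
    });
});

// Same input file as in my earlier snippet
pdfParser.loadPDF("./asdf.pdf");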

HTH -XDVarpunen

xdvarpunen • Aug 01 '16 12:08
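
Since the raw text is only available inside the pdfParser_dataReady handler, it can be convenient to wrap one parser run in a Promise. The following is a minimal sketch, not part of pdf2json: `loadRawText` is a hypothetical helper name, and it relies only on the `on`, `loadPDF`, and `getRawTextContent` calls already used in this thread. The parser passed in is assumed to have been created with a truthy second constructor argument.

```javascript
// Hypothetical helper (not part of pdf2json): resolves with the raw text once
// the parser emits "pdfParser_dataReady", rejects on "pdfParser_dataError".
function loadRawText(pdfParser, pdfPath) {
    return new Promise((resolve, reject) => {
        pdfParser.on("pdfParser_dataError", errData => reject(errData.parserError));
        pdfParser.on("pdfParser_dataReady", () => resolve(pdfParser.getRawTextContent()));
        // Register the handlers first, then start the asynchronous load.
        pdfParser.loadPDF(pdfPath);
    });
}

// Possible usage, assuming the require path and constructor arguments from this thread:
// var fs = require("fs");
// var PDFParser = require("./node_modules/pdf2json/PDFParser");
// loadRawText(new PDFParser(this, 1), "./asdf.pdf")
//     .then(text => fs.writeFile("./content.txt", text, err => { if (err) console.error(err); }))
//     .catch(console.error);
```

The helper takes the parser as an argument so that the caller stays in control of how it is constructed.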

Thanks a lot @xdvarpunen !

AshishGogna • Aug 01 '16 12:08

I'll put a fix into a pull request soon c:

-XDVarpunen

xdvarpunen • Aug 01 '16 12:08

Answered the Stack Overflow question too, @jkomaragiri c:

xdvarpunen • Aug 01 '16 12:08

This code parses two PDF files, converts each to raw text, removes the line starting with 'Generated', and then compares the two resulting text files. The method is called several times in sequence, with different arguments on each call, but the raw content is not replaced for the last call.

function compareGeneratedReportContent(samplePDFFile, sampleXLSXText) {
    const pdfParser = new PDFParser(this, true)
    const pdfParser2 = new PDFParser(this, true)
    const pdfExportPath = path.join(__dirname, '../../resources/report-test/PDFExportedFile.pdf')
    const xlsxExportPath = path.join(__dirname, '../../resources/report-test/XLSXExportedFile.xlsx')
    let content1 = ''
    let content2 = ''
    let result = false

    pdfParser2.on('pdfParser_dataError', errData => console.log(errData))
    pdfParser2.on('pdfParser_dataReady', () => {
        content2 = pdfParser2.getRawTextContent().replace(/^.*(Generated).+$/mg, '')
        // console.log('content2 ', content2)
        fs.writeFileSync(path.join(__dirname, '../../resources/report-test/sample.txt'), content2, 'utf-8')
    })
    pdfParser2.loadPDF(path.join(__dirname, `../../resources/report-test/${samplePDFFile}.pdf`))

    pdfParser.on('pdfParser_dataError', errData => console.log(errData))
    pdfParser.on('pdfParser_dataReady', () => {
        content1 = pdfParser.getRawTextContent().replace(/^.*(Generated).+$/mg, '')
        // console.log('content1 ', content1)
        fs.writeFileSync(path.join(__dirname, '../../resources/report-test/generated.txt'), content2, 'utf-8')
    })
    pdfParser.loadPDF(pdfExportPath)

    let readContent2 = fs.readFileSync(path.join(__dirname, '../../resources/report-test/sample.txt'), 'utf-8')
    let readContent1 = fs.readFileSync(path.join(__dirname, '../../resources/report-test/generated.txt'), 'utf-8')
    if (readContent2 === readContent1) {
        console.log('Report pdf file content matches')
        result = true
    } else {
        console.log('Error in matching contents of report pdf')
        result = false
    }
}

Can anybody help? Regards, Fazi

Fazila-A • Apr 11 '18 14:04
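
One likely cause in the function above: loadPDF returns before parsing has finished, so the readFileSync calls at the end run before either pdfParser_dataReady handler has written its file, and each call ends up comparing whatever an earlier call left on disk. In addition, the handler for generated.txt writes content2 rather than content1. A minimal sketch of one way to sequence this, assuming only the `on`, `loadPDF`, and `getRawTextContent` calls used in this thread; `rawTextOf`, `stripGenerated`, `compareRawText`, and the `makeParser` factory are hypothetical names introduced for illustration:

```javascript
// Hypothetical helper: wraps one parser run in a Promise.
function rawTextOf(pdfParser, pdfPath) {
    return new Promise((resolve, reject) => {
        pdfParser.on("pdfParser_dataError", errData => reject(errData));
        pdfParser.on("pdfParser_dataReady", () => resolve(pdfParser.getRawTextContent()));
        pdfParser.loadPDF(pdfPath);
    });
}

// Same regular expression as in the post above.
const stripGenerated = text => text.replace(/^.*(Generated).+$/mg, "");

// makeParser is a factory so that every call gets fresh parser instances,
// e.g. () => new PDFParser(this, true) as in the post above.
async function compareRawText(makeParser, samplePdfPath, exportedPdfPath) {
    // Wait for BOTH parsers to finish before comparing anything.
    const [sampleText, exportedText] = await Promise.all([
        rawTextOf(makeParser(), samplePdfPath),
        rawTextOf(makeParser(), exportedPdfPath),
    ]);
    return stripGenerated(sampleText) === stripGenerated(exportedText);
}
```

Because the comparison happens in memory after both promises resolve, the intermediate sample.txt and generated.txt files are not needed, and the caller has to await the result (or use .then) instead of reading a value synchronously.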