dialoqbase

PDF import fails.

Open cidrugHug8 opened this issue 2 years ago • 6 comments

Hi, the following error appeared in the Docker log.

loading pdf
Warning: Indexing all PDF objects
Error
    at InvalidPDFExceptionClosure (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:452:35)
    at Object.<anonymous> (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:455:2)
    at __w_pdfjs_require__ (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:45:30)
    at Object.<anonymous> (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:7939:23)
    at __w_pdfjs_require__ (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:45:30)
    at /app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:88:18
    at /app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:91:10
    at webpackUniversalModuleDefinition (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:18:20)
    at Object.<anonymous> (/app/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:25:3)
    at Module._compile (node:internal/modules/cjs/loader:1254:14) {
  message: 'Invalid PDF structure'
}

cidrugHug8 avatar Jun 08 '23 05:06 cidrugHug8

Hi there, I'm sorry for the PDF loading issue you encountered. Could you please confirm whether you used a protected PDF file? This information will help me better understand the problem and provide you with the appropriate solution. Thank you!

n4ze3m avatar Jun 08 '23 06:06 n4ze3m

UPDATE: I tested with a password-protected PDF file, and it failed to process. I will figure out how to resolve this issue.

n4ze3m avatar Jun 08 '23 06:06 n4ze3m

Thank you for your quick reply. The PDF file is not encrypted. The pdfinfo results are as follows.

$ pdfinfo  Engineer\ Reference.pdf 
Title:           My Document
Subject:         
Keywords:        
Author:          yamada
Producer:        madbuild
CreationDate:    Wed Jun 22 11:02:51 2022 JST
ModDate:         Wed Jun 22 11:02:51 2022 JST
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           175
Encrypted:       no
Page size:       595.28 x 841.89 pts (A4)
Page rot:        0
File size:       25883386 bytes
Optimized:       no
PDF version:     1.4

cidrugHug8 avatar Jun 08 '23 06:06 cidrugHug8

Same here, I get this error: PDF.js v2.9.359 (build: e667c8cbc) Message: Invalid PDF structure.

It does not happen with all PDFs, only when the PDF is fairly large.

noureldinz3r0 avatar Jun 11 '23 16:06 noureldinz3r0

Yes, I am currently using pdf-parse as a document loader, but it cannot handle large files. So, I am trying to set up a custom loader that will split large PDFs into smaller ones and then feed them to pdf-parse.

n4ze3m avatar Jun 11 '23 17:06 n4ze3m

I have the same problem, and splitting the large file into multiple 25-page sections works.

So for anyone who has the same issue, for now you could use PyPDF2 in Python to split the PDF into separate files:

  1. pip install PyPDF2
  2. Save the following script as pdf_splitter.py:

import PyPDF2

pdf_path = "path/to/your/pdf.pdf"
pages_per_file = 25

# Keep the input file open while pages are copied: the reader loads page data on demand.
with open(pdf_path, "rb") as file:
    reader = PyPDF2.PdfReader(file)
    total_pages = len(reader.pages)

    file_number = 1
    page_count = 0
    writer = PyPDF2.PdfWriter()

    for page_number in range(total_pages):
        writer.add_page(reader.pages[page_number])
        page_count += 1

        # Write a chunk once it is full or the last page has been reached
        if page_count == pages_per_file or page_number == total_pages - 1:
            output_filename = f"output_file_{file_number}.pdf"
            with open(output_filename, "wb") as output_file:
                writer.write(output_file)

            # Reset the page count and create a new writer for the next file
            page_count = 0
            file_number += 1
            writer = PyPDF2.PdfWriter()

  3. python pdf_splitter.py
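
To preview how many files the split will produce before running it, the chunking rule (fixed-size groups of pages, with a shorter final group) can be written as a small pure function. This is only a sketch: the name `chunk_ranges` is made up for illustration and is not part of PyPDF2 or dialoqbase.

```python
from typing import List


def chunk_ranges(total_pages: int, pages_per_file: int = 25) -> List[range]:
    """Return zero-based page ranges, one per output file (hypothetical helper)."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    # Each chunk starts at a multiple of pages_per_file; the last one may be shorter.
    return [
        range(start, min(start + pages_per_file, total_pages))
        for start in range(0, total_pages, pages_per_file)
    ]
```

For the 175-page file reported above, `chunk_ranges(175)` gives seven ranges of 25 pages each, so the script would write output_file_1.pdf through output_file_7.pdf.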

MY221B avatar Jun 12 '23 14:06 MY221B

Hey guys, I have created a custom PDF loader in v0.0.12. I hope it resolves the issue with large PDF files. Please try the latest version and let me know.

Note that the PDF loader still can't load protected PDF files.
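
Since protected files are still rejected, it may help to screen them out before uploading. The sketch below is a rough heuristic, not how dialoqbase checks files: it assumes an encrypted PDF carries an `/Encrypt` entry in its trailer and simply searches the raw bytes for that key. The function name `looks_encrypted` is hypothetical. A real parser is more reliable, for example `pdfinfo` (the `Encrypted:` line shown earlier in this thread) or PyPDF2's `PdfReader.is_encrypted`.

```python
from pathlib import Path


def looks_encrypted(pdf_path: str) -> bool:
    """Heuristic check: True if the raw file contains an /Encrypt key.

    Assumption: encrypted PDFs reference an encryption dictionary via
    /Encrypt in the trailer. This can give a false positive if the same
    byte sequence happens to appear elsewhere in the file.
    """
    data = Path(pdf_path).read_bytes()
    return b"/Encrypt" in data
```

A file for which this returns True is worth checking with `pdfinfo` before upload.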

n4ze3m avatar Jun 19 '23 16:06 n4ze3m

I'm able to upload PDFs with thousands of pages with v0.0.12. That fixed it for me.

l4time avatar Jun 20 '23 15:06 l4time

Closing this issue based on the comment above. Feel free to reopen if the problem still exists. Thank you.

n4ze3m avatar Jun 27 '23 08:06 n4ze3m