HexaPDF Fails to Detect Pages in a PDF Document

Open Jorge-Signwell opened this issue 1 year ago • 3 comments

Issue: HexaPDF Fails to Detect Pages in a PDF Document

Description:

I am encountering an issue with the HexaPDF gem where it fails to detect the pages in a PDF document. The document has 17 pages, but when I attempt to open it and count the pages using HexaPDF, it returns a count of 0.

Steps to Reproduce:

  1. Create a PDF document with multiple pages (the document I used has 17 pages).

  2. Use the following Ruby script to open the PDF and count its pages:

    require 'hexapdf'
    
    path = './output.pdf'
    
    # Stop early if the file does not exist, then open it
    abort("File not found: #{path}") unless File.exist?(path)
    document = HexaPDF::Document.open(path)
    
    puts document.pages.count
    
  3. Run the script.

    ❯ ruby main.rb
    0
    

Expected Behavior:

The script should output the correct number of pages in the PDF (17 in this case).

Actual Behavior:

The script outputs 0, indicating that no pages are detected in the PDF.

Additional Information:

❯ hexapdf info --check output.pdf 

WARNING: Parse error at position 0: PDF file trailer with end-of-file marker not found - trying cross-reference table reconstruction
WARNING: Validation error for trailer: ID field should always be set (correctable)
WARNING: Validation error for sub-object of object type Catalog (2,0): A PDF document needs a page tree (correctable)
WARNING: Validation error for object type Pages (407,0): A PDF document needs at least one page (correctable)
ERROR: Stream of object (73,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (74,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (75,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (76,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (77,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (78,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (79,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (80,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (66,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (69,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (71,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (72,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (81,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (82,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (89,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (90,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (91,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (92,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (93,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (94,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (97,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (101,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (107,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (110,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (116,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (122,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (125,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (126,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (132,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (141,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (145,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (153,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (163,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (170,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (173,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (176,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (179,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (182,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (185,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (188,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (191,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (194,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (195,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (196,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (208,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (209,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (232,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (233,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (245,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (246,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (256,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (270,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (272,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (273,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (274,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (275,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (276,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (277,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (278,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (280,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (282,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (284,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (285,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (287,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (289,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (292,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (293,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (295,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (297,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (300,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (302,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (304,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (306,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (308,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (309,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (311,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (314,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (315,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (317,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (320,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (337,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (360,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (367,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (368,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (369,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (370,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (371,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (372,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (373,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (374,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (375,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (376,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (377,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (378,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (379,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (380,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (381,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (382,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (383,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (384,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (385,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (386,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (387,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (388,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (389,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (390,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (391,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (392,0) invalid: Problem while decoding Flate encoded stream: unknown compression method
ERROR: Stream of object (393,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (394,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (395,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (396,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (397,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (399,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (400,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (401,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (402,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (403,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (404,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
ERROR: Stream of object (405,0) invalid: Problem while decoding Flate encoded stream: incorrect header check
File name:          output.pdf
File size:          1846563 bytes
Pages:              1
Version:            1.5
Reconstructed:      yes (use --check for details)
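
For context on the ERROR lines above: `incorrect header check` and `unknown compression method` are standard zlib diagnostics, raised when the first two bytes handed to the decoder are not a valid zlib header. The following is a minimal sketch using only Ruby's standard `zlib` library, not HexaPDF's own filter code, and the helper name `inflate_error` is made up for illustration. It shows how data that does not start with a proper zlib header is rejected. Whether that is what happens to the streams in this file cannot be told from the log alone.

```ruby
require 'zlib'

# Hypothetical helper: returns nil if +data+ inflates cleanly,
# otherwise the message of the Zlib::Error that was raised.
def inflate_error(data)
  Zlib::Inflate.inflate(data)
  nil
rescue Zlib::Error => e
  e.message
end

# A well-formed zlib stream, as a FlateDecode filter would expect it.
valid = Zlib::Deflate.deflate('BT /F1 12 Tf (Hello) Tj ET')

# Arbitrary bytes: the two header bytes fail zlib's header checksum.
garbage = 'not zlib data'.b

# Header checksum passes, but the compression method field is not 8 (deflate).
bad_method = "\x00\x00".b

[valid, garbage, bad_method].each do |data|
  puts inflate_error(data).inspect
end
```

If the decoder is given bytes that do not begin at the real start of the compressed data, errors of this kind would be expected for every affected stream.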

Please let me know if you need any additional information to investigate this issue. Thank you for your assistance.

Jorge-Signwell, May 17 '24 19:05

Okay, that file seems to be corrupt: HexaPDF has to fall back to cross-reference reconstruction, which means it parses the file from top to bottom and tries to find all PDF objects. The current algorithm works for many slightly corrupt or invalid files, but certainly not for all of them.
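
The general idea of such a reconstruction can be sketched as follows. This is a simplified illustration in plain Ruby, not HexaPDF's actual algorithm; the helper names `startxref_offset` and `scan_objects` are hypothetical. It first looks for the `startxref ... %%EOF` trailer at the end of the data and, if that is missing, scans from top to bottom for `N G obj` headers and records their byte offsets.

```ruby
# Hypothetical helper: returns the offset given after "startxref", or nil
# if the trailer with the end-of-file marker cannot be found in the tail.
def startxref_offset(data)
  bytes = data.b
  tail = bytes.byteslice(-1024, 1024) || bytes
  match = /startxref\s+(\d+)\s+%%EOF\s*\z/n.match(tail)
  match && match[1].to_i
end

# Hypothetical helper: scans top to bottom and records the byte offset of
# every "N G obj" header. A later definition of the same object overwrites
# an earlier one. Naive on purpose: it would also match such text inside
# stream data, which a real parser has to guard against.
def scan_objects(data)
  offsets = {}
  data.b.scan(/(?<![0-9])(\d+)\s+(\d+)\s+obj\b/n) do
    match = Regexp.last_match
    offsets[[match[1].to_i, match[2].to_i]] = match.begin(0)
  end
  offsets
end

# Tiny stand-in for a file whose trailer is missing.
sample = "%PDF-1.5\n1 0 obj\n<< /Type /Catalog >>\nendobj\n" \
         "2 0 obj\n<< /Type /Pages >>\nendobj\n".b

if startxref_offset(sample).nil?
  puts 'trailer not found, trying reconstruction'
  scan_objects(sample).each do |(num, gen), pos|
    puts "object (#{num},#{gen}) at byte #{pos}"
  end
end
```

For a real file, the data would come from `File.binread(path)`. A scan like this only recovers where objects start; deciding which recovered object is the catalog or the page tree root is a separate step, and that is the part that can fail on unusual files.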

To find out why HexaPDF can't reconstruct the page tree, I would need to inspect the file. If possible, attach it to the issue; otherwise, please send it to [email protected]. Without the file I won't be able to help you.

gettalong, May 17 '24 21:05

Hi @gettalong,

Following up on my previous comment, I've sent the corrupt PDF file to [email protected] for your reference.

Hopefully, this will help diagnose the issue with reconstructing the page tree.

Thanks again for your help!

Jorge-Signwell, May 21 '24 19:05

@Jorge-Signwell Thanks for the file! I found the problem and will implement a fix.

gettalong, May 21 '24 21:05

@Jorge-Signwell I have fixed the parsing of such invalid files, and they work fine now. Release 0.43.0 with the fix will be available within the hour.

gettalong, May 26 '24 21:05