
Extraction issue - Letter horizontal positioning / compressed glyph widths (Aspose‑generated PDF)

Open jonathandlo opened this issue 3 months ago • 5 comments

Hi there,

I have attached a sample PDF document with the following traits:

  • Created with Aspose PDF for Java
  • A Y-inverted CTM (the very first operator is 1 0 0 -1 0 792 cm)
  • Text objects with their own Y inversion
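For reference, here is a worked version of what that first operator does, using the standard PDF convention that a b c d e f cm maps (x, y) to (ax + cy + e, bx + dy + f). The text-matrix form on the second line is only an assumption about what "their own Y inversion" looks like in this file; t_x and t_y are placeholder symbols, not values read from sample.pdf.

```latex
\begin{aligned}
\text{CTM} = [\,1\ 0\ 0\ {-1}\ 0\ 792\,] &:\quad (x,\,y) \mapsto (x,\ 792 - y) \\
T_m = [\,1\ 0\ 0\ {-1}\ t_x\ t_y\,] \text{ (assumed)} &:\quad (x,\,y) \mapsto (x + t_x,\ t_y - y) \\
T_m \text{ followed by CTM} &:\quad (x,\,y) \mapsto (x + t_x,\ 792 - t_y + y)
\end{aligned}
```

Under that assumption the two flips cancel, so glyphs are upright and the text origin sits t_y points below the top edge if the page is 792 points tall.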

I observe the following issues:

  • Page.Letters (and the derived Words/Blocks) report letter positions that are too close together (the resulting bounding boxes appear “compressed” when reconstructed).
  • In cases where a paragraph should be continuous across the page, the Letter positions "reset" mid-line, which causes block extraction to treat the following text as a new block. I imagine this is because the previous Letter position ends up too far back (this case is not present in the sample file).
  • In the screenshot below, drawing the bounding boxes results in a flipped vertical axis. This is rectified by resaving the PDF with "File.WriteAllBytes(path, pageBuilder.Build());" and then drawing on the new file. The flip only affects drawing: the extracted Letter objects have the correct Y position but retain the compressed spacing.

Top of page: Image

Bottom of page (Blocks and Lines Marked): Image

File: sample.pdf

I was running PdfPig 0.1.12-alpha-20250728 for the screenshots. I then updated to the latest NuGet package, PdfPig 0.1.12-alpha-20251002. This slightly changed the line heights, but the compressed glyph widths and the upside-down drawing persisted: Image

Please let me know if you have any insight or need anything else from me. Thank you!

jonathandlo Oct 03 '25 01:10

@jonathandlo I think there are 2 issues here:

  • The text is at the top of the document, but when you draw it - it is at the bottom.
    • To fix that, you need to correct the location of the bboxes using the height of the page (pageHeight - bbox.Y). This is expected, as the PDF coordinate system runs bottom to top, while drawing libraries typically use a coordinate system that runs top to bottom. If I misunderstood, let me know.
  • The second issue is that the letters (and bounding boxes) are not correct, independent of the coordinate system. I believe this is due to how letters or matrices are processed in this document. I'll try to have a look.
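As a small worked example of the first point, assume a 792-point-tall page (the value in the cm operator from the report) and a hypothetical bounding box. H, y_bottom and y_top are illustrative symbols, not PdfPig property names.

```latex
\begin{aligned}
y'_{\text{top}} &= H - y_{\text{top}} \\
y'_{\text{bottom}} &= H - y_{\text{bottom}} \\
H = 792,\; y_{\text{bottom}} = 700,\; y_{\text{top}} = 712 \;&\Rightarrow\; y'_{\text{top}} = 80,\; y'_{\text{bottom}} = 92
\end{aligned}
```

The box keeps its height of 12 but is now measured down from the top edge instead of up from the bottom.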

This is what the letters render to: Image

BobLd Oct 15 '25 12:10

Hi @BobLd,

  • What would be the proper way to draw to the page for debugging? In the example I provided, the text itself is also rendered to the page upside down. This issue (and the coordinate system flip) is resolved when PdfPig saves the file and reopens it. I am currently using the document marking code from the documentation.
  • Yes, this is my understanding of the second issue as well. Thank you for looking into it.

jonathandlo Oct 17 '25 22:10

Hi @jonathandlo, I'll have time this weekend to have a look. Do you mind sharing the document as saved by PdfPig?

I'll compare the two to try to understand the difference.

BobLd Nov 01 '25 08:11

Hi @BobLd, sorry for the delay; I've been busy as of late.

I can get that to you, but you could also obtain it by running the marking code from the wiki, just without the marking.

The bigger issue though is the glyph/character width in the sample.pdf, as that is causing block and line detection to break often with the files I am working with.

jonathandlo Nov 05 '25 05:11

Hi there, is there anything else I can provide to help with the glyph/character bounding box and width issue? Because of it, I currently cannot reliably extract a full line of paragraph text from the files I am working with. I can try to find another scrubbed sample where this happens on the latest version, if that would help.

jonathandlo Dec 04 '25 00:12