Extraction issue - Letter horizontal positioning / compressed glyph widths (Aspose‑generated PDF)
Hi there,
I have attached a sample PDF document with the following traits:
- Created with Aspose PDF for Java
- A Y-inverted CTM (the very first operator is 1 0 0 -1 0 792 cm)
- Text objects with their own Y inversion
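For readers less familiar with the cm operator, here is a minimal sketch of what that first operator does to a point. It uses the standard PDF matrix convention (x' = a*x + c*y + e, y' = b*x + d*y + f) and is plain Python arithmetic, not PdfPig code; the function name apply_cm is hypothetical.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]


def apply_cm(matrix: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Apply a PDF transformation matrix [a b c d e f] to the point (x, y).

    Hypothetical helper for illustration only; it is not part of PdfPig.
    """
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# The first operator in the sample file: 1 0 0 -1 0 792 cm
Y_FLIP: Matrix = (1, 0, 0, -1, 0, 792)

# A point 50 units from one edge ends up 742 units from it (792 - 50),
# i.e. the Y axis is mirrored across the 792-unit page height.
print(apply_cm(Y_FLIP, 100, 50))
```

Any text object that applies its own Y inversion on top of this flips the axis a second time, which is why the two need to be composed in the right order to get sensible glyph positions.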
I observe the following issues:
- Page.Letters (and derived Words/Blocks) report letter positions that are too close together (resulting bounding boxes appear “compressed” when reconstructed).
- In cases where a paragraph should be continuous across the page, the Letter positions "reset" mid-line, which causes block extraction to treat the rest of the text as a new block. I imagine this is because the previous Letter position ends up too far back. (This case is not present in the sample file.)
- In the screenshot below, attempting to draw bounding boxes results in a flipped vertical axis, although this is rectified by resaving the PDF with "File.WriteAllBytes(path, pageBuilder.Build());" and then drawing on that new PDF file. This only affects drawing: the extracted Letter objects have the correct Y position but retain the compressed spacing.
Top of page:
Bottom of page (Blocks and Lines Marked):
File: sample.pdf
I was running PdfPig 0.1.12-alpha-20250728 for the screenshots.
I then updated to the latest NuGet package, PdfPig 0.1.12-alpha-20251002. This slightly changed the line heights, but the compressed glyph widths and upside-down drawing persisted:
Please let me know if you have any insight or need anything else from me. Thank you!
@jonathandlo I think there are 2 issues here:
- The text is at the top of the document, but when you draw it, it is at the bottom.
- To fix that, you need to correct the location of the bboxes by the height of the page (pageHeight - bbox.Y). This is expected, as the PDF coordinate system runs bottom to top, while drawing libraries use a coordinate system that runs top to bottom. If I misunderstood, let me know.
- The second issue is that the letters (and bounding boxes) are not correct, independent of the coordinate system. I believe this is due to how letters or matrices are processed in this document. I'll try to have a look.
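The correction in the first point can be sketched as follows. This is plain Python arithmetic rather than PdfPig code, and the PdfBox and DrawRect types and the to_top_left function are hypothetical names used only for illustration. It assumes the bounding box is given in PDF space (origin at the bottom-left) and that the drawing library expects a top-left origin with a rectangle's top-left corner as its anchor.

```python
from typing import NamedTuple


class PdfBox(NamedTuple):
    """Bounding box in PDF space (origin bottom-left). Hypothetical type."""
    left: float
    bottom: float
    right: float
    top: float


class DrawRect(NamedTuple):
    """Rectangle in drawing space (origin top-left). Hypothetical type."""
    x: float
    y: float
    width: float
    height: float


def to_top_left(box: PdfBox, page_height: float) -> DrawRect:
    """Convert a PDF-space box to a top-left-origin rectangle."""
    return DrawRect(
        x=box.left,
        # The box's top edge becomes the rectangle's y in drawing space.
        y=page_height - box.top,
        width=box.right - box.left,
        height=box.top - box.bottom,
    )


# A 100 x 12 box near the top of a 792-unit-high page.
rect = to_top_left(PdfBox(72.0, 700.0, 172.0, 712.0), page_height=792.0)
print(rect)
```

Only the Y coordinate changes; widths and heights are unaffected, so this conversion cannot explain or repair the compressed glyph widths described in the second point.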
This is what the letters render to:
Hi @BobLd,
- What would be the proper way to draw to the page for debugging? In the example I provided, the text itself is also rendered to the page upside down. This issue (and the coordinate system flip) is resolved when PdfPig saves the file and reopens it. I am currently using the document marking code from the documentation.
- Yes, this is my understanding of the issue as well. Thank you for looking into it
Hi @jonathandlo, I'll have time this weekend to have a look. Do you mind sharing the document saved by PdfPig?
I'll compare the two to try to understand.
Hi @BobLd, sorry for the delay; I've been busy as of late.
I can get that to you, but you could also obtain it by running the marking code from the wiki, just without the marking.
The bigger issue though is the glyph/character width in the sample.pdf, as that is causing block and line detection to break often with the files I am working with.
Hi there, is there anything else I can provide to help with the glyph/character bounding box and width issue? Because of it, I currently cannot reliably extract a full line of paragraph text from the files I am working with. If it would help, I can try to find another scrubbed sample where this happens on the latest version.