PdfPig

Not zero width for undetected characters

Open ivan-perezhogin-sto opened this issue 3 years ago • 5 comments

Hi. First of all, thank you for such a great library and all your hard work; it really helps me a lot.

I have a document with a lot of tab characters (\t) inside text set in a Type1Standard14Font, which doesn't contain this character. I don't know why, but you set a width of 2.5 for such unknown symbols, and this leads to unreliable bounding boxes for the other characters in the same text operator. All the renderers and readers I've tested don't show such characters at all (so the width for them is 0), so I think it's better to use zero width, or at least to add an option (a delegate or just a property) for this case in ParsingOption.

Sorry that I haven't attached the file; it contains some personal info, so I can't. Here is a screenshot of how such text operators look inside the stream and how the coordinates look on the rendered page: [image: TabulationWidth]
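To make the bounding-box effect concrete, here is a minimal sketch (not PdfPig code) of how the advance assigned to an unknown glyph shifts every later glyph in the same text-showing operator. It uses the horizontal displacement rule from the PDF specification, tx = (w0 * Tfs + Tc) * Th, with word spacing left out for brevity. The function name `glyph_origins`, the width table, and the font size of 10 are assumptions chosen for illustration; the fallback of 250 glyph-space units is the figure mentioned later in this thread.

```python
def glyph_origins(text, widths, font_size, fallback_width,
                  char_spacing=0.0, h_scale=1.0):
    """Return the x origin (text space) of each glyph in one shown string.

    widths maps a character to its advance in glyph space (1/1000 em).
    fallback_width is used for characters the font does not define.
    """
    x = 0.0
    origins = []
    for ch in text:
        origins.append(x)
        w0 = widths.get(ch, fallback_width) / 1000.0
        # Displacement per glyph: (w0 * Tfs + Tc) * Th
        x += (w0 * font_size + char_spacing) * h_scale
    return origins


# Illustrative widths only; the tab character is deliberately missing.
WIDTHS = {"A": 667, "B": 667}

shifted = glyph_origins("A\tB", WIDTHS, font_size=10, fallback_width=250)
collapsed = glyph_origins("A\tB", WIDTHS, font_size=10, fallback_width=0)

# How far "B" moves purely because of the fallback advance for the tab.
offset_of_b = shifted[2] - collapsed[2]
```

With a zero fallback the tab and the following "B" share the same origin; with a non-zero fallback, "B" and everything after it is pushed to the right by the fallback advance scaled by the font size.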

ivan-perezhogin-sto, Jan 27 '22 02:01

While trying to solve this issue I found one more problem. The TextSequence property of letters never changes (it is always 0) while the parser goes through Tj operators, so it's impossible to group letters by the operator they were placed in. It also seems that all the reading-order detectors (IReadingOrderDetector) in DocumentLayoutAnalysis rely on this property and do not work properly because of that.

ivan-perezhogin-sto, Jan 28 '22 04:01

After trying to resolve the issue, I ended up replacing the \t symbols in the content stream with spaces, and now everything seems to work fine. So I was wrong that PDF readers ignore such a character: it is not rendered, but it still has a non-zero width (and not 250).
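A workaround of this kind could look roughly like the sketch below. It is not the code actually used here, only an illustration under stated assumptions: the content stream is already decompressed, the font is single-byte, and tabs appear either as raw 0x09 bytes or as the `\t` escape inside literal strings written in parentheses. Hex strings, the octal escape `\011`, and inline image data are not handled. The function name `replace_tabs_in_literal_strings` is hypothetical.

```python
def replace_tabs_in_literal_strings(stream: bytes) -> bytes:
    """Replace tabs with spaces inside (...) literal strings only."""
    out = bytearray()
    depth = 0  # nesting level of unescaped parentheses
    i = 0
    while i < len(stream):
        b = stream[i]
        if depth > 0 and b == 0x5C and i + 1 < len(stream):
            # Backslash escape inside a string: consume both bytes at once
            # so that an escaped parenthesis never changes the depth.
            if stream[i + 1] == ord("t"):
                out.append(0x20)          # "\t" escape becomes a space
            else:
                out += stream[i:i + 2]    # keep any other escape untouched
            i += 2
            continue
        if b == ord("("):
            depth += 1
        elif b == ord(")") and depth > 0:
            depth -= 1
        elif b == 0x09 and depth > 0:
            b = 0x20                      # raw tab inside a string
        out.append(b)
        i += 1
    return bytes(out)


# A tab used as whitespace between operators is left alone;
# both tab forms inside the string operand become spaces.
patched = replace_tabs_in_literal_strings(b"BT\t(A\tB\\tC) Tj ET")
```

Tabs outside of strings are left alone because there they are ordinary whitespace between tokens and do not produce glyphs.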

ivan-perezhogin-sto, Jan 31 '22 02:01

Just a quick note on your TextSequence: I'm a bit surprised... I'm almost 100% sure that it was working when I developed these IReadingOrderDetector... so either something changed since then, or your document is particular...

BobLd, Jan 31 '22 12:01

> Just a quick note on your TextSequence: I'm a bit surprised... I'm almost 100% sure that it was working when I developed these IReadingOrderDetector... so either something changed since then, or your document is particular...

Judging by the code, it is incremented only for the TJ operator (and not for Tj).
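The effect described above can be reproduced with a toy model. This is only a sketch, not PdfPig's implementation: operations are simplified to (operator, text) pairs, a TJ array is assumed to be already flattened to its text, and the names `assign_sequences` and `group_by_sequence` are hypothetical. It compares a counter bumped only on TJ with one bumped on both text-showing operators.

```python
from itertools import groupby


def assign_sequences(operations, count_tj_lower):
    """Tag each letter with a sequence number, as a toy stand-in for TextSequence.

    operations: list of (operator, text) pairs, operator being "Tj" or "TJ".
    count_tj_lower: if False, only TJ bumps the counter (behaviour reported above).
    """
    sequence = 0
    letters = []
    for operator, text in operations:
        if operator == "TJ" or (operator == "Tj" and count_tj_lower):
            sequence += 1
        letters.extend((ch, sequence) for ch in text)
    return letters


def group_by_sequence(letters):
    """Join consecutive letters that share a sequence number."""
    return ["".join(ch for ch, _ in run)
            for _, run in groupby(letters, key=lambda item: item[1])]


ops = [("Tj", "Hello"), ("Tj", "World")]

# Counter bumped on TJ only: every letter keeps sequence 0, one big group.
merged = group_by_sequence(assign_sequences(ops, count_tj_lower=False))

# Counter bumped on Tj as well: letters can be grouped per operator.
separate = group_by_sequence(assign_sequences(ops, count_tj_lower=True))
```

In a stream that uses only Tj, the first variant leaves every letter at sequence 0, which matches the symptom reported earlier in the thread.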

ivan-perezhogin-sto, Jan 31 '22 12:01

Okay, thanks for that. I'm not sure what the expected behaviour would be...

Anyway the first issue you raised still remains, and I don't know the solution. Did you look at how PdfBox handles that?

I'm putting below the official documentation about Tj/TJ for reference:

[image] [image]

BobLd, Jan 31 '22 12:01

It sounds like this is either a won't do or was resolved in some way? Closing for now

EliotJones, May 28 '23 17:05