
[RFE] Word spacing heuristic for text extraction

Open ceztko opened this issue 6 months ago • 6 comments

To my knowledge, "word spacing" in the specification is a rendering concept only, and there is no documented heuristic for determining where words end within a line of text during text extraction. While it's clear that the PDF specification supports fully semantic text only through document structure and tags, this is still a relevant problem: one may expect different implementations to be as consistent as possible when copying selected text or performing searches, operations that generally ignore tags even when present. For example, Adobe products seem to terminate words with a dynamic heuristic that weighs the length of the string chunks, while pdf.js uses a comparison with a fixed size. Consider, for example, the following content stream excerpt from the attached TestWordSpacing.pdf:

[(E)-125(E)]TJ      % "EE"
0 -20 TD
[(E)-126(E)]TJ      % "E E"
0 -20 TD
[(E)-152(N)]TJ      % "EN"
0 -20 TD
[(E)-153(N)]TJ      % "E N"
0 -20 TD
[(E)-194(M)]TJ      % "EM"
0 -20 TD
[(E)-195(M)]TJ      % "E M"
0 -20 TD
[(M)-194(E)]TJ      % "ME"
0 -20 TD
[(M)-195(E)]TJ      % "M E"
0 -20 TD
[(N)-152(E)]TJ      % "NE"
0 -20 TD
[(N)-153(E)]TJ      % "N E"
0 -20 TD
[(M)-194(N)]TJ      % "MN"
0 -20 TD
[(M)-195(N)]TJ      % "M N"
0 -20 TD
[(M)-194(M)]TJ      % "MM"
0 -20 TD
[(M)-195(M)]TJ      % "M M"
0 -20 TD
[(LLE)-143(EOO)]TJ  % "LLEEOO"
0 -20 TD
[(LLE)-144(EOO)]TJ  % "LLE EOO"
0 -20 TD
[(LLLLLLLLLLLLLLLE)-151(EOOOOOOOOOOOOOOOO)]TJ  % "LLLLLLLLLLLLLLLEEOOOOOOOOOOOOOOOO"
0 -20 TD
[(LLLLLLLLLLLLLLLE)-152(EOOOOOOOOOOOOOOOO)]TJ  % "LLLLLLLLLLLLLLLE EOOOOOOOOOOOOOOOO"

The comments show the text as copied from Adobe Reader. In contrast, pdf.js and PDFium always insert a space in place of the TJ operator advance. Foxit Reader seems to do a little better with shorter strings but fails similarly with only slightly longer chunks. The request here is to document a reference heuristic to determine word spacing for text extraction purposes, providing a counterpart for "word spacing" in rendering, which is a first-class concept in the specification.
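To make the "fixed size" approach concrete, here is a minimal Python sketch of it. It is not the code of pdf.js or any other implementation: the function name `extract_tj_text`, the default threshold, and the simplified tokenizer (plain literal strings only, no escapes, nested parentheses, or hex strings) are all assumptions made for illustration.

```python
import re

# Matches either a simple literal string "(...)" or a number in a TJ array.
# Assumption: no escaped or nested parentheses and no hex strings.
_TJ_TOKEN = re.compile(r"\(([^()\\]*)\)|(-?\d+(?:\.\d+)?)")


def extract_tj_text(tj_array: str, threshold: float = 100.0) -> str:
    """Join the strings of one TJ array, inserting a space wherever the
    numeric adjustment opens a gap larger than `threshold`
    (in thousandths of a text space unit). The threshold is hypothetical."""
    parts = []
    for match in _TJ_TOKEN.finditer(tj_array):
        text, number = match.groups()
        if text is not None:
            parts.append(text)
        elif parts and -float(number) > threshold:
            # In horizontal writing, a negative TJ number moves the next
            # glyph to the right, which widens the gap.
            parts.append(" ")
    return "".join(parts)


print(extract_tj_text("[(E)-125(E)]TJ", threshold=100))  # E E
print(extract_tj_text("[(E)-125(E)]TJ", threshold=150))  # EE
```

Whatever single threshold is chosen, it cannot reproduce the Adobe Reader results above, where the break point lies between 125 and 126 for E/E but between 194 and 195 for E/M.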

Attached are the full text extraction results as tested with the different implementations: TestExtractionImpls.pdf


ceztko avatar Jun 07 '25 23:06 ceztko

The request here is to document a reference heuristic to determine word spacing for text extraction purposes,

While I do understand that users may be interested in a consistent text extraction result, this IMO is more a request to PDF generators for creating properly marked PDFs than for documenting heuristics.

Heuristics by definition can be optimized/updated by taking more/newer input into account. Furthermore, heuristics may be optimized for specific kinds of inputs, e.g. scientific PDFs may generally be processed with better results by other heuristics than tabloids. Also different PDF processors may have access to different data than others, e.g. non-viewer text extractors don't necessarily have access to full font information.

Thus, by fixing a "reference heuristic" you create a reference that is bound to be sub-optimal and/or outdated the day it is published.

Nonetheless, it of course would be possible to collect ideas for best practices for space character inference during text extraction. This would not result in identical space character inference, though, but may result in improved text extraction results in general.

providing a counterpart for "word spacing" in rendering, which is a first-class concept in the specification.

The heuristics you ask for actually are not a counterpart of the "word spacing" in rendering: The latter concept relies on the use of space characters (more exactly, single byte 0x20 character codes denoting word breaks) in text drawing instructions while the heuristics you ask for mostly have to be applied for PDFs avoiding the use of such space characters (0x20 character codes) while drawing text.
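As an illustration of why rendering-side word spacing depends on 0x20 codes, the following Python sketch evaluates the horizontal displacement formula from the text positioning section of ISO 32000, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, where Tw contributes only for a single-byte character code 32. The function and parameter names are hypothetical, and the glyph widths in the usage lines are example values.

```python
def horizontal_advance(glyph_width: float, font_size: float, char_code: int,
                       single_byte_code: bool = True, tj_adjust: float = 0.0,
                       char_spacing: float = 0.0, word_spacing: float = 0.0,
                       horizontal_scaling: float = 1.0) -> float:
    """Horizontal displacement after showing one glyph.

    glyph_width and tj_adjust are in thousandths of a text space unit;
    horizontal_scaling is Th as a fraction (Tz / 100).
    """
    # Tw is applied only to the single-byte character code 32 (0x20).
    tw = word_spacing if (single_byte_code and char_code == 0x20) else 0.0
    w0 = glyph_width / 1000.0
    return ((w0 - tj_adjust / 1000.0) * font_size
            + char_spacing + tw) * horizontal_scaling


# A space glyph picks up the word spacing, any other glyph does not.
print(horizontal_advance(250, 10, 0x20, word_spacing=2))  # 4.5
print(horizontal_advance(500, 10, 0x45, word_spacing=2))  # 5.0
```

A generator that positions words purely with TJ adjustments never emits code 32, so Tw never comes into play and an extractor has nothing but the gaps to go on.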

mkl-public avatar Jun 09 '25 10:06 mkl-public

this IMO is more a request to PDF generators for creating properly marked PDFs than for documenting heuristics.

I understand, but frankly: my purpose when opening this issue was to publicly discuss heuristics for the general (no space characters, no ActualText/Alt) case, since even several major PDF implementations fail to mimic Adobe Acrobat's results.

Nonetheless, it of course would be possible to collect ideas for best practices for space character inference during text extraction. This would not result in identical space character inference, though, but may result in improved text extraction results in general.

Yes, that's where the discussion could go. Specifically, a suggested heuristic collecting these best practices could certainly be published outside of the specification, like several technical documents endorsed by the PDF Association.

The heuristics you ask for actually are not a counterpart of the "word spacing" in rendering: The latter concept relies on the use of space characters (more exactly, single byte 0x20 character codes denoting word breaks) ...

I see your point. Section 14.8.2.6.2, "Identifying word breaks", talks more specifically about semantic word breaking, but in the context of tagged PDFs.

ceztko avatar Jun 09 '25 12:06 ceztko

since even several major PDF implementations fail to mimic Adobe Acrobat's results.

That in my opinion is no failure per se: even though Acrobat is quite good in many aspects of PDF processing, it is not the best possible implementation of the spec. Thus, the best possible heuristics aren't necessarily Acrobat's.

mkl-public avatar Jun 09 '25 12:06 mkl-public

Thus, the best possible heuristics aren't necessarily Acrobat's.

We already talked about the same topic on SO a few years ago (that was me as well), and you certainly haven't changed your opinion. Looking at this problem in particular, and at the attached test file, there is certainly much more sophistication in Adobe's dynamic heuristic than in the fixed separation size used by pdf.js. It also seems that the space "glyph" size is taken into account, something I do in PoDoFo as well; it is quite small in this font (250 thousandths of a unit, versus 500 thousandths of a unit for the character 'E'). That is to say, while we don't necessarily have to reverse engineer Acrobat, in this particular case its word breaking seems considerably better than that of the other major implementations I examined. Please let me know if you know of other open source approaches I could have a look at.
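A minimal Python sketch of a threshold that scales with the space glyph width, under stated assumptions: the name `is_word_break` and the 0.5 ratio are hypothetical, and this is not a description of what Acrobat or PoDoFo actually do.

```python
def is_word_break(tj_adjust: float, space_width: float = 250.0,
                  ratio: float = 0.5) -> bool:
    """Report a word break when the gap opened by a TJ adjustment exceeds
    a fraction of the font's space glyph width.

    tj_adjust and space_width are in thousandths of a text space unit.
    The default ratio of 0.5 is an assumption for illustration.
    """
    gap = -tj_adjust  # a negative TJ number widens the gap
    return gap > ratio * space_width


print(is_word_break(-125))  # False
print(is_word_break(-126))  # True
print(is_word_break(-152))  # True
```

With a 250-unit space, half the space width happens to coincide with the E/E break point in the test excerpt from the first comment (125 versus 126). It does not match the E/N or E/M pairs, where Adobe Reader still reports no break at 152 and 194, which suggests that the neighbouring glyphs are weighed as well.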

ceztko avatar Jun 09 '25 13:06 ceztko

I'm not aware of any open source implementation that particularly excels at space inference. I can merely imagine some interesting aspects of a document that one should take into account. In particular, one should not only take the space glyph size into account but also the nominal widths in comparison to the actual widths of other glyphs in the same text line. Furthermore, if the PDF generator can be identified, you can apply specific strategies that you derived for this very generator, at least for a number of generators in wide use. Additionally, you can apply dictionary analysis: is the position in question the likely end of a word?
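The dictionary idea could be sketched in Python as a tie-breaker for gaps that geometry alone leaves ambiguous. Everything here is hypothetical: the name `infer_break`, the `low`/`high` ratios, and the tiny word list are assumptions, and the line-relative width comparison and generator-specific strategies are left out.

```python
def infer_break(left: str, right: str, gap: float, space_width: float,
                known_words: set, low: float = 0.3, high: float = 0.8) -> bool:
    """Decide whether a gap between two text chunks is a word break.

    gap and space_width are in thousandths of a text space unit.
    Clearly small or clearly large gaps are decided by geometry alone;
    only the ambiguous middle band consults the dictionary.
    """
    ratio = gap / space_width
    if ratio <= low:
        return False
    if ratio >= high:
        return True
    left_l, right_l = left.lower(), right.lower()
    if left_l + right_l in known_words:
        return False  # the joined chunks form a known word
    if left_l in known_words and right_l in known_words:
        return True   # both chunks are words on their own
    return ratio >= 0.5  # no dictionary evidence: fall back to geometry


words = {"text", "extract", "extraction"}
print(infer_break("text", "extraction", 120, 250, words))  # True
print(infer_break("extrac", "tion", 120, 250, words))      # False
```

The band limits decide how much the dictionary is trusted; a wider band leans more on vocabulary, which helps with kerned text but depends on having a word list for the document's language.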

mkl-public avatar Jun 09 '25 19:06 mkl-public

Word spacing is only "first class in the specification" for the purposes of the Tw operator to enable word spacing control while rendering. Anything beyond that is outside the previous and current PDF specifications/standards, with various implementations going to lesser and greater efforts to do a better job and gain a competitive advantage.

There have been previous discussions, both in ISO and here in the PDF Association, about improving information on text extraction/reuse, but in non-normative ways. So far, no one has stepped forward to volunteer to champion this, as it is a huge topic given the complexities of typesetting across all languages of the world! (Even Google Translate gets spacing wrong for Korean!)

petervwyatt avatar Jun 10 '25 03:06 petervwyatt