Erroneous Whitespace in Text Extraction
Erroneous whitespace is added within words during text extraction. The errors are inconsistent and not always possible (or ever easy) to resolve post-extraction. For example, the text WORD NAME: MISSION STORE MONITOR.RESERVED (verified by copying and pasting from Acrobat) is extracted as WO RD NAME: M ISSION S TORE MONITOR.RESERVED
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.10.228-219.884.amzn2.x86_64-x86_64-with-glibc2.35
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.3.1, crypt_provider=('cryptography', '3.4.8'), PIL=10.2.0
Code + PDF
This is a minimal, complete example that shows the issue:
reader = pypdf.PdfReader('./spacey-clean.pdf')
page, = reader.pages
page.extract_text()
' 01 January 1969 \n \n5-\n208 \nABCD-01234-012 Revision A \n5.16.6 WO RD NAME: MISSION STORE MONITOR.RESERVED \n CATEGORY: N/A \nWORD ID: MAX VALUE: N/A \nSOURCE(s): MIN VALUE: N/A \nDEST(s): RESOLUTION: N/A \nCOMP RATE: ACCURACY: N/A \nXMIT RATE: MSB: N/A \nSIGNAL TYPE: LSB: N/A \nUNITS: \n69Q-04...20 \nWeapon \nA/C or Carriage System \nN/A \nAperiodic \nN/A \nN/A \nFULL SCALE: N/A \n FIELD NAME BIT NO. DESCRIPTION \n \nReserved - 00 -0 \n \n - 01 -0 \n \n - 02 -0 \n \n - 03 -0 \n \n - 04 -0 \n \n - 05 -0 \n \n - 06 -0 \n \n - 07 -0 \n \n - 08 -0 \n \n - 09 -0 \n \n - 10 -0 \n \n - 11 -0 \n \n - 12 -0 \n \n - 13 -0 \n \n - 14 -0 \n \n - 15 -0 \n \nREMARKS/NOTES: \n1. Reserved per MIL-STD-1760 \n \n \n '
Notice the whitespace within "WORD". The issue gets worse when extraction_mode='layout':
page.extract_text(extraction_mode='layout')
'ABCD-01234-012 Revision A 01 January 1969\n\n5.16.6 WO RD NAME: M ISSION S TORE MONITOR.RESERVED\n CATEGORY: N/A\n WORD ID: MAX VALUE: N/A69Q-04...20\n SOURCE(s): MIN VALUE: N/AWeapon\n DEST(s): RESOLUTION: N/AA/C or Ca rriage S ystem\n COMP RAT E: ACCURACY: N/AN/A\n XMIT RATE: MSB: N/AAperiodic\n SIGNAL TYPE: LSB: N/AN/A\n UNITS: N/A FULL SCALE: N/A\n FIELD NAME BIT NO. DESCRIPTION\n\n Reserved - 00 -0\n\n - 01 -0\n\n - 02 -0\n\n - 03 -0\n\n - 04 -0\n\n - 05 -0\n\n - 06 -0\n\n - 07 -0\n\n - 08 -0\n\n - 09 -0\n\n - 10 -0\n\n - 11 -0\n\n - 12 -0\n\n - 13 -0\n\n - 14 -0\n\n - 15 -0\n\n REMARKS/NOTES:\n 1. Reserved per MIL -STD-1760\n\n\n\n\n\n\n\n\n\n\n\n\n\n 5-208'
Not only are there more whitespace errors, but the horizontal spacing is also not representative of the source document.
I have modified the original text of the document to make it publicly releasable. Feel free to use it in tests or ask for more examples. I have something around 100k pages of examples. spacey-clean.pdf
Thanks for the report. As mentioned inside the docs, text extraction is very hard and probably never perfect for all cases. Apart from playing with the parameters available, there is not much we are able to do out of the box.
You are of course invited to further investigate this and propose a corresponding PR nevertheless.
Many of these issues show up within a TJ operator with unusual character spacing, but at least within extraction_mode='plain', I wouldn't expect character spacing to be a good reason to insert whitespace. I think that treating a TJ like a Tj and ignoring character spacing would be correct. But if I am misunderstanding the usage of TJ or the purpose of the 'plain' extraction mode, I agree the solution is not obvious.
The other instance where these issues appear is when a Td seems to do nothing. In this case it's a "new line" with 0 y offset, and the x offset puts the start of the next string right at the end of the previous one (at least visually).
print(page.get_contents().operations)
[
...,
(['/TT0', 12], b'Tf'),
([0], b'Tw'),
([' WO'], b'Tj'),
([-0.024], b'Tc'),
([0.024], b'Tw'),
([33.356, 0], b'Td'),
([['RD NAM', 2.167, 'E', -5, ':']], b'TJ'),
...,
]
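To make the "treat a TJ like a Tj" idea concrete, here is a minimal sketch. It is not pypdf's actual code. It walks a list shaped like the dump above and concatenates the string pieces of Tj and TJ, skipping the numeric adjustments inside TJ arrays and ignoring Tc, Tw and Td entirely. The function name naive_plain_text is hypothetical, the sample list is copied by hand from the dump, and I am assuming the string operands behave like Python str.

```python
def naive_plain_text(operations):
    """Concatenate Tj/TJ strings, ignoring all spacing and positioning."""
    parts = []
    for operands, operator in operations:
        if operator == b"Tj":
            # Tj has a single string operand.
            parts.append(operands[0])
        elif operator == b"TJ":
            # TJ has one array operand mixing strings and numeric adjustments;
            # keep only the strings.
            parts.extend(item for item in operands[0] if isinstance(item, str))
    return "".join(parts)


# Hand-copied from the operations dump in this issue.
sample_ops = [
    (["/TT0", 12], b"Tf"),
    ([0], b"Tw"),
    ([" WO"], b"Tj"),
    ([-0.024], b"Tc"),
    ([0.024], b"Tw"),
    ([33.356, 0], b"Td"),
    ([["RD NAM", 2.167, "E", -5, ":"]], b"TJ"),
]

print(repr(naive_plain_text(sample_ops)))  # ' WORD NAME:'
```

This obviously goes too far for real documents, since a large Td move or a large TJ adjustment can be a genuine word gap, but it shows that the strings themselves contain no space inside "WORD".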
Either way, it seems odd that Acrobat and pypdf give different results -- I think because of lines:
tm_matrix[4] += tx * tm_matrix[0] + ty * tm_matrix[2]
tm_matrix[5] += tx * tm_matrix[1] + ty * tm_matrix[3]
in pypdf._page:PageObject._extract_text.locals().process_operation.
Before and after these lines, the tm_matrix is [1.0, 0.0, 0.0, 1.0, 66.0, 717.24] and [1.0, 0.0, 0.0, 1.0, 66.0, 717.24], respectively. It appears that the text is offset from the current x position rather than from the start of the current line. But a cheat sheet I found online here says:
Move to the start of the next line, offset from the start of the current line by (tx, ty). tx and ty are numbers expressed in unscaled text space units.
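As a rough illustration of why the two readings of Td differ, here is a small sketch that is independent of pypdf. It keeps a separate line matrix (start of the current line) and text matrix (current position), and applies the same Td offset to each. The helper name translate and the glyph_advance value of 33.0 are assumptions made up for this example; only the starting matrix and the Td operands come from the output above.

```python
def translate(matrix, tx, ty):
    """Apply a (tx, ty) translation in text space to [a, b, c, d, e, f]."""
    a, b, c, d, e, f = matrix
    return [a, b, c, d, e + tx * a + ty * c, f + tx * b + ty * d]


# Start of the current line, taken from the tm_matrix quoted in this issue.
line_matrix = [1.0, 0.0, 0.0, 1.0, 66.0, 717.24]

# Showing " WO" advances the current position but not the line start.
glyph_advance = 33.0  # hypothetical width of " WO", not measured from the PDF
text_matrix = translate(line_matrix, glyph_advance, 0)

tx, ty = 33.356, 0  # operands of the Td in the dump

# Reading 1: offset from the start of the current line (the quoted definition).
from_line_start = translate(line_matrix, tx, ty)
# Reading 2: offset from the current position after the previous string.
from_current_pos = translate(text_matrix, tx, ty)

# Horizontal gap between the end of " WO" and the start of the next string.
gap_line = from_line_start[4] - text_matrix[4]
gap_cur = from_current_pos[4] - text_matrix[4]
print(round(gap_line, 3), round(gap_cur, 3))
```

Under these assumed numbers the first reading leaves a gap of about 0.356 units, which would not look like a space, while the second leaves about 33.356 units, which would.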
As mentioned before, getting everything correct is nearly impossible, as PDF files are usually meant for viewing, not automated content processing, and thus lots of special cases might be involved. If you think you have found some possible bug or room for improvement here, you are always invited to propose corresponding PRs.