Fix stream change mid token
Fixes #1157
Update psparser so that it tries to complete an in-progress token by tacking on whitespace when the stream changes in the middle of that token. Follows the approach of #1030, but for this new edge case at a stream boundary.
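As a toy illustration of the idea (my own sketch with made-up names, not the actual psparser code): a token that straddles a stream boundary should be terminated there, as if the boundary were a whitespace character, rather than glued onto the first bytes of the next stream.

```python
from typing import Iterable, List

WHITESPACE = b" \t\r\n\f\x00"


def tokenize_streams(streams: Iterable[bytes]) -> List[bytes]:
    """Toy lexer: split a sequence of content streams into whitespace-separated
    tokens, terminating any in-progress token at each stream boundary."""
    tokens: List[bytes] = []
    current = b""
    for data in streams:
        # Appending b" " closes off any token still in progress when the
        # stream ends, which is the whitespace-tacking idea of this PR.
        for byte in data + b" ":
            ch = bytes([byte])
            if ch in WHITESPACE:
                if current:
                    tokens.append(current)
                    current = b""
            else:
                current += ch
    return tokens


# The "1" ending the first stream stays a separate operand from the "2"
# that begins the second stream, instead of being merged into "12".
assert tokenize_streams([b"BT 1", b"2 0 Td"]) == [b"BT", b"1", b"2", b"0", b"Td"]
```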
How Has This Been Tested?
Ran the repro script from the linked issue. Observed that the coordinates are correct.
Checklist
- [x] I have read CONTRIBUTING.md.
- [x] I have added a concise human-readable description of the change to CHANGELOG.md.
- [x] I have tested that this fix is effective or that this feature works.
- [x] I have added docstrings to newly created methods and classes.
- [x] I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.
This seems simple and correct, even though I took a solemn vow to never look at or touch pdfminer.psparser again at the cost of my mortal soul.
If I understand correctly, you have simply added a return value to fillfp that indicates whether the stream has changed; when it has, a token is emitted using my hack of tacking on some whitespace. This works because the (useless and inefficient, I am compelled to add) buffering done by the parser will always empty the buffer between streams (https://github.com/pdfminer/pdfminer.six/pull/1158/files#diff-29915b7598fe1f537a2c838603581c6d412ba2bd7b721b5d05a4794c6b0a538cL211).
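In other words, roughly this shape (a toy analogue with hypothetical names, not the actual pdfminer.six code): the buffer-refill step reports whether it crossed into the next stream, and the caller uses that to flush the in-progress token by pushing a whitespace byte through the tokenizer.

```python
import io
from typing import List


class MultiStreamReader:
    """Toy analogue (hypothetical, simplified) of the parser's buffering
    over a sequence of content streams."""

    BUFSIZ = 4096

    def __init__(self, streams: List[bytes]) -> None:
        self._fps = [io.BytesIO(s) for s in streams]
        self._index = 0
        self.buf = b""

    def fillbuf(self) -> bool:
        """Refill the buffer; return True if doing so moved us on to the
        next stream, so the caller can terminate any in-progress token."""
        changed = False
        while not self.buf and self._index < len(self._fps):
            self.buf = self._fps[self._index].read(self.BUFSIZ)
            if not self.buf:
                # Current stream exhausted: advance to the next one.
                self._index += 1
                changed = self._index < len(self._fps)
        return changed


# Caller sketch: when fillbuf() reports a stream change, push one whitespace
# byte through the tokenizer before lexing the new buffer, so that a token
# straddling the boundary is emitted rather than extended.
```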
I suspect this bug is rarely triggered because even though the standard actually says the final newline in a stream "SHALL" (not "should") not be counted in the stream length, it often is, and so, you get the delimiter for free.
I would suggest testing this on a bunch of PDFs if possible, at least the collection of test cases in pdfplumber, to make sure it doesn't introduce any regressions...
Thanks for the quick review! Yes, all this does is add a return value to fillfp and fillbuf exactly as you've described.
I'll test this against pdfplumber and report back.
@dhdaines The pdfplumber suite uncovered a bug where the first byte of a new stream was dropped. After fixing that in b615f5d, the pdfplumber tests are also passing.
python -m pytest
======================================================================================================================================= test session starts =======================================================================================================================================
platform darwin -- Python 3.13.4, pytest-8.4.1, pluggy-1.6.0
rootdir: pdfplumber
configfile: setup.cfg
plugins: cov-6.2.1
collected 171 items
tests/test_basics.py ..................... [ 12%]
tests/test_ca_warn_report.py ..... [ 15%]
tests/test_convert.py ................ [ 24%]
tests/test_ctm.py . [ 25%]
tests/test_dedupe_chars.py ..... [ 28%]
tests/test_display.py ................ [ 37%]
tests/test_issues.py ........................ [ 51%]
tests/test_laparams.py .... [ 53%]
tests/test_list_metadata.py . [ 54%]
tests/test_mcids.py . [ 54%]
tests/test_nics_report.py ..... [ 57%]
tests/test_oss_fuzz.py . [ 58%]
tests/test_repair.py ....... [ 62%]
tests/test_structure.py ................ [ 71%]
tests/test_table.py ............ [ 78%]
tests/test_utils.py .................................... [100%]
========================================================================================================================================= tests coverage ==========================================================================================================================================
________________________________________________________________________________________________________________________ coverage: platform darwin, python 3.13.4-final-0 _________________________________________________________________________________________________________________________
Name                               Stmts   Miss  Cover
------------------------------------------------------
pdfplumber/__init__.py                 8      0   100%
pdfplumber/_typing.py                  9      0   100%
pdfplumber/_version.py                 2      0   100%
pdfplumber/cli.py                     69      0   100%
pdfplumber/container.py              107      0   100%
pdfplumber/convert.py                 56      0   100%
pdfplumber/ctm.py                     27      0   100%
pdfplumber/display.py                159      0   100%
pdfplumber/page.py                   353      0   100%
pdfplumber/pdf.py                    120      0   100%
pdfplumber/repair.py                  28      0   100%
pdfplumber/structure.py              310      0   100%
pdfplumber/table.py                  329      0   100%
pdfplumber/utils/__init__.py           5      0   100%
pdfplumber/utils/clustering.py        38      0   100%
pdfplumber/utils/exceptions.py         4      0   100%
pdfplumber/utils/generic.py           11      0   100%
pdfplumber/utils/geometry.py         128      0   100%
pdfplumber/utils/pdfinternals.py      55      0   100%
pdfplumber/utils/text.py             315      0   100%
------------------------------------------------------
TOTAL                               2133      0   100%
Coverage XML written to file coverage.xml
====================================================================================================================================== 171 passed in 31.32s =======================================================================================================================================
In addition, I ran pdf2txt.py on the master branch and on this change on a batch of 290 confidential PDFs using the following script:
import os
import glob
import subprocess
from concurrent.futures import ProcessPoolExecutor, as_completed
import argparse


def process_pdf(pdf_path):
    command = ['python', '-m', 'tools.pdf2txt', pdf_path]
    try:
        subprocess.run(command, check=True, capture_output=True, text=True)
        return True, pdf_path, None
    except subprocess.CalledProcessError as e:
        return False, pdf_path, e.stderr


def main():
    parser = argparse.ArgumentParser(description="Process PDF files in a directory.")
    parser.add_argument("root_dir", help="The root directory to search for PDF files.")
    args = parser.parse_args()

    root_dir = args.root_dir
    pdf_files = glob.glob(os.path.join(root_dir, '**', '*.pdf'), recursive=True)

    if not pdf_files:
        print("No PDF files found.")
        return

    print(f"Found {len(pdf_files)} PDF files to process.")
    failed_count = 0

    with ProcessPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(process_pdf, pdf_file): pdf_file for pdf_file in pdf_files}
        for future in as_completed(futures):
            success, pdf_path, error_message = future.result()
            if success:
                print(f"Successfully processed: {pdf_path}")
            else:
                failed_count += 1
                print(f"Failed to process: {pdf_path}")
                if error_message:
                    print(f"  Error: {error_message.strip()}")

    print(f"\nProcessing complete.")
    print(f"Total files processed: {len(pdf_files)}")
    print(f"Successful: {len(pdf_files) - failed_count}")
    print(f"Failures: {failed_count}")


if __name__ == "__main__":
    main()
The output of both branches was:
Processing complete.
Total files processed: 290
Successful: 290
Failures: 0
Great! I'm not the actual maintainer of pdfminer so I can't approve this, but it does look good to me :)
> I suspect this bug is rarely triggered because even though the standard actually says the final newline in a stream "SHALL" (not "should") not be counted in the stream length, it often is, and so, you get the delimiter for free.

Actually, ISO 32000, Table 31, Contents entry is the only place in PDF where an array of content streams is explicitly defined and supported and it formally states "If the value is an array, the effect shall be as if all of the streams in the array were concatenated with at least one white-space character added between the streams’ data, in order, to form a single stream."
> Actually, ISO 32000, Table 31, Contents entry is the only place in PDF where an array of content streams is explicitly defined and supported and it formally states "If the value is an array, the effect shall be as if all of the streams in the array were concatenated with at least one white-space character added between the streams’ data, in order, to form a single stream."

Exactly! Thanks also for the clarification in PDF 2.0, because PDF 1.7 (ISO 32000-2008) doesn't mention the extra whitespace character.
But I was making a point about real-world PDFs which often have a /Length value that includes the extra newline even though it isn't supposed to do that.
> Thanks also for the clarification in PDF 2.0, because PDF 1.7 (ISO 32000-2008) doesn't mention the extra whitespace character.

That's because it was fast-tracked... but it's always been like this.
Note that there is still some ambiguity in the wording as the whitespace needs to be between tokens (i.e. not in the middle of a string, whether it be a literal or hex string!).
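A tiny illustration of that ambiguity (my own example, not taken from the spec or this PR): if a producer split the content inside a literal string, treating the boundary as whitespace would silently change the string.

```python
# Two fragments that together encode the literal string operand (Hello).
stream_a = b"BT /F1 12 Tf (Hel"
stream_b = b"lo) Tj ET"

# Treating the boundary as whitespace, as Table 31 describes for tokens,
# would alter the string's contents here:
print(stream_a + b" " + stream_b)  # b'BT /F1 12 Tf (Hel lo) Tj ET'
print(stream_a + stream_b)         # b'BT /F1 12 Tf (Hello) Tj ET'
```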