Fix handling of empty lists and malformed PDF dictionary values

Open pbottine opened this issue 1 month ago • 0 comments

This fixes issue #12 where certain malformed PDFs would cause "List index out of range" errors during parsing.

Changes to PDFList.load():

  • Handle empty lists by returning a zero-length wrapper at offset 0
  • Filter items to only those with offset information before calculating bounds
  • Log warnings when items lack position data instead of using incorrect defaults
  • Change from @staticmethod to @classmethod for better conventions
  • Add comprehensive docstring explaining edge case behavior
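
The load() behavior listed above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `Item` and `SketchList` are hypothetical stand-ins for polyfile's real types, and raising ValueError when no item has an offset is an assumption about what counts as "truly malformed".

```python
import logging
from dataclasses import dataclass
from typing import Any, List, Optional

log = logging.getLogger(__name__)


@dataclass
class Item:
    """Hypothetical parsed value; offset is None when position is unknown."""
    value: Any
    offset: Optional[int] = None
    length: int = 0


@dataclass
class SketchList:
    """Hypothetical stand-in for PDFList: items plus the byte span they cover."""
    items: List[Item]
    offset: int
    length: int

    @classmethod
    def load(cls, items: List[Item]) -> "SketchList":
        # Empty input: zero-length wrapper at offset 0.
        if not items:
            return cls([], offset=0, length=0)
        # Only items that carry position data take part in the bounds.
        positioned = [i for i in items if i.offset is not None]
        if len(positioned) < len(items):
            log.warning("%d of %d items lack offset information",
                        len(items) - len(positioned), len(items))
        # Assumption: a list with no positioned items at all is malformed.
        if not positioned:
            raise ValueError("no list item carries offset information")
        start = min(i.offset for i in positioned)
        end = max(i.offset + i.length for i in positioned)
        return cls(list(items), offset=start, length=end - start)
```

With this sketch, `SketchList.load([Item("a", 10, 3), Item("b"), Item("c", 20, 5)])` logs one warning and covers bytes 10 through 25; the item without an offset is kept in the list but does not affect the bounds.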

Changes to parse_object() dictionary handling:

  • Fix logic ordering: check isinstance(value, list) before checking emptiness. This prevents skipping falsy but valid values like 0, False, or empty strings
  • Gracefully skip empty lists in dictionaries (log debug message)
  • Catch ValueError from PDFList.load() for truly malformed lists
  • Log warnings instead of raising exceptions for unexpected values
  • Continue parsing to extract maximum data from malformed PDFs
  • Update dictionary to keep it self-consistent after wrapping lists
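
A minimal sketch of the check ordering described above, under stated assumptions: `wrap_lists` and `toy_load` are hypothetical names used only for illustration, not functions from parse_object(), and "skip" is taken here to mean "leave the value in place unwrapped".

```python
import logging
from typing import Any, Callable, Dict

log = logging.getLogger(__name__)


def wrap_lists(d: Dict[str, Any], load: Callable[[list], Any]) -> Dict[str, Any]:
    """Wrap list values in place; leave every other value untouched."""
    for key, value in list(d.items()):
        # Type check first: 0, False and "" are falsy but valid, so they
        # must never reach the emptiness test below.
        if not isinstance(value, list):
            continue
        if not value:
            log.debug("skipping empty list for key %r", key)
            continue
        try:
            # Write the wrapper back so the dictionary stays self-consistent.
            d[key] = load(value)
        except ValueError as e:
            # Malformed list: warn and keep parsing the remaining keys.
            log.warning("could not wrap list for key %r: %s", key, e)
    return d


def toy_load(values: list) -> tuple:
    """Hypothetical loader: rejects lists containing strings."""
    if any(isinstance(v, str) for v in values):
        raise ValueError("unexpected string in list")
    return tuple(values)
```

Calling `wrap_lists({"Count": 0, "Kids": [], "Good": [1, 2]}, toy_load)` keeps `Count` as 0 and `Kids` as an empty list, and replaces `Good` with the wrapped value. Reversing the two checks (`if not value` first) would treat `Count` the same as an empty list, which is the bug the first bullet describes.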

Testing:

  • Added comprehensive unit tests in tests/test_pdf.py covering:
    • Empty lists
    • Lists with/without offset information
    • Mixed offset scenarios
    • Empty lists in dictionaries
    • Malformed list values
    • Unexpected dictionary values
    • Preservation of falsy but valid values (0, False)
  • All 8 new tests pass
  • All 23 existing unit tests continue to pass
  • PDF parsing verified with testdata/javascript.pdf

This builds on the approach from PR #3426 by @mrscottyrose, with corrections to logic ordering and offset calculation, plus comprehensive test coverage.

Fixes #12

cc: @smoelius

🤖 Generated with Claude Code

pbottine, Nov 26 '25 20:11