polyfile
Fix handling of empty lists and malformed PDF dictionary values
This fixes issue #12 where certain malformed PDFs would cause "List index out of range" errors during parsing.
Changes to PDFList.load():
- Handle empty lists by returning a zero-length wrapper at offset 0
- Filter items to only those with offset information before calculating bounds
- Log warnings when items lack position data instead of using incorrect defaults
- Change from @staticmethod to @classmethod for better conventions
- Add comprehensive docstring explaining edge case behavior
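For reviewers, the bounds calculation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Item` and `list_bounds` are hypothetical names, and the fallback to `(0, 0)` when no item has position data is an assumption made for the example.

```python
import logging
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple

log = logging.getLogger(__name__)


@dataclass
class Item:
    """Hypothetical stand-in for a parsed PDF value (not polyfile's real type)."""
    value: Any
    offset: Optional[int] = None  # byte offset in the file, None if unknown
    length: int = 0               # byte length in the file


def list_bounds(items: List[Item]) -> Tuple[int, int]:
    """Return (offset, length) spanning every item that has position data."""
    if not items:
        # Empty list: zero-length span at offset 0.
        return 0, 0

    # Only items with a known offset take part in the calculation.
    positioned = [item for item in items if item.offset is not None]
    if len(positioned) < len(items):
        log.warning("%d list item(s) lack offset information",
                    len(items) - len(positioned))
    if not positioned:
        # Assumption for this sketch: fall back to the empty-list result.
        return 0, 0

    start = min(item.offset for item in positioned)
    end = max(item.offset + item.length for item in positioned)
    return start, end - start
```

Indexing the first and last elements directly, without the empty-list guard, is the kind of access that raises "List index out of range".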
Changes to parse_object() dictionary handling:
- Fix logic ordering: check isinstance(value, list) before checking emptiness. This prevents skipping falsy but valid values like 0, False, or empty strings.
- Gracefully skip empty lists in dictionaries (log debug message)
- Catch ValueError from PDFList.load() for truly malformed lists
- Log warnings instead of raising exceptions for unexpected values
- Continue parsing to extract maximum data from malformed PDFs
- Update dictionary to keep it self-consistent after wrapping lists
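The ordering fix is easiest to see in a small sketch. The code below is illustrative only: `wrap_list` is a hypothetical stand-in for `PDFList.load()`, `normalize` is a hypothetical name, and leaving empty or malformed lists unwrapped in the result is an assumption about what "skip" means here.

```python
import logging
from typing import Any, Dict

log = logging.getLogger(__name__)


def wrap_list(value: list) -> tuple:
    """Hypothetical stand-in for PDFList.load(); treats None entries as malformed."""
    if any(v is None for v in value):
        raise ValueError("malformed list")
    return tuple(value)


def normalize(d: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap list values, leaving every other value (including falsy ones) intact."""
    out: Dict[str, Any] = {}
    for key, value in d.items():
        # Type check first: a bare `if not value` here would also match
        # 0, False, and "" and wrongly skip them.
        if isinstance(value, list):
            if not value:
                log.debug("skipping empty list for key %r", key)
                out[key] = value  # assumption: keep the empty list unwrapped
                continue
            try:
                out[key] = wrap_list(value)
            except ValueError:
                log.warning("malformed list for key %r, leaving as-is", key)
                out[key] = value
            continue
        out[key] = value
    return out
```

With this ordering, `normalize({"count": 0, "flag": False, "kids": [1, 2]})` keeps `count` and `flag` and wraps only `kids`.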
Testing:
- Added comprehensive unit tests in tests/test_pdf.py covering:
  - Empty lists
  - Lists with/without offset information
  - Mixed offset scenarios
  - Empty lists in dictionaries
  - Malformed list values
  - Unexpected dictionary values
  - Preservation of falsy but valid values (0, False)
- All 8 new tests pass
- All 23 existing unit tests continue to pass
- PDF parsing verified with testdata/javascript.pdf
This builds on the approach from PR #3426 by @mrscottyrose with corrections to logic ordering, offset calculation, and comprehensive test coverage.
Fixes #12
cc: @smoelius
🤖 Generated with Claude Code