papermage
Blank pages in pdf lead to the wrong number of pages
I was working with a document that triggered this error in papermage/rasterizers/rasterizer.py:
raise ValueError(f"Failed to attach. {len(images)} images != {len(pages)} pages in doc.")
After some debugging, I found that the cause is a blank page in my PDF. The code in papermage/parsers/pdfplumber_parser.py that builds the pages does so by grouping the extracted tokens by page ID, so a page with no tokens never shows up as a group. The blank page is skipped, and the page_annos list ends up with fewer page objects than the PDF actually has.
https://github.com/allenai/papermage/blob/6a0a4a2fbb9dc5b1503afe2301a937405b504cb1/papermage/parsers/pdfplumber_parser.py#L338
for page_id, tups in itertools.groupby(iterable=tokens_with_group_ids, key=lambda tup: tup[2]):
    page_tokens = [token for token, _, _ in tups]
    page_w, page_h, page_unit = dims[page_id]
    page = Entity(
        spans=[
            Span(
                start=page_tokens[0].spans[0].start,
                end=page_tokens[-1].spans[0].end,
            )
        ],
        boxes=[Box.create_enclosing_box(boxes=[box for t in page_tokens for box in t.boxes])],
        metadata=Metadata(width=page_w, height=page_h, user_unit=page_unit),
    )
    page_annos.append(page)
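To illustrate the problem and one possible direction for a fix, here is a rough, self-contained sketch. It does not use papermage's real Entity, Span, or Box classes: PageSketch and build_pages are hypothetical stand-ins, and tokens are reduced to (start, end, page_id) triples. It assumes dims holds one entry per PDF page, blank pages included, and that a zero-length span is an acceptable way to represent a blank page. I have not checked whether the rest of papermage accepts such a span.

```python
import itertools
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageSketch:
    """Hypothetical stand-in for a papermage page Entity."""
    page_id: int
    start: int
    end: int
    width: float
    height: float


def build_pages(
    tokens_with_page_ids: List[Tuple[int, int, int]],
    dims: List[Tuple[float, float]],
) -> List[PageSketch]:
    """Return one page per entry in dims, including pages with no tokens.

    tokens_with_page_ids holds (start, end, page_id) triples sorted by page_id.
    """
    # groupby only yields page ids that actually occur among the tokens,
    # so a blank page is simply missing from this dict.
    tokens_by_page = {
        page_id: list(group)
        for page_id, group in itertools.groupby(tokens_with_page_ids, key=lambda t: t[2])
    }

    pages = []
    cursor = 0  # end offset of the most recent page that had tokens
    # Iterate over the page dimensions instead of the token groups,
    # so every page in the PDF gets an entry.
    for page_id, (width, height) in enumerate(dims):
        page_tokens = tokens_by_page.get(page_id, [])
        if page_tokens:
            start, end = page_tokens[0][0], page_tokens[-1][1]
            cursor = end
        else:
            # Blank page: emit a zero-length span at the current offset (assumption).
            start = end = cursor
        pages.append(PageSketch(page_id, start, end, width, height))
    return pages


# Three-page document where page 1 (the middle page) is blank.
tokens = [(0, 5, 0), (6, 11, 0), (12, 20, 2)]
dims = [(612.0, 792.0)] * 3
pages = build_pages(tokens, dims)
```

With the original groupby loop this input would produce only two pages. Driving the loop from dims keeps the page count equal to the number of rasterized images.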
Some further modifications may be needed here to deal with this rare case. Thank you.