
Blank pages in a PDF lead to the wrong number of pages

Open niuzaisheng opened this issue 4 months ago • 1 comments

I was processing a document that triggered this error in papermage/rasterizers/rasterizer.py:

raise ValueError(f"Failed to attach. {len(images)} images != {len(pages)} pages in doc.")

After some debugging, I found the cause: my PDF contains a blank page. The code below, in papermage/parsers/pdfplumber_parser.py, builds the page list by grouping the tokens by page ID, so a page with no tokens never produces a group. The blank page is skipped, and the page_annos list ends up with fewer page objects than the PDF actually has.

https://github.com/allenai/papermage/blob/6a0a4a2fbb9dc5b1503afe2301a937405b504cb1/papermage/parsers/pdfplumber_parser.py#L338

        for page_id, tups in itertools.groupby(iterable=tokens_with_group_ids, key=lambda tup: tup[2]):
            page_tokens = [token for token, _, _ in tups]
            page_w, page_h, page_unit = dims[page_id]
            page = Entity(
                spans=[
                    Span(
                        start=page_tokens[0].spans[0].start,
                        end=page_tokens[-1].spans[0].end,
                    )
                ],
                boxes=[Box.create_enclosing_box(boxes=[box for t in page_tokens for box in t.boxes])],
                metadata=Metadata(width=page_w, height=page_h, user_unit=page_unit),
            )
            page_annos.append(page)

Some further modifications may be needed here to deal with this rare case. Thank you.
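
One possible direction, as a rough sketch only: iterate over every page index in dims instead of over the token groups, and emit an empty page when no tokens exist for that index. The sketch below does not use papermage classes. PageStub and build_pages are hypothetical names, tokens are simplified to (start, end, page_id) tuples, and it assumes dims holds one entry per physical page, blank pages included. Whether the rest of the pipeline accepts a zero-length span and a page without an enclosing box is something I have not checked.

```python
import itertools
from dataclasses import dataclass
from typing import List, Tuple

# Simplified token: (span_start, span_end, page_id). Hypothetical stand-in
# for the (token, line_id, page_id) tuples used in pdfplumber_parser.py.
Token = Tuple[int, int, int]
# (width, height, user_unit) for one page, mirroring dims[page_id] above.
Dim = Tuple[float, float, float]


@dataclass
class PageStub:
    """Hypothetical stand-in for the page Entity built in the parser."""
    page_id: int
    start: int
    end: int
    width: float
    height: float
    user_unit: float


def build_pages(tokens: List[Token], dims: List[Dim]) -> List[PageStub]:
    """Build one page per entry in dims, including pages without tokens.

    Assumes tokens are sorted by page_id and dims covers every physical page.
    """
    # Group tokens by page, as the original loop does, but keep the result
    # in a dict so that pages without tokens can be detected afterwards.
    by_page = {
        page_id: list(group)
        for page_id, group in itertools.groupby(tokens, key=lambda t: t[2])
    }

    pages: List[PageStub] = []
    cursor = 0  # end offset of the most recent non-empty page
    for page_id, (width, height, unit) in enumerate(dims):
        page_tokens = by_page.get(page_id, [])
        if page_tokens:
            start, end = page_tokens[0][0], page_tokens[-1][1]
        else:
            # Blank page: zero-length span placed where the previous page ended.
            start = end = cursor
        pages.append(PageStub(page_id, start, end, width, height, unit))
        cursor = end
    return pages


# Three physical pages; the middle one (page_id 1) has no tokens.
example_tokens = [(0, 5, 0), (6, 11, 0), (12, 20, 2)]
example_dims = [(612.0, 792.0, 1.0)] * 3
example_pages = build_pages(example_tokens, example_dims)
```

With this shape the number of pages always equals len(dims), so the length check in the rasterizer would no longer fail on a blank page, provided dims itself is complete.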

niuzaisheng, Feb 26 '24 17:02