get_pixmap function removes the table and leaves just the content behind

Open anirudhagarwal1 opened this issue 1 year ago • 9 comments

Description of the bug

I have a single page pdf file which has a table inside it. When I load the pdf and try to call the get_pixmap function, it just keeps the content and removes the table around it.

import io
from PIL import Image

pix = page.get_pixmap(alpha=False, dpi=150)
image = Image.open(io.BytesIO(pix.tobytes()))
image.save("temp.jpeg", format='jpeg')

Unfortunately, I won't be able to share this particular PDF on an open platform. Could you suggest how I can debug it further?
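One way to narrow this down without sharing the file, offered as a rough sketch rather than an official debugging procedure: list the vector paths PyMuPDF extracts from the page with `page.get_drawings()` and check whether the table rules are among them. The helper names `summarize_drawings` and `inspect_page` are made up for this illustration; only the `"type"` and `"width"` keys of each drawing dictionary are read.

```python
from collections import Counter


def summarize_drawings(drawings):
    """Count vector paths by paint type and collect the distinct line widths.

    `drawings` is a list of dicts shaped like the output of PyMuPDF's
    page.get_drawings(); only the "type" and "width" keys are read.
    """
    by_type = Counter(d.get("type", "?") for d in drawings)
    widths = sorted({d["width"] for d in drawings if d.get("width") is not None})
    return {"total": len(drawings), "by_type": dict(by_type), "stroke_widths": widths}


def inspect_page(pdf_path, page_index=0):
    """Open the PDF and summarize the vector drawings on one page."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependencies

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        return summarize_drawings(page.get_drawings())
```

Calling `inspect_page("input.pdf")` (a placeholder path) returns a small dictionary. If stroked paths (`"s"` or `"fs"`) are reported but the borders are missing from the pixmap, the lines are present in the page content and the problem is more likely on the rendering side. If no such paths show up, the borders may be drawn some other way.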

Sharing part of a screenshot of this PDF and the image converted from it.

PDF: [Screenshot 2024-05-08 at 1 41 06 AM]

Converted image: [Screenshot 2024-05-08 at 1 42 34 AM]

How to reproduce the bug

It seems to break only with this particular kind of PDF. Other PDFs render fine.

PyMuPDF version

1.24.1

Operating system

MacOS

Python version

3.10

anirudhagarwal1, May 07 '24 20:05

Providing the example file (not just the pictures) is mandatory for submitting a bug.

JorjMcKie, May 07 '24 20:05

Since this document contains some sensitive information, I would not be able to share it on a public forum. I tried to replicate this issue with multiple other PDFs and wasn't able to.

Would it be acceptable if I emailed it to you privately?

anirudhagarwal1, May 08 '24 07:05

Yes, certainly! Please do use this way.

JorjMcKie, May 08 '24 09:05

I have sent the file to your GitHub email address: [email protected]

anirudhagarwal1, May 08 '24 12:05

I have the same issue. When processing a PDF of this paper, the title and table borders were removed. https://arxiv.org/abs/2310.19909 This problem does not occur when using v1.23.26.

mjun0812, May 10 '24 04:05

Please provide the link to an example PDF / page - I need it to report the bug!

JorjMcKie, May 10 '24 08:05

@JorjMcKie Sorry, I should have been more explicit. The following URL is the link to the PDF: https://arxiv.org/pdf/2310.19909 The borders disappear on pages 1, 4, 7, and 8.
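For comparing the output of two installed versions (the thread mentions 1.23.26 and 1.24.1), a minimal sketch could render just the affected pages and tag each file name with the PyMuPDF version in use. The helper names and the file naming scheme are assumptions made for this example, and the version is read defensively via `getattr` in case the `VersionBind` attribute is not present in a given release.

```python
def to_page_indices(page_numbers, page_count):
    """Convert 1-based page numbers to 0-based indices, dropping out-of-range ones."""
    return [n - 1 for n in page_numbers if 1 <= n <= page_count]


def output_name(stem, version, page_number):
    """Build a file name such as 'paper_v1.24.1_p4.png'."""
    return f"{stem}_v{version}_p{page_number}.png"


def render_pages(pdf_path, page_numbers, stem="page", dpi=150):
    """Render the given 1-based pages to PNG files and return the file names."""
    import fitz  # PyMuPDF, imported lazily

    # Version string of the installed PyMuPDF, or "unknown" if unavailable.
    version = getattr(fitz, "VersionBind", "unknown")
    written = []
    with fitz.open(pdf_path) as doc:
        for index in to_page_indices(page_numbers, doc.page_count):
            pix = doc[index].get_pixmap(alpha=False, dpi=dpi)
            name = output_name(stem, version, index + 1)
            pix.save(name)  # output format is inferred from the .png extension
            written.append(name)
    return written
```

Running `render_pages("2310.19909.pdf", [1, 4, 7, 8], stem="paper")` once in each environment (the path is a placeholder for a local copy of the PDF) leaves two sets of PNGs that can be compared side by side.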

mjun0812, May 10 '24 13:05

Problem file: notext.pdf

MuPDF issue reference: https://bugs.ghostscript.com/show_bug.cgi?id=707840

JorjMcKie, Jun 24 '24 12:06

@mjun0812 The file at https://arxiv.org/pdf/2310.19909 no longer seems to show the issue in recent versions. The test file above (notext.pdf) is still a problem.

JorjMcKie, Jun 24 '24 12:06