Access Violation Error

Open Amitdedhia6 opened this issue 2 years ago • 17 comments

My requirement is to remove text from the PDF (i.e. keep images, drawings and rest things as is). Here is the code snippet:

    import os
    import shutil

    import fitz  # PyMuPDF
    from fitz import Point, Rect


    def remove_text_from_pdf(pdf_filepath: str) -> str:
        # Work on a copy so the original file is never modified.
        output_pdf_filepath = pdf_filepath[:-4] + "-text_removed.pdf"
        shutil.copy(pdf_filepath, output_pdf_filepath)
        try:
            doc = fitz.Document(output_pdf_filepath)
            removals = False
            for page in doc:
                found = False
                text_page_blocks = page.get_textpage().extractBLOCKS()
                for block in text_page_blocks:
                    is_text = block[-1] == 0  # block type: 0 = text, 1 = image
                    if is_text:
                        found = True
                        bbox = block[:4]
                        pt1 = Point(bbox[0], bbox[1])
                        pt2 = Point(bbox[2], bbox[3])
                        rect = Rect(pt1, pt2)
                        page.add_redact_annot(rect)

                if found:
                    page.apply_redactions()
                    removals = True
            if removals:
                doc.saveIncr()
            return output_pdf_filepath
        except Exception:
            os.remove(output_pdf_filepath)
            return ""

This works for many documents. However, for some documents it fails with an access violation at the page.apply_redactions() line. Execution simply ends with the message "Process finished with exit code -1073741819 (0xC0000005)". What is wrong here? Is this a known issue? Any suggestions?

I cannot attach the PDF here for privacy reasons. I am using Windows 10, Python 3.9.

Amitdedhia6 avatar Aug 10 '22 07:08 Amitdedhia6
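
The exit code -1073741819 is the signed 32-bit form of 0xC0000005, the Windows access violation status. A crash like this happens in native code, so the `except Exception` clause in the snippet above never gets a chance to run. A minimal sketch of one way to keep the calling process alive, assuming the risky work can be moved into a child interpreter; `run_isolated` and `as_unsigned32` are hypothetical helper names, not part of PyMuPDF:

```python
import subprocess
import sys

ACCESS_VIOLATION = 0xC0000005  # Windows STATUS_ACCESS_VIOLATION


def as_unsigned32(code: int) -> int:
    """Map a signed 32-bit exit code to its unsigned form."""
    return code & 0xFFFFFFFF


def run_isolated(snippet: str, timeout: float = 120.0) -> int:
    """Run Python source in a child interpreter and return its exit code.

    A native crash in the child terminates only the child, so the caller
    can inspect the exit code instead of going down with it.
    """
    result = subprocess.run([sys.executable, "-c", snippet], timeout=timeout)
    return result.returncode
```

The caller could pass the redaction code as `snippet` and compare `as_unsigned32(code)` against `ACCESS_VIOLATION`. How a crash is reported depends on the platform: on POSIX systems `subprocess` reports a negative signal number rather than this Windows status.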

We need all the data required to reproduce the error, as stated in the bug report template, which you did not use. That includes the full version info of PyMuPDF.

JorjMcKie avatar Aug 10 '22 07:08 JorjMcKie

Sorry about that.


Python version 3.9.12 (tags/v3.9.12:b28265d, Mar 23 2022, 23:52:46) [MSC v.1929 64 bit (AMD64)] 
Platform win32 
 
PyMuPDF 1.20.0: Python bindings for the MuPDF 1.20.1 library.
Version date: 2022-06-27 00:00:01.
Built for Python 3.9 on win32 (64-bit).

I will check with my office if I can share the PDF. Will come back in a day.

Amitdedhia6 avatar Aug 10 '22 08:08 Amitdedhia6
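
The version details above can be gathered in one step. A minimal sketch, assuming the PyMuPDF module docstring carries the version banner shown in the output above; `version_report` is a hypothetical helper name:

```python
import importlib
import sys


def version_report() -> str:
    """Collect interpreter, platform, and (if importable) PyMuPDF details."""
    lines = [f"Python version {sys.version}", f"Platform {sys.platform}"]
    try:
        fitz = importlib.import_module("fitz")
    except ImportError:
        lines.append("PyMuPDF is not importable in this environment")
    else:
        # Assumption: the module docstring holds the binding and MuPDF versions.
        lines.append(fitz.__doc__ or "PyMuPDF docstring unavailable")
    return "\n".join(lines)


print(version_report())
```

Pasting this output into a bug report covers the interpreter, the platform, and the library build at once.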

Have you confirmed that your issue is not explained by #1824?

JorjMcKie avatar Aug 10 '22 12:08 JorjMcKie

In other words: handle images differently. You are currently allowing overlapping images to be modified, which segfaults if an image happens to be transparent. Use option PDF_REDACT_IMAGE_NONE or PDF_REDACT_IMAGE_REMOVE.

JorjMcKie avatar Aug 10 '22 12:08 JorjMcKie

Using option PDF_REDACT_IMAGE_NONE or PDF_REDACT_IMAGE_REMOVE did not work. I tried it like this: page.apply_redactions(images=PDF_REDACT_IMAGE_NONE)

However, the original code from my post yesterday works fine with PyMuPDF version 1.19.6.

So I am not sure whether this issue is a duplicate of #1824 or another related issue.

I am attaching a sample document that reproduces the issue on v1.20 even when PDF_REDACT_IMAGE_NONE is passed: new.pdf

Amitdedhia6 avatar Aug 11 '22 03:08 Amitdedhia6

I did not use the unnecessarily complex code above, but this one:

    page.add_redact_annot(page.rect)
    # output: 'Redact' annotation on page 0 of new.pdf
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_REMOVE)
    doc.ez_save("new-remove.pdf")

I also tried PDF_REDACT_IMAGE_NONE; both worked fine.

JorjMcKie avatar Aug 11 '22 06:08 JorjMcKie

But this code also works fine:

    for b in page.get_text("blocks"):
        if b[-1] == 1:
            continue
        page.add_redact_annot(b[:4])
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)

JorjMcKie avatar Aug 11 '22 06:08 JorjMcKie
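
The loop above can be wrapped into a small end-to-end function. This is a sketch under stated assumptions, not the maintainer's code: `text_block_rects` and `remove_text` are hypothetical names, and the block tuples are assumed to end with the block type (0 for text, 1 for image), as the `b[-1] == 1` check implies.

```python
def text_block_rects(blocks):
    """Return the (x0, y0, x1, y1) boxes of text blocks only.

    Each entry is assumed to follow the layout of page.get_text("blocks"),
    where the last item is the block type: 0 for text, 1 for image.
    """
    return [tuple(b[:4]) for b in blocks if b[-1] == 0]


def remove_text(src_path, dst_path):
    """Redact every text block in src_path and save the result to dst_path."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    doc = fitz.open(src_path)
    try:
        for page in doc:
            rects = text_block_rects(page.get_text("blocks"))
            for rect in rects:
                page.add_redact_annot(fitz.Rect(rect))
            if rects:
                # Leave images that overlap a redaction rectangle untouched.
                page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
        doc.save(dst_path, garbage=3, deflate=True)
    finally:
        doc.close()
```

Writing to a separate output file avoids the copy-then-save-incrementally step of the original snippet, and closing the document in `finally` releases the file handle even if a Python-level error occurs.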

So I cannot reproduce your problem.

JorjMcKie avatar Aug 11 '22 07:08 JorjMcKie

STOP: I was not using the right PyMuPDF version. There is one more bug in the most current version, which will be fixed in the next one.

JorjMcKie avatar Aug 11 '22 07:08 JorjMcKie

This preliminary version should work: PyMuPDF-1.20.2-cp39-cp39-win_amd64.zip

JorjMcKie avatar Aug 11 '22 07:08 JorjMcKie

duplicate of #1824

JorjMcKie avatar Aug 11 '22 11:08 JorjMcKie

New release PyMuPDF-1.20.2 fixes #1824, so please try with pip install --upgrade pymupdf.

The issue did not occur with the document I attached earlier in this conversation. However, that document was generated by extracting one page from the original document and editing some text for privacy reasons. The issue still occurs with the original document. I will try to generate an anonymized copy of the original document that reproduces the issue. Give me some time.

Amitdedhia6 avatar Aug 16 '22 02:08 Amitdedhia6

Okay, here is the new document. This one results in an access violation error: new_20220816.pdf

Environment details:

PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library.
Version date: 2022-08-13 00:00:01.
Built for Python 3.9 on win32 (64-bit).

Amitdedhia6 avatar Aug 16 '22 04:08 Amitdedhia6

Confirmed, the problem posed by this PDF is not resolved by the latest PyMuPDF / MuPDF version (1.20.3).

The MuPDF version in development however does resolve it. So you will have to wait for the next version.

JorjMcKie avatar Aug 16 '22 06:08 JorjMcKie

Okay, thank you for the information. When is the next version of MuPDF likely to be released? Is there a tentative date?

Amitdedhia6 avatar Aug 16 '22 07:08 Amitdedhia6

I don't know - presumably later this year.

JorjMcKie avatar Aug 16 '22 07:08 JorjMcKie

I have the same issue with the following code:

import fitz

doc = fitz.open(fp)
page = doc[0]
page.clean_contents()

PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library. Version date: 2022-08-13 00:00:01.

Pandaaaa906 avatar Oct 10 '22 02:10 Pandaaaa906

Fixed in 1.21.0
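
Code that has to run on machines with mixed PyMuPDF installs could check the version before redacting. A minimal sketch, assuming plain dotted version strings such as those printed earlier in this thread; `version_tuple` and `has_fix` are hypothetical helper names:

```python
import itertools


def version_tuple(version: str) -> tuple:
    """Turn a dotted version such as "1.20.2" into a tuple of ints."""
    parts = []
    for piece in version.split("."):
        # Keep only the leading digits of each piece, e.g. "0rc1" -> 0.
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def has_fix(installed: str, fixed_in: str = "1.21.0") -> bool:
    """True if the installed version is at or past the release with the fix."""
    return version_tuple(installed) >= version_tuple(fixed_in)
```

With PyMuPDF installed, the check might be called as `has_fix(fitz.VersionBind)`, on the assumption that `fitz.VersionBind` holds the binding's version string. Both strings should have the same number of components, since a shorter tuple compares as smaller.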