PyMuPDF
Access Violation Error
My requirement is to remove text from the PDF (i.e. keep images, drawings and rest things as is). Here is the code snippet:
import os
import shutil

import fitz  # PyMuPDF
from fitz import Point, Rect


def remove_text_from_pdf(pdf_filepath: str):
    # Work on a copy so the original file is never modified.
    output_pdf_filepath = pdf_filepath[:-4] + "-text_removed.pdf"
    shutil.copy(pdf_filepath, output_pdf_filepath)
    try:
        doc = fitz.Document(output_pdf_filepath)
        removals = False
        for page in doc:
            found = False
            text_page_blocks = page.get_textpage().extractBLOCKS()
            for block in text_page_blocks:
                is_text = block[-1] == 0  # block type: 0 = text, 1 = image
                if is_text:
                    found = True
                    bbox = block[:4]
                    pt1 = Point(bbox[0], bbox[1])
                    pt2 = Point(bbox[2], bbox[3])
                    rect = Rect(pt1, pt2)
                    page.add_redact_annot(rect)
            if found:
                page.apply_redactions()
                removals = True
        if removals:
            doc.saveIncr()
        return output_pdf_filepath
    except Exception as e:
        os.remove(output_pdf_filepath)
        return ""
This works for many documents. However, for some documents it throws an Access Violation Error at the page.apply_redactions() line. The code execution simply ends with the message "Process finished with exit code -1073741819 (0xC0000005)". What is wrong here? Is there a known issue? Any suggestions?
I cannot attach the PDF here due to privacy. I am using Windows 10, Python 3.9.
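An access violation is raised in native code, so the `except Exception` branch in the snippet above never runs and the interpreter itself is terminated. The following is a minimal sketch, not something from PyMuPDF: `to_ntstatus` and `run_isolated` are hypothetical helper names. It shows how the signed exit code maps to the hexadecimal status in the message, and how a risky call could be moved into a child process so that the parent survives a crash.

```python
import subprocess
import sys


def to_ntstatus(exit_code: int) -> str:
    """Render a signed 32-bit Windows exit code as an unsigned hex status."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


def run_isolated(source: str, timeout: float = 60.0) -> int:
    """Run Python source in a child interpreter and return its exit code.

    A native crash in the child ends only the child process.
    """
    result = subprocess.run([sys.executable, "-c", source], timeout=timeout)
    return result.returncode


if __name__ == "__main__":
    print(to_ntstatus(-1073741819))  # 0xC0000005
    print(run_isolated("import sys; sys.exit(3)"))  # 3
```

In practice the `source` string would open the PDF and call `apply_redactions()`; a non-zero return code then tells the parent that this document needs to be skipped or reported.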
We need all data for reproducing the error as said in the bug report template - which you did not use. Among them the full version info of PyMuPDF.
Sorry about that.
Python version 3.9.12 (tags/v3.9.12:b28265d, Mar 23 2022, 23:52:46) [MSC v.1929 64 bit (AMD64)]
Platform win32
PyMuPDF 1.20.0: Python bindings for the MuPDF 1.20.1 library.
Version date: 2022-06-27 00:00:01.
Built for Python 3.9 on win32 (64-bit).
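For reference, the details above can be collected in one go. This is a sketch under the assumption that `fitz.__doc__` holds the PyMuPDF version banner, as the output quoted above suggests; `collect_version_info` is a hypothetical helper name.

```python
import platform
import sys


def collect_version_info() -> dict:
    """Gather the interpreter, platform and PyMuPDF details a bug report needs."""
    info = {
        "python": sys.version,
        "platform": sys.platform,
        "machine": platform.machine(),
    }
    try:
        import fitz  # PyMuPDF

        info["pymupdf"] = fitz.__doc__  # assumed to be the version banner
    except ImportError:
        info["pymupdf"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_version_info().items():
        print(f"{key}: {value}")
```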
I will check with my office if I can share the PDF. Will come back in a day.
Have you confirmed that your issue is not explained by #1824?
In other words: handle images differently. You are currently allowing overlapping images to be modified, which segfaults if an image happens to be transparent.
Use option PDF_REDACT_IMAGE_NONE or PDF_REDACT_IMAGE_REMOVE.
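A minimal sketch of what passing this option could look like, assuming the constant is spelled `fitz.PDF_REDACT_IMAGE_NONE` as in the code later in this thread. `text_block_rects` and `redact_text` are hypothetical helper names, and the sketch does not claim to avoid the crash on affected versions.

```python
def text_block_rects(blocks):
    """Return (x0, y0, x1, y1) for text blocks only.

    The last item of each block tuple is the block type: 0 = text, 1 = image.
    """
    return [tuple(b[:4]) for b in blocks if b[-1] == 0]


def redact_text(page, images=None):
    """Redact all text blocks on a page, leaving images alone by default."""
    import fitz  # PyMuPDF, imported lazily so the helper above works without it

    if images is None:
        images = fitz.PDF_REDACT_IMAGE_NONE
    rects = text_block_rects(page.get_text("blocks"))
    for rect in rects:
        page.add_redact_annot(rect)
    if rects:
        page.apply_redactions(images=images)
    return len(rects)
```

A caller would loop over the document with `for page in doc: redact_text(page)` and save afterwards.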
Using option PDF_REDACT_IMAGE_NONE or PDF_REDACT_IMAGE_REMOVE did not work. I tried it like this: page.apply_redactions(images=PDF_REDACT_IMAGE_NONE)
However, the original code from my post yesterday works fine with PyMuPDF version 1.19.6.
So I am not sure whether this issue is a duplicate of #1824 or another related issue.
I am attaching a sample document where the issue can be reproduced on v1.20 even after passing the PDF_REDACT_IMAGE_NONE parameter: new.pdf
I did not use the unnecessarily complex code above, but this one:
page.add_redact_annot(page.rect)
'Redact' annotation on page 0 of new.pdf
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_REMOVE)
doc.ez_save("new-remove.pdf")
Likewise with PDF_REDACT_IMAGE_NONE; both worked fine.
But this code also works fine:
for b in page.get_text("blocks"):
    if b[-1] == 1:  # skip image blocks
        continue
    page.add_redact_annot(b[:4])
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
So I cannot reproduce your problem.
STOP: I made a mistake and did not test with the right PyMuPDF version. There is one more bug in the most current version, which will be fixed in the next one.
This preliminary version should work: PyMuPDF-1.20.2-cp39-cp39-win_amd64.zip
duplicate of #1824
New release PyMuPDF-1.20.2 fixes #1824, so please try with pip install --upgrade pymupdf.
The issue did not occur with the document I attached earlier in this conversation. However, that document was generated by extracting one page from the original document and editing some text for privacy reasons. The issue still occurs with the original document. I will try to generate an anonymized copy of the original document with which the issue can be reproduced. Give me some time.
Okay here is the new document. This one results in access violation error. new_20220816.pdf
Environment details:
PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library.
Version date: 2022-08-13 00:00:01.
Built for Python 3.9 on win32 (64-bit).
Confirmed, the problem posed by this PDF is not resolved by the latest PyMuPDF / MuPDF version (1.20.3).
The MuPDF version in development however does resolve it. So you will have to wait for the next version.
Okay, thank you for the information. When is the next version of MuPDF likely to release? Is there any tentative date?
I don't know - presumably later this year.
I have the same issue with the following code:
import fitz
doc = fitz.open(fp)
page = doc[0]
page.clean_contents()
PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library. Version date: 2022-08-13 00:00:01.
Fixed in 1.21.0
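Code that has to run on several installations could check the installed version before calling the affected methods. This is a sketch under stated assumptions: the 1.21.0 threshold is taken from the comment above, `parse_version` and `has_crash_fix` are hypothetical helper names, and the version string is assumed to come from `fitz.VersionBind`.

```python
import re


def parse_version(text: str) -> tuple:
    """Turn a dotted version such as "1.20.2" into a tuple of up to three ints."""
    return tuple(int(number) for number in re.findall(r"\d+", text)[:3])


def has_crash_fix(version_text: str) -> bool:
    """True when the version is at least 1.21.0, where the thread reports the fix."""
    return parse_version(version_text) >= (1, 21, 0)


if __name__ == "__main__":
    print(has_crash_fix("1.20.2"))  # False
    print(has_crash_fix("1.21.0"))  # True
```

With PyMuPDF installed, the check would read `has_crash_fix(fitz.VersionBind)`.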