pypdf
More flexible remove_annots_from_page function
I would like to dynamically remove certain annotations from a page but not others. I solved it like this:
from pypdf import PageObject, PdfWriter, PdfReader
from pypdf.constants import PageAttributes as PG
from pypdf.generic import NullObject, IndirectObject, ArrayObject, DictionaryObject
from typing import cast, Union, Optional, Callable


class MyPdfWriter(PdfWriter):
    """
    Remove annotations by custom delete_decide_function.

    Args:
        delete_decide_function: Function that takes two arguments,
            ArrayObject and DictionaryObject, and decides whether to remove
            them from the page. For example:

            def is_google_link(an: ArrayObject, obj: DictionaryObject) -> bool:
                try:
                    uri = obj['/A']['/URI']
                    return uri.startswith('https://google.com/')
                except KeyError:
                    return False
    """

    def remove_annots_from_page(
        self,
        page: Union[IndirectObject, PageObject, DictionaryObject],
        delete_decide_function: Optional[Callable] = None
    ) -> None:
        # based on https://github.com/py-pdf/pypdf/blob/3de03b75bc6c63e97dc682428eac8e4e8aa9276c/pypdf/_writer.py#L1922
        page = cast(DictionaryObject, page.get_object())
        if PG.ANNOTS in page:
            i = 0
            while i < len(cast(ArrayObject, page[PG.ANNOTS])):
                an = cast(ArrayObject, page[PG.ANNOTS])[i]
                obj = cast(DictionaryObject, an.get_object())
                if delete_decide_function is None or delete_decide_function(an, obj):
                    if isinstance(an, IndirectObject):
                        self._objects[an.idnum - 1] = NullObject()  # to reduce PDF size
                    del page[PG.ANNOTS][i]  # type:ignore
                else:
                    i += 1


def is_sciwheel(an: ArrayObject, obj: DictionaryObject) -> bool:
    try:
        uri = obj['/A']['/URI']
        return uri.startswith('https://sciwheel.com/')
    except KeyError:
        return False


def remove_pdf_links(in_pdf, out_pdf):
    pdf = MyPdfWriter(clone_from=in_pdf)
    for page in pdf.pages:
        # print first line of page
        print(page.extract_text().split('\n')[0])
        # remove sciwheel.com hyperlinks from page
        # new_pdf._remove_annots_from_page(page, subtypes=("/Link",) )
        pdf.remove_annots_from_page(page, is_sciwheel)
    pdf.write(out_pdf)
I think my remove_annots_from_page function is more flexible than the existing _remove_annots_from_page, so I thought I'd share it.
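For anyone reading along, the core of the method is a delete-while-iterating loop: the index only moves forward when an entry is kept, because a deletion shifts the next entry into the current position. Here is a minimal sketch of that pattern on a plain Python list, with no pypdf involved. The name remove_where is hypothetical and only for illustration; as in the method above, passing no predicate removes everything.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def remove_where(
    items: List[T],
    should_delete: Optional[Callable[[T], bool]] = None,
) -> int:
    """Delete matching entries from items in place and return how many were removed.

    Hypothetical helper for illustration only; it is not part of pypdf.
    """
    removed = 0
    i = 0
    while i < len(items):
        if should_delete is None or should_delete(items[i]):
            # After the deletion the next entry sits at index i,
            # so the index must not advance here.
            del items[i]
            removed += 1
        else:
            i += 1
    return removed


links = [
    "https://sciwheel.com/a",
    "https://example.org/",
    "https://sciwheel.com/b",
]
removed = remove_where(links, lambda u: u.startswith("https://sciwheel.com/"))
```

After this runs, links holds only the example.org entry. Iterating with a plain for loop while deleting would skip the entry that follows each removed one, which is why the index is managed by hand.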
@MrTomRod
I would recommend that you first open the PDF directly in a PdfWriter object using the clone_from parameter.
Once it is loaded, you will be able to remove the annotations you want. You should have a look at https://pypdf.readthedocs.io/en/stable/_modules/pypdf/_writer.html#PdfWriter.remove_links for inspiration.
I don't think the existing API enables me to do what I want, i.e., to remove only certain hyperlinks, namely those that start with https://sciwheel.com/.
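To make the "only links with a given prefix" idea concrete, here is a rough sketch of a predicate factory that produces functions with the same (an, obj) signature as delete_decide_function. The name make_uri_prefix_predicate is hypothetical, and the annotations are modeled as plain dicts so the snippet runs without a PDF. It assumes the URI sits under /A and /URI, as in the code above, and treats anything else (a missing key, or a value that is not a string) as "do not delete".

```python
from typing import Any, Callable, Mapping


def make_uri_prefix_predicate(prefix: str) -> Callable[[Any, Mapping[str, Any]], bool]:
    """Build a predicate that matches annotations whose /A /URI starts with prefix.

    Hypothetical helper for illustration only; it is not part of pypdf.
    """

    def predicate(an: Any, obj: Mapping[str, Any]) -> bool:
        try:
            uri = obj["/A"]["/URI"]
        except (KeyError, TypeError):
            # No action dictionary or no URI entry: keep the annotation.
            return False
        return isinstance(uri, str) and uri.startswith(prefix)

    return predicate


is_sciwheel = make_uri_prefix_predicate("https://sciwheel.com/")

# Plain dicts standing in for annotation dictionaries.
sciwheel_link = {"/Subtype": "/Link", "/A": {"/S": "/URI", "/URI": "https://sciwheel.com/work/1"}}
other_link = {"/Subtype": "/Link", "/A": {"/S": "/URI", "/URI": "https://example.org/"}}
internal_link = {"/Subtype": "/Link", "/Dest": "chapter1"}
```

With this, one predicate per domain can be built without writing a new function each time, and internal links without a /A entry are left alone.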
Thanks for the clone_from hint, it's much cleaner now. I adapted the code above.
I agree that the existing functions may not be adequate for your use case, but you can copy _remove_annots_from_page() and adjust it as you wish.
Yes, I already managed to do what I wanted. The point is that my solution is more flexible and would imo improve your library.
Feel free to close this issue if you don't think it's a good suggestion.
If you think you can propose a solution that will improve pypdf, feel free to open a PR.