More flexible remove_annots_from_page function

Open MrTomRod opened this issue 1 year ago • 5 comments

I would like to dynamically remove certain annotations from a page but not others. I solved it like this:

from pypdf import PageObject, PdfWriter, PdfReader
from pypdf.constants import PageAttributes as PG
from pypdf.generic import NullObject, IndirectObject, ArrayObject, DictionaryObject
from typing import cast, Union, Optional, Callable


class MyPdfWriter(PdfWriter):
    """PdfWriter that can remove annotations selected by a callback."""

    def remove_annots_from_page(
            self,
            page: Union[IndirectObject, PageObject, DictionaryObject],
            delete_decide_function: Optional[Callable] = None
    ) -> None:
        """
        Remove annotations by custom delete_decide_function.

        Args:
            page: Page whose /Annots array is filtered in place.
            delete_decide_function: Function that takes two arguments, the
                raw entry of the /Annots array (usually an IndirectObject)
                and the resolved annotation as a DictionaryObject, and
                returns True if the annotation should be removed. If None,
                every annotation is removed. For example:
                    def is_google_link(an, obj: DictionaryObject) -> bool:
                        try:
                            uri = obj['/A']['/URI']
                            return uri.startswith('https://google.com/')
                        except KeyError:
                            return False
        """
        # based on https://github.com/py-pdf/pypdf/blob/3de03b75bc6c63e97dc682428eac8e4e8aa9276c/pypdf/_writer.py#L1922
        page = cast(DictionaryObject, page.get_object())
        if PG.ANNOTS in page:
            annots = cast(ArrayObject, page[PG.ANNOTS])
            i = 0
            while i < len(annots):
                an = annots[i]
                obj = cast(DictionaryObject, an.get_object())
                if delete_decide_function is None or delete_decide_function(an, obj):
                    if isinstance(an, IndirectObject):
                        self._objects[an.idnum - 1] = NullObject()  # to reduce PDF size
                    del annots[i]
                else:
                    i += 1  # only advance when nothing was deleted at index i


def is_sciwheel(an: Union[IndirectObject, DictionaryObject], obj: DictionaryObject) -> bool:
    try:
        uri = obj['/A']['/URI']
        return uri.startswith('https://sciwheel.com/')
    except KeyError:
        return False


def remove_pdf_links(in_pdf, out_pdf):
    pdf = MyPdfWriter(clone_from=in_pdf)

    for page in pdf.pages:
        # print first line of page
        print(page.extract_text().split('\n')[0])

        # remove sciwheel.com hyperlinks from page
        # (the built-in alternative would remove every link annotation:
        #  pdf._remove_annots_from_page(page, subtypes=("/Link",)))
        pdf.remove_annots_from_page(page, is_sciwheel)

    pdf.write(out_pdf)

I think my remove_annots_from_page function is an improvement over the existing _remove_annots_from_page, so I wanted to share it.
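
A predicate like is_sciwheel could be generalized into a small factory, so the same writer method serves any URL prefix. The sketch below is illustrative only: `make_uri_prefix_predicate` is a hypothetical helper name, not part of pypdf, and it assumes only that the resolved annotation can be indexed like a dictionary with an optional `/A` action holding a `/URI` entry. Plain dicts stand in for annotation dictionaries so the example runs without a PDF.

```python
from typing import Any, Callable


def make_uri_prefix_predicate(prefix: str) -> Callable[[Any, Any], bool]:
    """Build a two-argument predicate (hypothetical helper, not pypdf API)."""

    def predicate(an: Any, obj: Any) -> bool:
        # `an` is the raw /Annots entry; it is unused here but kept so the
        # signature matches the two-argument callback described above.
        try:
            uri = obj["/A"]["/URI"]
        except (KeyError, TypeError):
            # No action, no URI, or an action that is not a dictionary.
            return False
        # Only string URIs can be compared against the prefix.
        return isinstance(uri, str) and uri.startswith(prefix)

    return predicate


is_sciwheel_link = make_uri_prefix_predicate("https://sciwheel.com/")

# Plain dicts stand in for annotation dictionaries in this demonstration.
link = {"/Subtype": "/Link", "/A": {"/URI": "https://sciwheel.com/work/123"}}
other = {"/Subtype": "/Link", "/A": {"/URI": "https://example.org/"}}
highlight = {"/Subtype": "/Highlight"}

print(is_sciwheel_link(None, link))
print(is_sciwheel_link(None, other))
print(is_sciwheel_link(None, highlight))
```

Under these assumptions, the returned function could be passed as delete_decide_function in place of is_sciwheel. Catching TypeError in addition to KeyError is a design choice that makes the predicate tolerate malformed actions instead of raising.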

MrTomRod • May 03 '23 08:05

@MrTomRod I would recommend first opening the PDF directly in a PdfWriter object using the clone_from parameter. Once it is loaded, you will be able to remove the annotations you want. Have a look at https://pypdf.readthedocs.io/en/stable/_modules/pypdf/_writer.html#PdfWriter.remove_links for inspiration.

pubpub-zz • May 03 '23 08:05

I don't think the existing API enables me to do what I want, i.e., to remove only certain hyperlinks, namely those that start with https://sciwheel.com/.

Thanks for the clone_from hint, it's much cleaner now. I adapted the code above.

MrTomRod • May 03 '23 08:05

I agree that the existing functions may not be adequate for you, but you can copy _remove_annots_from_page() and adjust it as you wish.
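
One way to adapt the loop without subclassing PdfWriter is to pull the in-place filtering out into a free function. The following is a minimal sketch, not pypdf API: `remove_matching` and `targets_sciwheel` are hypothetical names, and the function assumes the /Annots value behaves like a mutable list whose entries may expose get_object(). It also skips the step that overwrites removed objects with NullObject, which the original code comments as being there to reduce PDF size.

```python
from typing import Any, Callable, List, MutableSequence


def remove_matching(
    annots: MutableSequence[Any],
    should_delete: Callable[[Any, Any], bool],
) -> List[Any]:
    """Delete matching entries from annots in place; return the removed ones."""
    removed = []
    # Walk backwards so deleting an entry never shifts indices still to visit.
    for i in range(len(annots) - 1, -1, -1):
        entry = annots[i]
        # Resolve indirect references when the entry supports it; plain dicts,
        # used in the demo below, are treated as already resolved.
        resolved = entry.get_object() if hasattr(entry, "get_object") else entry
        if should_delete(entry, resolved):
            removed.append(entry)
            del annots[i]
    removed.reverse()  # report removed entries in their original order
    return removed


def targets_sciwheel(entry: Any, obj: Any) -> bool:
    try:
        return obj["/A"]["/URI"].startswith("https://sciwheel.com/")
    except KeyError:
        return False


# Plain dicts stand in for annotation dictionaries in this demonstration.
annots = [
    {"/A": {"/URI": "https://sciwheel.com/a"}},
    {"/A": {"/URI": "https://example.org/"}},
    {"/Subtype": "/Highlight"},
    {"/A": {"/URI": "https://sciwheel.com/b"}},
]
removed = remove_matching(annots, targets_sciwheel)
print(len(removed), "removed;", len(annots), "kept")
```

Iterating from the end is an alternative to the while loop with a manually advanced index: both avoid skipping the element that slides into a freed slot after a deletion.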

pubpub-zz • May 03 '23 10:05

Yes, I already managed to do what I wanted. The point is that my solution is more flexible and would imo improve your library.

Feel free to close this issue if you don't think it's a good suggestion.

MrTomRod • May 03 '23 11:05

If you think you can propose a solution that will improve pypdf, feel free to open a PR.

pubpub-zz • May 03 '23 18:05