
Improvement and suggestion

Open diavral opened this issue 1 year ago • 0 comments

I started using PyPDF2 less than 24 hours ago, so this may be my own mistake. I ran into an error when using extract_text, and I also have a suggestion for extract_text and a correction for the documentation.

Environment

(PDFProcess) E:\pyProject\PDFProcess>python -m platform
Windows-10-10.0.19041-SP0 (Windows Home, Chinese edition)
(PDFProcess) E:\pyProject\PDFProcess>python -c "import PyPDF2,sys;print(PyPDF2.__version__,sys.version,sep='###')"
2.10.9###3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)]

PDF-> UnicodeCharts

from PyPDF2 import PdfReader

reader = PdfReader("Unicode/CodeCharts_15.0.0.pdf")
page_0 = reader.pages[0]
page_0.extract_text()

#bug

Location: PyPDF2/generic/_data_structures.py --> class ContentStream(DecodedStreamObject)::__init__, at approximately line 690. The check if data[-1] != b"\n": raises IndexError when data == b"". It could be changed to an if-elif statement:

if len(data) == 0:
    pass
elif data[-1] != b"\n":
    data += b"\n"

or just change to:

if len(data) == 0 or data[-1] != b"\n":
    data += b"\n"
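A minimal sketch of the same guard, written so that it never indexes into an empty value. The helper name ensure_trailing_newline is hypothetical and is not part of PyPDF2; it only illustrates the check. Like the second variant above, it appends a newline to empty data. Note also that in Python 3, indexing a bytes object returns an int, so data[-1] never compares equal to b"\n"; bytes.endswith avoids both problems.

```python
def ensure_trailing_newline(data: bytes) -> bytes:
    """Return data terminated by a newline byte.

    Hypothetical helper for illustration only, not PyPDF2 code.
    bytes.endswith works on empty input, so no IndexError can occur.
    """
    if not data.endswith(b"\n"):
        data += b"\n"
    return data


# Indexing bytes yields an int in Python 3, so this comparison is never equal.
last_byte = b"BT\n"[-1]
assert last_byte == 10
assert last_byte != b"\n"

# Empty input is handled without raising.
assert ensure_trailing_newline(b"") == b"\n"
```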

#suggestion

Location: PyPDF2/_page.py --> class PageObject(DictionaryObject)::_extract_text --> function process_operation --> elif operator == b"Tj":, at approximately line 1514 (I am not sure of the exact spot yet). When I call page_num.extract_text() (with the fix above applied), I get a string without any suitable separator such as '\n' to break it into lines. I tried adding a line-break marker between # fmt: on and else: return None:

                # fmt: on
    text+="*LineBreak*"
else:
    return None

This works on pages of plain text but gives poor results for other layouts such as tables. I know too little about where line breaks should be inserted, so I think it is necessary to add a new argument, for example def extract_text(sep: str = ""):, and implement it properly.
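To make the proposal concrete, here is a toy sketch of how a sep argument could behave. It does not use PyPDF2 internals: the function extract_text_sketch, the Operation type and the sample operations are all hypothetical, and a content stream is modelled simply as a list of (operands, operator) pairs. The sketch places sep between the strings shown by successive Tj operators, which is only one possible placement.

```python
from typing import Iterable, List, Tuple

# Hypothetical stand-in for parsed content-stream operations:
# each entry is (operands, operator), e.g. (["Hello"], b"Tj").
Operation = Tuple[List[str], bytes]


def extract_text_sketch(operations: Iterable[Operation], sep: str = "") -> str:
    """Join the text shown by Tj operators, placing sep between the pieces.

    Toy illustration of the proposed argument, not PyPDF2's implementation.
    """
    parts = []
    for operands, operator in operations:
        if operator == b"Tj":  # "show text" operator
            parts.append(operands[0])
    return sep.join(parts)


demo_ops = [
    (["Hello"], b"Tj"),
    ([], b"T*"),  # other operators are ignored in this sketch
    (["World"], b"Tj"),
]

print(repr(extract_text_sketch(demo_ops)))            # default: no separator
print(repr(extract_text_sketch(demo_ops, sep="\n")))  # caller-chosen separator
```

Keeping the default at "" would leave the current output unchanged for existing callers.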

#document

Location: docs/user/reading-pdf-annotations.md --> Attachments. The example code raises a NameError:

attachments = {}
for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/FileAttachment":
                fileobj = annotobj["/FS"]
                attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].get_data()

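for annotation --> subtype = annot --> fileobj = annotobj: the loop variable is annotation, but the following lines use annot and annotobj, which are never defined. The variable names in the example above should be made consistent.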
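One way to make the names consistent is sketched below. It assumes that annotobj was meant to be the resolved annotation object, i.e. the result of annotation.get_object(); that reading is a guess. The wrapper function collect_file_attachments is hypothetical and only exists so the loop can be reused; reader is expected to be a PdfReader.

```python
def collect_file_attachments(reader):
    """Map attachment file names to their raw bytes.

    Hypothetical wrapper around the documentation example, with a single
    variable name used for the annotation throughout.
    """
    attachments = {}
    for page in reader.pages:
        if "/Annots" in page:
            for annotation in page["/Annots"]:
                # Assumption: "annotobj" in the docs meant this resolved object.
                annotation_obj = annotation.get_object()
                if annotation_obj["/Subtype"] == "/FileAttachment":
                    fileobj = annotation_obj["/FS"]
                    attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].get_data()
    return attachments
```

Usage would be attachments = collect_file_attachments(reader), with reader created as in the documentation example.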

diavral, Sep 22 '22 11:09