Improvement and suggestion
I have been using PyPDF2 for less than 24 hours, so the problem may be on my side. I have three things to report: an error when using extract_text, a suggestion for extract_text, and a mistake in the documentation.
Environment
(PDFProcess) E:\pyProject\PDFProcess>python -m platform
Windows-10-10.0.19041-SP0 (Windows 10 Home, Chinese edition)
(PDFProcess) E:\pyProject\PDFProcess>python -c "import PyPDF2,sys;print(PyPDF2.__version__,sys.version,sep='###')"
2.10.9###3.9.1 (tags/v3.9.1:1e5d33e, Dec 7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)]
PDF: Unicode code charts (CodeCharts_15.0.0.pdf)
from PyPDF2 import PdfReader

reader = PdfReader("Unicode/CodeCharts_15.0.0.pdf")
page_0 = reader.pages[0]
page_0.extract_text()  # raises IndexError, see #bug
#bug
Location: PyPDF2/generic/_data_structures.py --> class ContentStream(DecodedStreamObject)::__init__
At approximately line 690, the check if data[-1] != b"\n": raises IndexError when data == b"".
Maybe it should be changed to an if-elif statement:
if len(data) == 0:
    pass
elif data[-1] != b"\n":
    data += b"\n"
or simply changed to:
if len(data) == 0 or data[-1] != b"\n":
    data += b"\n"
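A minimal sketch of the same guard, written as a standalone function rather than as a patch to ContentStream.__init__. The name ensure_trailing_newline is hypothetical, and the sketch assumes data is a bytes object. It uses bytes.endswith, which is safe on empty input. As a side note, in Python 3 data[-1] on a bytes object is an int, so comparing it with b"\n" never reports equality; endswith avoids that as well. Like the second variant above, it appends a newline to empty data.

```python
def ensure_trailing_newline(data: bytes) -> bytes:
    """Return data terminated by a newline (hypothetical helper, not PyPDF2 API)."""
    # endswith() works on b"" and compares bytes with bytes,
    # so it cannot raise IndexError the way data[-1] does.
    if not data.endswith(b"\n"):
        data += b"\n"
    return data


if __name__ == "__main__":
    print(ensure_trailing_newline(b""))
    print(ensure_trailing_newline(b"BT ET"))
```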
#suggestion
Location: PyPDF2/_page.py --> class PageObject(DictionaryObject)::_extract_text --> function process_operation --> elif operator == b"Tj":
Approximately line 1514; I am not sure yet.
When I call page_num.extract_text() (with the fix above applied), I get a string with no separator such as '\n' to break or split the lines.
I tried adding a line-break marker between # fmt: on and else: return None:
# fmt: on
text += "*LineBreak*"
else:
    return None
This works on pages of plain text, but it performs poorly on other layouts such as tables.
I do not know enough to say where the right place to add line breaks is.
So I think it would be useful to add a new argument, for example def extract_text(sep: str = ""):, and then implement it.
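To illustrate what such an argument might do, here is a toy model of the experiment described above. It is not PyPDF2's implementation: the function name extract_text_sketch is hypothetical, and the input is assumed to be a list of (operands, operator) pairs in which only the Tj operator carries text as a plain str.

```python
def extract_text_sketch(operations, sep=""):
    """Concatenate the text of every Tj operation, adding sep after each one.

    Toy model only: real text extraction also handles fonts, encodings,
    positioning operators and other text-showing operators.
    """
    parts = []
    for operands, operator in operations:
        if operator == b"Tj":
            parts.append(operands[0] + sep)
    return "".join(parts)


if __name__ == "__main__":
    ops = [(["Hello"], b"Tj"), ([], b"ET"), (["World"], b"Tj")]
    print(repr(extract_text_sketch(ops)))
    print(repr(extract_text_sketch(ops, sep="\n")))
```

With the default sep="" the output matches the current behaviour of running fragments together. A separator after every Tj is also why tables come out badly, since each cell is typically drawn by its own text operation.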
#document
Location: docs/user/reading-pdf-annotations.md --> Attachments
The example code raises a NameError:
attachments = {}
for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/FileAttachment":
                fileobj = annotobj["/FS"]
                attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].get_data()
The loop is declared as for annotation, but its body uses two names that are never defined:
--> subtype = annot
--> fileobj = annotobj
The variable names in the example above should be made consistent.
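One possible consistent version of the documented example, as a sketch. The wrapper function collect_attachments and the name annotation_obj are my own choices, not names from the documentation. The function takes any iterable of page-like mappings, so it should accept reader.pages.

```python
def collect_attachments(pages):
    """Return {filename: data} for every /FileAttachment annotation.

    Hypothetical helper; pass reader.pages from a PdfReader.
    """
    attachments = {}
    for page in pages:
        if "/Annots" not in page:
            continue
        for annotation in page["/Annots"]:
            # Resolve the reference once and reuse the resolved object.
            annotation_obj = annotation.get_object()
            if annotation_obj["/Subtype"] == "/FileAttachment":
                fileobj = annotation_obj["/FS"]
                attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].get_data()
    return attachments
```

Usage would then be attachments = collect_attachments(reader.pages).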