pypdf
pypdf copied to clipboard
Functionality of b_
B_CACHE: Dict[Union[str, bytes], bytes] = {}
def b_(s: Union[str, bytes]) -> bytes:
if isinstance(s, bytes):
return s
bc = B_CACHE
if s in bc:
return bc[s]
try:
r = s.encode("latin-1")
if len(s) < 2:
bc[s] = r
return r
except Exception:
r = s.encode("utf-8")
if len(s) < 2:
bc[s] = r
return r
It is non-obvious why we are only caching strings of length zero (the empty string) and of length one.
There is a need to add comments to this function.
I'm not sure this is at will. you may found a way to improve performances...
This has been added/changed in #124 by @mozbugbox.
This has been added/changed in #124 by @mozbugbox.
The first of the commit messages is "Speedup b_() for short string for Python3", which suggests the current functionality is deliberate.
The code tries latin-1 and if there is an exception uses utf-8. Given the rise of utf-8, I think we should just use utf-8.
@j-t-1 I recommend you to further dig into the commit history if unsure about specific cases - the latin-1 workaround has been required for some files. Doing a "blame" shows you that the corresponding change is from #463 and there already has been a request for an example file about two months ago: https://github.com/py-pdf/pypdf/pull/463#issuecomment-2067587535
@pubpub-zz
Line 226 of test_generic.py:
# to test latin-1 aka stdencoding
Is this equivalent to this (and if so we could change it)?
# to test PDFDocEncoding (latin-1)
Aka =also known as Your rewording is correct
@j-t-1 what should we do about his issue ?
Each time through the function we have an if statement to decide if a string of length zero or one is in the cache. If it is not, we have another if statement deciding whether len(s) < 2. If we remove the cache we would remove at least one if statement every time b_ is invoked.
Keeping it as is, only make sense if:
- A lot of calls to b_ satisfy len(s) < 2
- s.encode("latin-1") is an expensive operation compared with an if statement and cache retrieval of a string of length 0 or 1.
To gauge 1. depends on particular PDF and use case because b_ is used in a few places. How would we test 2.?
dis.dis("".encode)
TypeError: don't know how to disassemble builtin_function_or_method objects
This comment suggests we keep b_ as is.
With regard to 1: In my experience, automated reporting frameworks and 'document to PDF' conversion tools LOVE to use a 'render every character one at a time' paradigm for creating a 'justified' text layout, and many of them just behave that way by default to simplify their implementations. Long story short, I suspect the len(s) < 2 is hit far more often than you'd think during text extraction ops...
@shartzog this is the information needed!
In a future PR we could possibly add your comment above as a comment into b_:
Automated reporting frameworks and 'document to PDF' conversion tools love to use a 'render every character one at a time' paradigm for creating a 'justified' text layout, and many of them just behave that way by default to simplify their implementation.
Referencing you in the commit message.