pypdf icon indicating copy to clipboard operation
pypdf copied to clipboard

Functionality of b_

Open j-t-1 opened this issue 1 year ago • 4 comments

B_CACHE: Dict[Union[str, bytes], bytes] = {}


def b_(s: Union[str, bytes]) -> bytes:
    if isinstance(s, bytes):
        return s
    bc = B_CACHE
    if s in bc:
        return bc[s]
    try:
        r = s.encode("latin-1")
        if len(s) < 2:
            bc[s] = r
        return r
    except Exception:
        r = s.encode("utf-8")
        if len(s) < 2:
            bc[s] = r
        return r

It is non-obvious why we are only caching strings of length zero (the empty string) and of length one.

There is a need to add comments to this function.

j-t-1 avatar Jun 27 '24 08:06 j-t-1

I'm not sure this is at will. you may found a way to improve performances...

pubpub-zz avatar Jun 27 '24 17:06 pubpub-zz

This has been added/changed in #124 by @mozbugbox.

stefan6419846 avatar Jun 28 '24 07:06 stefan6419846

This has been added/changed in #124 by @mozbugbox.

The first of the commit messages is "Speedup b_() for short string for Python3", which suggests the current functionality is deliberate.

The code tries latin-1 and if there is an exception uses utf-8. Given the rise of utf-8, I think we should just use utf-8.

j-t-1 avatar Jun 28 '24 12:06 j-t-1

@j-t-1 I recommend you to further dig into the commit history if unsure about specific cases - the latin-1 workaround has been required for some files. Doing a "blame" shows you that the corresponding change is from #463 and there already has been a request for an example file about two months ago: https://github.com/py-pdf/pypdf/pull/463#issuecomment-2067587535

stefan6419846 avatar Jun 28 '24 12:06 stefan6419846

@pubpub-zz
Line 226 of test_generic.py: # to test latin-1 aka stdencoding

Is this equivalent to this (and if so we could change it)? # to test PDFDocEncoding (latin-1)

j-t-1 avatar Jul 03 '24 10:07 j-t-1

Aka =also known as Your rewording is correct

pubpub-zz avatar Jul 03 '24 14:07 pubpub-zz

@j-t-1 what should we do about his issue ?

pubpub-zz avatar Jul 18 '24 19:07 pubpub-zz

Each time through the function we have an if statement to decide if a string of length zero or one is in the cache. If it is not, we have another if statement deciding whether len(s) < 2. If we remove the cache we would remove at least one if statement every time b_ is invoked.

Keeping it as is, only make sense if:

  1. A lot of calls to b_ satisfy len(s) < 2
  2. s.encode("latin-1") is an expensive operation compared with an if statement and cache retrieval of a string of length 0 or 1.

To gauge 1. depends on particular PDF and use case because b_ is used in a few places. How would we test 2.?

dis.dis("".encode)
TypeError: don't know how to disassemble builtin_function_or_method objects

This comment suggests we keep b_ as is.

j-t-1 avatar Jul 22 '24 09:07 j-t-1

With regard to 1: In my experience, automated reporting frameworks and 'document to PDF' conversion tools LOVE to use a 'render every character one at a time' paradigm for creating a 'justified' text layout, and many of them just behave that way by default to simplify their implementations. Long story short, I suspect the len(s) < 2 is hit far more often than you'd think during text extraction ops...

shartzog avatar Jul 23 '24 02:07 shartzog

@shartzog this is the information needed!

In a future PR we could possibly add your comment above as a comment into b_:

Automated reporting frameworks and 'document to PDF' conversion tools love to use a 'render every character one at a time' paradigm for creating a 'justified' text layout, and many of them just behave that way by default to simplify their implementation.

Referencing you in the commit message.

j-t-1 avatar Jul 24 '24 08:07 j-t-1