ENH: Addition of optional visitor-functions in extract_text()

Open srogmann opened this issue 1 year ago • 10 comments

This request adds optional visitor callbacks to extract_text(). While scanning the text objects of a page, _extract_text() calls these visitor functions, so one can analyze the operations in the page and the positions of the texts.

As an example I added a test in tests/test_page.py which extracts the texts of labels in a Figure.

I appreciate that PyPDF2 can analyze a PDF file without external libraries :-).

With best regards Sascha Rogmann

Aug 18 '22 21:08 srogmann

Thank you for the contribution :heart:

I didn't expect that we could get text tokens and their positions in the document with such a simple extension. Nice!

I still need to think about this PR / check if there is a performance impact and look at it from a maintenance perspective.

In the meantime, would you mind running black .? You need pip install black; it's a code-formatter that fixes all of the Flake8 issues.

Aug 19 '22 05:08 MartinThoma

I executed black, it reformatted my changes in _page.py and test_page.py :-).

Aug 19 '22 11:08 srogmann

@srogmann, some extra parameters passed to the visitor functions could be useful for filtering: BaseFont name and size (rescaled to the page). This would be useful for title extraction, for example.

Aug 21 '22 07:08 pubpub-zz

@pubpub-zz I added the font-dictionary and the font-size in the text-visitor-function. I added the reference to the font-dictionary instead of the BaseFont because I didn't know what might be of further interest.

    def print_visi(text, cm_matrix, tm_matrix, font_dict, font_size):
        if text.strip() != "":
            listTexts.append(
                PositionedText(
                    text, tm_matrix[4], tm_matrix[5], font_dict, font_size
                )
            )

[...]

    # Check the fonts. We check: /F2 9.96 Tf [...] [(Dat)-2(e)] TJ
    textDatOfDate = listRows[0][0][0]
    assert textDatOfDate.font_dict is not None
    assert textDatOfDate.font_dict["/Name"] == "/F2"
    assert textDatOfDate.font_dict["/BaseFont"] == "/Arial,Bold"
    assert textDatOfDate.font_dict["/Encoding"] == "/WinAnsiEncoding"
    assert textDatOfDate.font_size == 9.96

Aug 22 '22 23:08 srogmann

@pubpub-zz One could add helper classes like PositionedText to support parsing of formatted texts. I used tests/test_page.py as some kind of incubator ;-).

    assert textDat.get_base_font() == "/Arial,Bold"
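A helper along these lines could look roughly like the following. This is only a sketch: the five-argument visitor signature is taken from the snippet earlier in this thread, while the dataclass layout and the `make_collector` function are hypothetical and not the code in tests/test_page.py. The visitor is called by hand here with made-up matrices, so no PDF file is needed to try it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PositionedText:
    """Hypothetical helper: one text fragment plus where and how it was drawn."""

    text: str
    x: float
    y: float
    font_dict: Optional[dict]
    font_size: float

    def get_base_font(self) -> Optional[str]:
        # The font dictionary or its /BaseFont entry may be missing.
        if self.font_dict is None:
            return None
        return self.font_dict.get("/BaseFont")


def make_collector():
    """Return a list and a visitor function that appends to that list."""
    collected: List[PositionedText] = []

    def visitor_text(text, cm_matrix, tm_matrix, font_dict, font_size):
        # Skip fragments that contain only whitespace.
        if text.strip() != "":
            collected.append(
                PositionedText(text, tm_matrix[4], tm_matrix[5], font_dict, font_size)
            )

    return collected, visitor_text


# Simulated visitor calls with made-up values (assumption: no real PDF involved).
collected, visitor_text = make_collector()
identity = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
visitor_text(
    "Date", identity, [1.0, 0.0, 0.0, 1.0, 72.0, 700.0], {"/BaseFont": "/Arial,Bold"}, 9.96
)
visitor_text("   ", identity, identity, None, 9.96)
```

With a real page, the same `visitor_text` function would presumably be handed to the text extraction call instead of being invoked by hand.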

Aug 23 '22 11:08 srogmann

@srogmann What is the state of this PR? Do you need help to resolve the merge conflicts / the failing test?

Besides those, is the PR ready in your opinion?

Sep 14 '22 04:09 MartinThoma

@MartinThoma In my opinion the PR was ready 22 days ago. I will have a look at the current state.

Sep 14 '22 10:09 srogmann

I'm sorry for the delay; I thought there still was something to be done :see_no_evil:

Sep 14 '22 12:09 MartinThoma

@MartinThoma I merged and resolved conflicts. The PR should work. You may have a look at it.

Each change of the output-result in _extract_text requires a visitor-call in _extract_text:

                    if visitor_text is not None:
                        visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)

There are quite a lot of variables describing the current state, cmap[3] contains the font-dictionary. Perhaps in future there will be one object describing the state.
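To illustrate what "one object describing the state" might mean, here is a rough sketch. Everything in it is hypothetical: `TextState` and `emit_text` are names invented for this example and are not part of PyPDF2, and the two-argument visitor differs from the five-argument one used in this PR.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class TextState:
    """Hypothetical bundle of the values that are currently passed one by one."""

    cm_matrix: Sequence[float]
    tm_matrix: Sequence[float]
    font_dict: Optional[dict]
    font_size: float


def emit_text(
    text: str,
    state: TextState,
    output: List[str],
    visitor_text: Optional[Callable[[str, TextState], None]] = None,
) -> None:
    """Append text to the output and notify the optional visitor."""
    output.append(text)
    if visitor_text is not None:
        visitor_text(text, state)


# Made-up values for demonstration only.
seen = []
output: List[str] = []
state = TextState(
    cm_matrix=[1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    tm_matrix=[1.0, 0.0, 0.0, 1.0, 10.0, 20.0],
    font_dict={"/BaseFont": "/Helvetica"},
    font_size=12.0,
)
emit_text(
    "Hello", state, output,
    lambda text, st: seen.append((text, st.tm_matrix[4], st.tm_matrix[5])),
)
emit_text(" world", state, output)  # no visitor given, output is still extended
```

Routing every output change through one function like `emit_text` would keep the "output changed, so call the visitor" rule in a single place.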

In tests/test_page.py there are some functions which might be of interest for PyPDF2 or pdfly in the future to support using a visitor. For example, one might try to create an SVG file:

    def exportSvgFile(listTexts, listRects, fileName):
        import svgwrite

        dwg = svgwrite.Drawing(fileName, profile="tiny")
        color = svgwrite.rgb(255, 0, 0, "%")
        for r in listRects:
            dwg.add(dwg.rect((r.x, r.y), (r.w, r.h), stroke=color, fill_opacity=0.05))
        for t in listTexts:
            dwg.add(dwg.text(t.text, insert=(t.x, t.y), fill="blue"))
        dwg.save()

Sep 14 '22 20:09 srogmann

@MartinThoma I added DictionaryObject in cmaps (my last commit-comment is wrong).

Sep 24 '22 19:09 srogmann

Codecov Report

Base: 94.53% // Head: 94.10% // Decreases project coverage by -0.43% :warning:

Coverage data is based on head (1969c9f) compared to base (2845c6d). Patch coverage: 35.13% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1252      +/-   ##
==========================================
- Coverage   94.53%   94.10%   -0.44%     
==========================================
  Files          28       28              
  Lines        5035     5068      +33     
  Branches     1035     1051      +16     
==========================================
+ Hits         4760     4769       +9     
- Misses        165      177      +12     
- Partials      110      122      +12

Impacted Files     Coverage Δ
PyPDF2/_cmap.py    95.08% <ø> (ø)
PyPDF2/_page.py    91.67% <35.13%> (-3.46%) :arrow_down:


Sep 24 '22 19:09 codecov[bot]

@srogmann Thank you for your contribution! If you want, I can add you to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

Sep 25 '22 06:09 MartinThoma

> There are quite a lot of variables describing the current state, cmap[3] contains the font-dictionary. Perhaps in future there will be one object describing the state.

Yes, I think that would make sense. Also getting rid of the use of global / non-local variables and passing data explicitly around might help.

Sep 25 '22 06:09 MartinThoma

@pubpub-zz Did you have a look? What do you think about the changes?

If you're good with them as well, I would merge + release :-)

Sep 25 '22 06:09 MartinThoma

This sounds good. I had some requests earlier that have been fulfilled. I agree that it is time to release it to get user feedback.

Sep 25 '22 07:09 pubpub-zz

@srogmann Very nice work :partying_face:

I think this has a lot of potential, but it's also pretty hard to use. We will need to add documentation and typical examples. I'm super excited to see how people will use it :tada:

Sep 25 '22 08:09 MartinThoma

@MartinThoma Thanks for merging!

An addition to CONTRIBUTORS.html would be fine.

> I think this has a lot of potential, but it's also pretty hard to use. We will need to add documentation and typical examples.

In tests/test_page.py there are two util classes, PositionedText and Rectangle. After renaming, they might be useful when one wants to write one's own visitor. Documentation and typical examples would be nice. Since docs/user/ is part of the repository too, I can think about another pull request containing some documentation and examples (e.g. the util classes mentioned and a sample to extract tables or to ignore page headers).
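A sample to ignore page headers could be sketched like this. It assumes the five-argument visitor signature from this PR and, as a simplification, treats tm_matrix[5] as the vertical position on the page (which only holds when the current transformation matrix does not move or scale the text). The function name `make_body_visitor` and the y limits are made up for this example, and the visitor is called by hand instead of by a real page.

```python
from typing import List


def make_body_visitor(y_min: float, y_max: float):
    """Return a list and a visitor that keeps only text between y_min and y_max."""
    parts: List[str] = []

    def visitor_text(text, cm_matrix, tm_matrix, font_dict, font_size):
        y = tm_matrix[5]  # assumption: text-matrix y is close to the page position
        if y_min < y < y_max:
            parts.append(text)

    return parts, visitor_text


# Simulated calls with made-up positions (assumption: no real PDF involved).
parts, visitor_text = make_body_visitor(y_min=50, y_max=720)
identity = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
visitor_text("Page header\n", identity, [1.0, 0.0, 0.0, 1.0, 72.0, 760.0], None, 10.0)
visitor_text("Body text\n", identity, [1.0, 0.0, 0.0, 1.0, 72.0, 400.0], None, 10.0)
visitor_text("Footer 3\n", identity, [1.0, 0.0, 0.0, 1.0, 72.0, 30.0], None, 10.0)
body = "".join(parts)
```

The limits would have to be chosen per document, since header and footer positions depend on the page layout.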

Sep 25 '22 13:09 srogmann
