pypdf
ENH: Addition of optional visitor-functions in extract_text()
This request adds optional visitor callbacks to extract_text(). While scanning the text objects of a page, _extract_text() calls these visitor methods, so one can analyze the operations in the page and the positions of the texts.
As an example I added a test in tests/test_page.py which extracts the texts of labels in a Figure.
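A minimal sketch of how such visitors might be used. It assumes the keyword names visitor_operand_before and visitor_text, and the five-argument text visitor that appears later in this thread. The helper collect_page_items and the FakePage stand-in are hypothetical; they exist only so the example runs without a PDF file.

```python
def collect_page_items(page):
    """Return (operators, texts) seen while extracting text from ``page``."""
    operators = []
    texts = []

    def visitor_before(op, args, cm_matrix, tm_matrix):
        # Called before each content-stream operation is processed.
        operators.append(op)

    def visitor_text(text, cm_matrix, tm_matrix, font_dict, font_size):
        # tm_matrix[4] and tm_matrix[5] hold the x/y translation of the text matrix.
        if text.strip():
            texts.append((text, tm_matrix[4], tm_matrix[5]))

    page.extract_text(
        visitor_operand_before=visitor_before, visitor_text=visitor_text
    )
    return operators, texts


class FakePage:
    """Hypothetical stand-in for a page object (assumption, not library code)."""

    def extract_text(self, visitor_operand_before=None, visitor_text=None):
        identity = [1, 0, 0, 1, 0, 0]
        if visitor_operand_before is not None:
            visitor_operand_before(b"Tj", [], identity, [1, 0, 0, 1, 72, 700])
        if visitor_text is not None:
            visitor_text("Date", identity, [1, 0, 0, 1, 72, 700], None, 9.96)
            visitor_text("  ", identity, [1, 0, 0, 1, 100, 700], None, 9.96)
        return "Date"


operators, texts = collect_page_items(FakePage())
```

With a real page object, the same collect_page_items(page) call would apply, provided the installed version accepts these keyword arguments.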
I appreciate that PyPDF2 can analyze a PDF file without external libraries :-).
With best regards Sascha Rogmann
Thank you for the contribution :heart:
I didn't expect that we could get text-tokens and their positions in the document in such a rather easy extension. Nice!
I still need to think about this PR / check if there is a performance impact and look at it from a maintenance perspective.
In the meantime, would you mind running black . ? You need pip install black; it's a code formatter that fixes all of the Flake8 issues.
I executed black, it reformatted my changes in _page.py and test_page.py :-).
@srogmann, some extra parameters passed to the visitor functions could be useful for filtering: BaseFont name and size (rescaled to the page). This would be useful for title extraction, for example.
@pubpub-zz I added the font-dictionary and the font-size in the text-visitor-function. I added the reference to the font-dictionary instead of the BaseFont because I didn't know what might be of further interest.
def print_visi(text, cm_matrix, tm_matrix, font_dict, font_size):
    if text.strip() != "":
        listTexts.append(
            PositionedText(
                text, tm_matrix[4], tm_matrix[5], font_dict, font_size
            )
        )
[...]
# Check the fonts. We check: /F2 9.96 Tf [...] [(Dat)-2(e)] TJ
textDatOfDate = listRows[0][0][0]
assert textDatOfDate.font_dict is not None
assert textDatOfDate.font_dict["/Name"] == "/F2"
assert textDatOfDate.font_dict["/BaseFont"] == "/Arial,Bold"
assert textDatOfDate.font_dict["/Encoding"] == "/WinAnsiEncoding"
assert textDatOfDate.font_size == 9.96
@pubpub-zz One could add helper classes like PositionedText to support parsing of formatted texts. I used tests/test_page.py as some kind of incubator ;-).
assert textDat.get_base_font() == "/Arial,Bold"
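A helper class of this kind could look roughly like the following. This is a sketch under assumptions, not the class from tests/test_page.py: the field names follow the constructor call shown above, and texts_at_least is a hypothetical helper illustrating the font-size filtering suggested for title extraction.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PositionedText:
    """Text fragment plus its position and font information (illustrative)."""

    text: str
    x: float
    y: float
    font_dict: Optional[dict] = None
    font_size: Optional[float] = None

    def get_base_font(self) -> Optional[str]:
        # The font dictionary may be missing, so guard against None.
        if self.font_dict is None:
            return None
        return self.font_dict.get("/BaseFont")


def texts_at_least(texts: List[PositionedText], min_font_size: float):
    """Keep fragments whose font size is at least ``min_font_size``."""
    return [
        t for t in texts
        if t.font_size is not None and t.font_size >= min_font_size
    ]


font = {"/Name": "/F2", "/BaseFont": "/Arial,Bold"}
fragments = [
    PositionedText("Date", 72.0, 700.0, font, 9.96),
    PositionedText("Report", 72.0, 760.0, font, 18.0),
]
titles = texts_at_least(fragments, 12.0)
```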
@srogmann What is the state of this PR? Do you need help to resolve the merge conflicts / the failing test?
Besides those, is the PR ready in your opinion?
@MartinThoma In my opinion the PR was ready 22 days ago. I will have a look at the current state.
I'm sorry for the delay; I thought there still was something to be done :see_no_evil:
@MartinThoma I merged and resolved conflicts. The PR should work. You may have a look at it.
Each change of the output-result in _extract_text requires a visitor-call in _extract_text:
    if visitor_text is not None:
        visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)
There are quite a lot of variables describing the current state, cmap[3] contains the font-dictionary. Perhaps in future there will be one object describing the state.
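As an illustration of what "one object describing the state" might mean, here is a hypothetical sketch. TextState and emit_text are invented names for this example and are not part of the library; the visitor is still called with the five arguments shown above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TextState:
    """Hypothetical bundle of the variables passed to the text visitor."""

    cm_matrix: List[float]
    tm_matrix: List[float]
    font_dict: Optional[dict]
    font_size: float


def emit_text(text: str, state: TextState,
              visitor_text: Optional[Callable] = None) -> str:
    """Report ``text`` to the visitor, if one was supplied, then return it."""
    if visitor_text is not None:
        visitor_text(text, state.cm_matrix, state.tm_matrix,
                     state.font_dict, state.font_size)
    return text


seen = []
state = TextState([1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 72, 700], None, 9.96)
emit_text("Date", state, lambda *args: seen.append(args))
```

Passing the state explicitly like this keeps the call site to a single line wherever the output result changes.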
In tests/test_page.py there are some functions which might be of interest in PyPDF2 or pdfly in future to support using a visitor. For example one might try to create a svg file:
def exportSvgFile(listTexts, listRects, fileName):
    import svgwrite

    dwg = svgwrite.Drawing(fileName, profile="tiny")
    color = svgwrite.rgb(255, 0, 0, "%")
    for r in listRects:
        dwg.add(dwg.rect((r.x, r.y), (r.w, r.h), stroke=color, fill_opacity=0.05))
    for t in listTexts:
        dwg.add(dwg.text(t.text, insert=(t.x, t.y), fill="blue"))
    dwg.save()
@MartinThoma I added DictionaryObject in cmaps (my last commit-comment is wrong).
Codecov Report
Base: 94.53% // Head: 94.10% // Decreases project coverage by -0.43% :warning:
Coverage data is based on head (1969c9f) compared to base (2845c6d). Patch coverage: 35.13% of modified lines in pull request are covered.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1252      +/-   ##
==========================================
- Coverage   94.53%   94.10%   -0.44%
==========================================
  Files          28       28
  Lines        5035     5068      +33
  Branches     1035     1051      +16
==========================================
+ Hits         4760     4769       +9
- Misses        165      177      +12
- Partials      110      122      +12
Impacted Files | Coverage Δ | |
---|---|---|
PyPDF2/_cmap.py | 95.08% <ø> (ø) | |
PyPDF2/_page.py | 91.67% <35.13%> (-3.46%) | :arrow_down: |
@srogmann Thank you for your contribution! If you want, I can add you to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)
There are quite a lot of variables describing the current state, cmap[3] contains the font-dictionary. Perhaps in future there will be one object describing the state.
Yes, I think that would make sense. Also getting rid of the use of global / non-local variables and passing data explicitly around might help.
@pubpub-zz Did you have a look? What do you think about the changes?
If you're good with them as well, I would merge + release :-)
This sounds good. I had some requests earlier that have been fulfilled. I agree that it is time to release it for user feedback.
@srogmann Very nice work :partying_face:
I think this has a lot of potential, but it's also pretty hard to use. We will need to add documentation and typical examples. I'm super excited to see how people will use it :tada:
@MartinThoma Thanks for merging!
An addition to CONTRIBUTORS.html would be fine.
I think this has a lot of potential, but it's also pretty hard to use. We will need to add documentation and typical examples.
In tests/test_page.py there are two util-classes, PositionedText and Rectangle. After renaming, they might be useful when one wants to write their own visitor. Documentation and typical examples would be nice. But docs/user/ is contained in the repository, too, so I can think about another pull-request containing some documentation and examples (e.g. the util-classes mentioned and a sample to extract tables or to ignore page-headers).
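A sample for ignoring page headers might look like this. It is a sketch, assuming the five-argument text visitor from this PR; make_body_visitor and the thresholds 50 and 720 are hypothetical and would have to be adapted to the actual page layout. For simplicity it looks only at the text matrix and ignores the current transformation matrix.

```python
def make_body_visitor(parts, y_min, y_max):
    """Build a text visitor that keeps only text with y_min < y < y_max."""

    def visitor_text(text, cm_matrix, tm_matrix, font_dict, font_size):
        y = tm_matrix[5]  # vertical translation of the text matrix
        if y_min < y < y_max:
            parts.append(text)

    return visitor_text


parts = []
visitor = make_body_visitor(parts, 50, 720)

# With a real page one would call: page.extract_text(visitor_text=visitor)
# Here the visitor is invoked by hand with made-up positions.
identity = [1, 0, 0, 1, 0, 0]
visitor("Page header", identity, [1, 0, 0, 1, 72, 760], None, 9.0)
visitor("Body text", identity, [1, 0, 0, 1, 72, 400], None, 11.0)
visitor("Footer 3", identity, [1, 0, 0, 1, 72, 30], None, 9.0)
body = "".join(parts)
```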