pdftotree
Enhancement using pdf-to-svg to get underlined and struck-out text formatting
I have had good results converting a pdf to a series of svg files (scalable vector graphics, an xml format) with the open source tool mupdf. I then use an xml parser (e.g. Beautiful Soup) to combine all of the text, text formatting, text position, page metadata, and document metadata into a pandas DataFrame.
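A minimal sketch of the text-extraction step, not the actual code. It uses the standard-library `xml.etree.ElementTree` in place of Beautiful Soup so it runs without extra packages. The sample svg is hand-written; the markup mupdf really emits depends on version and options, so the `<text>`/`<tspan>` layout, the attribute names, and the `svg_text_rows` helper are all assumptions.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hand-written stand-in for one page of svg output (assumed structure).
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="612" height="792">
  <text font-family="Times" font-size="12" font-weight="bold">
    <tspan x="72" y="100">ARTICLE I</tspan>
  </text>
  <text font-family="Times" font-size="10">
    <tspan x="72 78 84" y="120">This text is plain.</tspan>
  </text>
</svg>"""


def svg_text_rows(svg_source, page_number):
    """Return one dict per <tspan>, holding its text, position and font info."""
    root = ET.fromstring(svg_source)
    rows = []
    for text_el in root.iter(SVG_NS + "text"):
        for tspan in text_el.iter(SVG_NS + "tspan"):
            # Attributes on the <tspan> override those on the parent <text>.
            attrs = {**text_el.attrib, **tspan.attrib}
            rows.append({
                "page": page_number,
                "text": (tspan.text or "").strip(),
                # x/y may hold one value per glyph; keep the first one.
                "x": float(attrs.get("x", "0").split()[0]),
                "y": float(attrs.get("y", "0").split()[0]),
                "font_family": attrs.get("font-family", ""),
                # Assumes a unitless font size such as "12".
                "font_size": float(attrs.get("font-size", "0")),
                "font_weight": attrs.get("font-weight", "normal"),
            })
    return rows


rows = svg_text_rows(SAMPLE_SVG, page_number=1)
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(rows)` to get one row per text span.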
I can create an additional pandas DataFrame with the page coordinates of each <path> element from the svg file and combine it with the text DataFrame in such a way that I can identify and tag the specific text that was either struck-out or underlined, which is a critical feature in my use case.
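To illustrate the matching idea, here is a rough sketch in plain Python rather than pandas/numpy. It assumes each piece of text has already been reduced to a horizontal extent, a baseline and a font size, and each relevant path to a horizontal segment. The `TextSpan` and `HLine` types, the `classify` and `tag_spans` helpers, and every threshold are hypothetical and would need tuning against real documents.

```python
from dataclasses import dataclass


@dataclass
class TextSpan:
    text: str
    x0: float
    x1: float
    baseline: float   # y of the text baseline (svg y grows downward)
    font_size: float


@dataclass
class HLine:
    x0: float
    x1: float
    y: float


def classify(span, line, min_overlap=0.5):
    """Return 'underline', 'strikeout' or None for one span/line pair."""
    overlap = min(span.x1, line.x1) - max(span.x0, line.x0)
    width = span.x1 - span.x0
    if width <= 0 or overlap / width < min_overlap:
        return None
    # Height of the line above the baseline, as a fraction of the font size.
    rel = (span.baseline - line.y) / span.font_size
    if 0.15 <= rel <= 0.6:      # assumed band: through the middle of the glyphs
        return "strikeout"
    if -0.3 <= rel < 0.15:      # assumed band: at or just below the baseline
        return "underline"
    return None


def tag_spans(spans, lines):
    """Return, for each span, the set of decorations found on it."""
    tags = []
    for span in spans:
        found = set()
        for line in lines:
            kind = classify(span, line)
            if kind is not None:
                found.add(kind)
        tags.append(found)
    return tags


spans = [
    TextSpan("deleted", 10, 60, 100, 12),
    TextSpan("added", 70, 110, 100, 12),
    TextSpan("plain", 120, 160, 100, 12),
]
lines = [HLine(10, 60, 96), HLine(70, 110, 102)]
tags = tag_spans(spans, lines)
```

The nested loop compares every span with every line, which is fine for a sketch; a vectorized version would do the same comparisons on whole coordinate arrays at once.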
Using numpy to optimize many of these operations, I can generate the final DataFrame for a 150-page all-text pdf with abundant underlines and strike-outs in about one second on a consumer laptop.
It is then relatively straightforward to construct text-formatting features and base a document hierarchy on them.
Combinations of the following text-formatting features can be used to deduce document hierarchy:
- Case: UPPER > Title Case > Sentence > lower
- Font Size: Large > Small
- Font Weight: Bold > Italic > Normal
- Underline: Underline > No Underline
- Line Spacing: Large > Small
- Alignment: Centered > Left
- Indentation: No Indent > Indent
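One way the rankings above could be combined is to turn each line's formatting into a sortable tuple and assign hierarchy levels by distinct tuple value. This is only a sketch: the `Line` fields, the `text_case` heuristic, and especially the priority order of features inside the tuple (font size first, indentation last) are assumptions, not something stated above.

```python
from dataclasses import dataclass

CASE_RANK = {"upper": 3, "title": 2, "sentence": 1, "lower": 0}
WEIGHT_RANK = {"bold": 2, "italic": 1, "normal": 0}


@dataclass
class Line:
    text: str
    font_size: float
    weight: str          # "bold", "italic" or "normal"
    underlined: bool
    centered: bool
    indented: bool
    space_before: float  # vertical gap above the line


def text_case(text):
    """Crude case detection: upper, title, sentence or lower."""
    if text.isupper():
        return "upper"
    if text.islower() or not any(c.isalpha() for c in text):
        return "lower"
    words = [w for w in text.split() if w[0].isalpha()]
    if len(words) > 1 and all(w[0].isupper() for w in words):
        return "title"
    return "sentence"


def prominence(line):
    """Larger tuples mean more heading-like formatting (assumed ordering)."""
    return (
        line.font_size,
        CASE_RANK[text_case(line.text)],
        WEIGHT_RANK[line.weight],
        int(line.underlined),
        line.space_before,
        int(line.centered),
        int(not line.indented),
    )


def assign_levels(lines):
    """Level 0 is the most prominent style; equal styles share a level."""
    keys = [prominence(line) for line in lines]
    order = sorted(set(keys), reverse=True)
    return [order.index(key) for key in keys]


doc = [
    Line("ARTICLE I", 14, "bold", False, True, False, 24),
    Line("Section 1. Scope", 12, "bold", True, False, False, 12),
    Line("This agreement covers the following items.", 10, "normal", False, False, True, 4),
    Line("ARTICLE II", 14, "bold", False, True, False, 24),
]
levels = assign_levels(doc)
```

Comparing tuples lexicographically makes the first feature dominate; a weighted score or a clustering step would be alternatives if no single feature should always win.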
See https://github.com/HazyResearch/fonduer/issues/111#issue-352528462 for an example pdf.
The mutool draw command described there converts the pdf to html, but the html output misses the abundant strikeouts and underlines: all of the text that is clearly struck-out is shown as regularly formatted text.
I'm interested in your method to convert from a mutool-generated SVG to Fonduer's data model. Are you able to release this code? Thanks!
Let me talk to some people. In the meantime, I could provide some pseudocode outlining the whole process (in more detail than what's above).
Thanks, pseudocode or a bit more detail than above would be helpful! My email is [email protected] if you prefer to chat over email.