pdftotree

Enhancement using pdf-to-svg to get underlined and struck-out text formatting

Open clayms opened this issue 6 years ago • 4 comments

I have had good results converting a PDF to a series of SVG (Scalable Vector Graphics, an XML format) files with the open source tool mupdf. I then use an XML parser (e.g. Beautiful Soup) to combine all of the text, text formatting, text position, page metadata, and document metadata into a pandas DataFrame.
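A minimal sketch of the parsing step, assuming the SVG exposes text as `<text>` elements with `x`, `y`, and font attributes. The sample SVG is hand-written for illustration, real mupdf output may be structured differently, and `extract_text_rows` is a hypothetical helper name. It uses the standard library's `xml.etree.ElementTree` instead of Beautiful Soup so it runs without extra dependencies; the resulting list of dicts could be passed to `pandas.DataFrame`.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hand-written sample for illustration; real mupdf output may differ.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="612" height="792">
  <text font-family="Times" font-size="12" font-weight="bold" x="72" y="100">SECTION 1</text>
  <text font-family="Times" font-size="10" x="72" y="120">Old wording</text>
  <path d="M 72 117 H 130" stroke="black"/>
</svg>"""


def extract_text_rows(svg_source, page):
    """Return one dict per <text> element: content, position, and font attributes."""
    root = ET.fromstring(svg_source)
    rows = []
    for el in root.iter(SVG_NS + "text"):
        rows.append({
            "page": page,
            "text": "".join(el.itertext()).strip(),
            "x": float(el.get("x", "0")),
            "y": float(el.get("y", "0")),
            "font_family": el.get("font-family"),
            "font_size": float(el.get("font-size", "0")),
            # Assumption: a missing font-weight attribute means normal weight.
            "font_weight": el.get("font-weight", "normal"),
        })
    return rows


rows = extract_text_rows(SAMPLE_SVG, page=1)
for row in rows:
    print(row)
```

Each row keeps the page number alongside the position, so rows from many per-page SVG files can be concatenated into one table.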

I can create an additional pandas DataFrame with the page coordinates of each <path> from the SVG file and combine it with the text DataFrame in such a way that I can identify and tag the specific text that was either struck out or underlined, a critical feature in my use case.

Using numpy to optimize many of these operations, I can generate the final DataFrame for a 150-page all-text PDF with abundant text underlines and strike-outs in about one second on a consumer laptop.
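As a rough sketch of how matching path coordinates to text could be vectorized with numpy broadcasting: every text span is compared against every horizontal line segment at once. The array layouts, the `tag_spans` name, and the offset thresholds (fractions of the font size) are all assumptions for illustration, not the values used in the actual code.

```python
import numpy as np

# Hypothetical text spans: left x, right x, baseline y, font size.
text = np.array([
    [72.0, 130.0, 100.0, 10.0],  # span 0
    [72.0, 130.0, 120.0, 10.0],  # span 1
    [72.0, 130.0, 140.0, 10.0],  # span 2
])
# Hypothetical horizontal segments taken from <path> elements: left x, right x, y.
lines = np.array([
    [72.0, 130.0, 101.5],  # just below the baseline of span 0
    [72.0, 130.0, 117.0],  # through the middle of span 1
])


def tag_spans(text, lines):
    """Return boolean arrays (underlined, struck), one entry per text span."""
    # Column vectors for text, row vectors for lines: comparisons broadcast
    # to an (n_text, n_lines) grid without Python loops.
    tx0, tx1, ty, size = (text[:, i][:, None] for i in range(4))
    lx0, lx1, ly = (lines[:, i][None, :] for i in range(3))
    overlap = np.minimum(tx1, lx1) - np.maximum(tx0, lx0) > 0
    # SVG y grows downward, so a positive offset means below the baseline.
    offset = (ly - ty) / size
    # Assumed thresholds, expressed as fractions of the font size.
    underlined = (overlap & (offset >= 0.0) & (offset <= 0.25)).any(axis=1)
    struck = (overlap & (offset <= -0.15) & (offset >= -0.6)).any(axis=1)
    return underlined, struck


underlined, struck = tag_spans(text, lines)
print(underlined.tolist(), struck.tolist())
```

The two boolean arrays can be attached to the text table as extra columns. For very large pages the full grid may be too big, and grouping by page first keeps it small.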

It is then relatively straightforward to construct text-formatting features and base a document hierarchy on them.

Combinations of the following text-formatting features can be used to deduce document hierarchy:

  • Case: UPPER > Title Case > Sentence > lower
  • Font Size: Large > Small
  • Font Weight: Bold > Italic > Normal
  • Underline: Underline > No Underline
  • Line Spacing: Large > Small
  • Alignment: Centered > Left
  • Indentation: No Indent > Indent
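One possible way to combine some of the features above, sketched under stated assumptions: each span gets a tuple of ranks, and distinct tuples are sorted to assign heading levels. The precedence (case, then font size, then weight, then underline) simply follows the list order and is an assumption, as are the names `case_rank`, `style_key`, and `assign_levels`. Line spacing, alignment, and indentation are left out for brevity.

```python
WEIGHT_RANK = {"bold": 2, "italic": 1, "normal": 0}


def case_rank(text):
    """UPPER = 3, Title Case = 2, Sentence case = 1, lower = 0."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0
    if text.isupper():
        return 3
    if text.istitle():
        return 2
    if letters[0].isupper():
        return 1
    return 0


def style_key(span):
    """Tuple of ranks; a larger tuple means higher in the hierarchy (assumed order)."""
    return (
        case_rank(span["text"]),
        span["font_size"],
        WEIGHT_RANK.get(span["font_weight"], 0),
        int(span["underlined"]),
    )


def assign_levels(spans):
    """Map each span to a level, 0 being the top of the hierarchy."""
    ordered = sorted({style_key(s) for s in spans}, reverse=True)
    level_of = {key: level for level, key in enumerate(ordered)}
    return [level_of[style_key(s)] for s in spans]


# Hypothetical spans for illustration.
spans = [
    {"text": "SECTION 1", "font_size": 14.0, "font_weight": "bold", "underlined": False},
    {"text": "Scope Of Work", "font_size": 12.0, "font_weight": "bold", "underlined": True},
    {"text": "The contractor shall deliver.", "font_size": 10.0, "font_weight": "normal", "underlined": False},
    {"text": "The owner shall pay.", "font_size": 10.0, "font_weight": "normal", "underlined": False},
]
levels = assign_levels(spans)
print(levels)
```

Spans with identical formatting share a level, so body paragraphs collapse into the lowest level while each distinct heading style gets its own.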

clayms avatar Aug 21 '18 14:08 clayms

See https://github.com/HazyResearch/fonduer/issues/111#issue-352528462 for an example PDF. The mutool draw command described there converts the PDF to HTML, but it also misses the abundant strike-outs and underlines: all of the text that is clearly struck out appears as regularly formatted text in the HTML output.

clayms avatar Aug 22 '18 15:08 clayms

I'm interested in your method to convert from a mutool-generated SVG to Fonduer's data model. Are you able to release this code? Thanks!

jbecke avatar Aug 26 '19 20:08 jbecke

Let me talk to some people. In the meantime, I could provide some pseudocode outlining the whole process (in more detail than what's above).

clayms avatar Aug 29 '19 20:08 clayms

Thanks, pseudocode or a bit more detail than above would be helpful! My email is [email protected] if you prefer to chat over email.

jbecke avatar Aug 29 '19 23:08 jbecke