FEATURE: Extract Paragraphs

Open • judy opened this issue 1 year ago • 2 comments

We're using PDF::Reader at Zipline for parsing content out of PDFs. (I also forked this project on our team repo here.) We have a number of cases where we want to pull out all of the text from multi-column PDF layouts, but the visually aligned output that PDF::Reader returns from page.text was still difficult for us to parse programmatically.

I saw that borb has a paragraph extraction feature (code here), and this is my attempt to implement something similar in Ruby.

I'm leaving this in Draft until I can finish the remaining todo items below. Any feedback is appreciated.

To-do:

  • [x] Implement type checking via Sorbet
  • [x] Add tests for DisjointSet
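
For readers unfamiliar with the data structure named in the to-do list, here is a minimal sketch of how a disjoint set (union-find) can cluster text lines into paragraphs. It is not the code from this PR: the DisjointSet methods, the Line struct, the group_paragraphs helper, and the max_gap and max_indent thresholds are all hypothetical, and the sketch assumes PDF-style coordinates where y grows upward.

```ruby
# Illustrative union-find; the PR's actual DisjointSet may differ.
class DisjointSet
  def initialize
    @parent = {}
    @rank = {}
  end

  # Register an item as its own single-element set.
  def add(item)
    return if @parent.key?(item)

    @parent[item] = item
    @rank[item] = 0
  end

  # Return the representative of item's set, compressing the path on the way.
  def find(item)
    root = item
    root = @parent[root] until @parent[root] == root
    while @parent[item] != root
      next_item = @parent[item]
      @parent[item] = root
      item = next_item
    end
    root
  end

  # Merge the sets containing a and b (union by rank).
  def union(a, b)
    ra = find(a)
    rb = find(b)
    return if ra == rb

    ra, rb = rb, ra if @rank[ra] < @rank[rb]
    @parent[rb] = ra
    @rank[ra] += 1 if @rank[ra] == @rank[rb]
  end

  # All sets, each as an array of its members.
  def sets
    @parent.keys.group_by { |item| find(item) }.values
  end
end

# Hypothetical stand-in for a positioned line of text on a page.
Line = Struct.new(:text, :x, :y)

# Join lines that start near the same x (same column) and sit close together
# vertically. Both thresholds are made-up values in PDF points.
def group_paragraphs(lines, max_gap: 14.0, max_indent: 20.0)
  set = DisjointSet.new
  lines.each_index { |i| set.add(i) }

  (0...lines.size).to_a.combination(2) do |i, j|
    same_column = (lines[i].x - lines[j].x).abs <= max_indent
    adjacent = (lines[i].y - lines[j].y).abs <= max_gap
    set.union(i, j) if same_column && adjacent
  end

  set.sets.map do |indexes|
    # Higher y is nearer the top of the page, so sort descending.
    indexes.map { |i| lines[i] }.sort_by { |l| -l.y }.map(&:text).join(" ")
  end
end

# Two columns, two lines each.
lines = [
  Line.new("Left one", 50, 700), Line.new("Right one", 300, 700),
  Line.new("left two", 50, 688), Line.new("right two", 300, 688)
]
paragraphs = group_paragraphs(lines)
```

Because lines are merged pairwise, a chain of adjacent lines ends up in one set even when the first and last line are far apart, which is the property that makes union-find a natural fit for this kind of clustering.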

judy, Oct 25 '23 20:10

I deleted an earlier comment about misunderstanding a Sorbet error. All good for review now!

judy, Nov 14 '23 15:11

@yob I hope you're well! Any updates on this PR?

Thanks! :)

Cosmo, Feb 10 '24 16:02