pdf-reader
FEATURE: Extract Paragraphs
We're using PDF::Reader at Zipline for parsing content out of PDFs. (I also forked this project on our team repo here.) We have a number of cases where we want to pull out all of the text from multi-column PDF layouts, but PDF::Reader's visually-aligned output via page.text was still difficult for us to programmatically parse.
I saw that borb has a paragraph extraction feature (code here), and this is my attempt to implement something similar in Ruby.
I'm leaving this in Draft until I can finish the remaining to-do items below. Any feedback is appreciated.
To-do:
- [x] Implement type checking via Sorbet
- [x] Add tests for DisjointSet
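For reviewers unfamiliar with the idea, here is a rough sketch of how a disjoint-set (union-find) structure could group lines of text into paragraphs. This is not the code in this PR: `SketchDisjointSet`, `Line`, `group_paragraphs`, and the `max_gap` / `max_indent` thresholds are all hypothetical names and values chosen for illustration, and the proximity rule is a deliberately simple assumption.

```ruby
# Hypothetical sketch only: these names are illustrative, not this PR's API.
class SketchDisjointSet
  def initialize
    @parent = {}
  end

  # Register an item as its own single-member set.
  def add(item)
    @parent[item] = item unless @parent.key?(item)
    self
  end

  # Return the representative of the set containing item.
  # Raises KeyError if the item was never added.
  def find(item)
    root = item
    root = @parent.fetch(root) until @parent.fetch(root) == root
    # Path compression: point every node on the walked path at the root.
    until @parent.fetch(item) == root
      next_item = @parent.fetch(item)
      @parent[item] = root
      item = next_item
    end
    root
  end

  # Merge the sets containing a and b.
  def union(a, b)
    root_a = find(a)
    root_b = find(b)
    @parent[root_b] = root_a unless root_a == root_b
    self
  end

  # All sets, each as an array of its members.
  def sets
    @parent.keys.group_by { |item| find(item) }.values
  end
end

# Assumed toy input: a line of text with the x/y position of its start.
# y decreases as you move down the page, as in PDF user space.
Line = Struct.new(:text, :x, :y)

# Join lines that start at a similar x (same column) and sit within
# max_gap points of each other vertically. Both thresholds are made up.
def group_paragraphs(lines, max_gap: 14, max_indent: 20)
  set = SketchDisjointSet.new
  lines.each { |line| set.add(line) }
  lines.combination(2) do |a, b|
    same_column = (a.x - b.x).abs <= max_indent
    close_vertically = (a.y - b.y).abs <= max_gap
    set.union(a, b) if same_column && close_vertically
  end
  set.sets.map { |group| group.sort_by { |l| -l.y }.map(&:text).join(" ") }
end

lines = [
  Line.new("Left column, first line", 50, 700),
  Line.new("left column, second line.", 50, 688),
  Line.new("Right column, first line", 320, 700),
  Line.new("right column, second line.", 320, 688)
]
paragraphs = group_paragraphs(lines)
```

With this toy input, the two left-column lines end up in one paragraph and the two right-column lines in another, instead of being interleaved row by row. A real implementation would need a more careful adjacency test (font size, line height, indentation) than the fixed thresholds assumed here.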
I deleted an earlier comment about misunderstanding a Sorbet error. All good for review now!
@yob I hope you're well! Any updates on this PR?
Thanks! :)