pdf-reader FEATURE: Extract Paragraphs

Open judy opened this issue 1 year ago • 2 comments

We're using PDF::Reader at Zipline to parse content out of PDFs. (I also forked this project on our team repo here.) In a number of cases we want to pull all of the text out of multi-column PDF layouts, but PDF::Reader's visually aligned output from page.text is still difficult for us to parse programmatically.

I saw that borb has a paragraph extraction feature (code here), and this is my attempt to implement something similar in Ruby.

I'm leaving this in Draft until I can finish the remaining to-do items below. Any feedback is appreciated.

To-do:

[x] Implement type checking via Sorbet
[x] Add tests for DisjointSet
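
For readers unfamiliar with the DisjointSet mentioned above: a disjoint set (union-find) is one common way to cluster nearby text lines into paragraphs. The sketch below is only an illustration of that general idea, not the code in this PR and not part of PDF::Reader's API. The names SimpleDisjointSet, Line, and group_into_paragraphs, and the thresholds max_gap_factor and x_tolerance, are all hypothetical; the sketch assumes each line already has a left edge (x), a baseline (y, larger values higher on the page), and a font size.

```ruby
# Hypothetical union-find over the integers 0...n (not the PR's DisjointSet).
class SimpleDisjointSet
  def initialize(count)
    @parent = (0...count).to_a      # each element starts as its own root
    @size = Array.new(count, 1)     # size of the tree rooted at each element
  end

  # Returns the root of the set containing +item+, compressing the path.
  def find(item)
    root = item
    root = @parent[root] until @parent[root] == root
    while @parent[item] != root
      next_item = @parent[item]
      @parent[item] = root
      item = next_item
    end
    root
  end

  # Merges the sets containing +a+ and +b+ (union by size).
  def union(a, b)
    ra = find(a)
    rb = find(b)
    return if ra == rb

    ra, rb = rb, ra if @size[ra] < @size[rb]
    @parent[rb] = ra
    @size[ra] += @size[rb]
  end
end

# Hypothetical stand-in for an extracted line of text.
Line = Struct.new(:text, :x, :y, :font_size)

# Groups lines that share a left edge and sit close together vertically.
# The thresholds are illustrative assumptions, not tuned values.
def group_into_paragraphs(lines, max_gap_factor: 1.5, x_tolerance: 5.0)
  set = SimpleDisjointSet.new(lines.size)
  lines.each_index do |i|
    ((i + 1)...lines.size).each do |j|
      a = lines[i]
      b = lines[j]
      same_column = (a.x - b.x).abs <= x_tolerance
      close = (a.y - b.y).abs <= max_gap_factor * [a.font_size, b.font_size].max
      set.union(i, j) if same_column && close
    end
  end

  # Collect lines by root, then read each group top to bottom.
  lines.each_index.group_by { |i| set.find(i) }.values.map do |indexes|
    indexes.sort_by { |i| -lines[i].y }.map { |i| lines[i].text }.join(" ")
  end
end

# Two columns plus a later paragraph in the left column.
lines = [
  Line.new("Left column, first", 50, 700, 10),
  Line.new("line of text.", 50, 688, 10),
  Line.new("Right column", 300, 700, 10),
  Line.new("paragraph.", 300, 688, 10),
  Line.new("A later left paragraph.", 50, 640, 10)
]
paragraphs = group_into_paragraphs(lines)
puts paragraphs
```

The pairwise comparison is quadratic in the number of lines, which keeps the sketch short; a real implementation would likely sort or index lines first and use a more careful notion of "same column" than a shared left edge.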

Oct 25 '23 20:10 judy

I deleted an earlier comment about misunderstanding a Sorbet error. All good for review now!

Nov 14 '23 15:11 judy

@yob I hope you're well! Any updates on this PR?

Thanks! :)

Feb 10 '24 16:02 Cosmo
