James Healy
Yup, it's the naive algorithm in PageLayout. If I extract the text from page 1 and inspect the value of `@runs` at this point: https://github.com/yob/pdf-reader/blob/8557768313c71de59298c5da0dac1404cf50afbb/lib/pdf/reader/page_layout.rb#L20 It looks like this: ```ruby...
> I can poke at this, but I'd love to know others have and it's been thoroughly explored and maxed out already. I definitely haven't given this method much attention,...
Thanks for this report. I've been able to reproduce it on the latest master branch. There are some definite weaknesses in our glyph positioning logic, and it seems like this is...
I get the same results when trying to extract text using pdf-reader. I also tried extracting text with pdftotext (which uses libpoppler), and Firefox (which uses pdf.js). Neither of them...
I can imagine ways that the logic here can result in situations with an unreasonably large array. 100,000 x 100,000 is VERY large, and I'm trying to imagine just how...
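To put the 100,000 x 100,000 figure in perspective, here is a rough back-of-the-envelope sketch. `grid_bytes` is a hypothetical helper, and one byte per cell is an assumption chosen as a lower bound; it is not a description of how pdf-reader actually stores its layout data.

```ruby
# Rough estimate of the memory a dense two-dimensional grid would need.
# Assumption: one byte per cell. Real Ruby strings and arrays carry
# extra per-object overhead, so actual usage would be higher.
def grid_bytes(rows, cols, bytes_per_cell = 1)
  rows * cols * bytes_per_cell
end

bytes = grid_bytes(100_000, 100_000)
gib = bytes / (1024.0**3)

puts "cells: #{100_000 * 100_000}"
puts format("approx. %.1f GiB at 1 byte per cell", gib)
```

Even under that optimistic assumption the grid holds 10 billion cells, which is more memory than most processes can reasonably allocate for a single page.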
Cool, that seems like a good workaround for now. If you want to upstream something, I'd be happy to accept a patch that changes PageLayout to optionally drop characters...
I'm not super familiar with the annotation options. However, my guess is the 8 `Line` annotations won't have any text associated with them. I also suspect that that text for...
Sorry I didn't get around to looking into this in 2017 😞 I just had a proper look and confirmed this issue is still happening in v2.8.0, and that...
I suspect this is an issue with our text layout algorithms in the `PageLayout` class. Unfortunately I'm short on time at the moment, but I'll happily accept patches if you...
Thanks for reporting this issue. pdf-reader has never intentionally skipped content on a page, and nothing between 2.0.0 and 2.4.0 has changed that. I guess it's possible one of the...