James Healy comments

Results: 139 comments of James Healy

Superscript words not being returned.

Yup, it's the naive algorithm in PageLayout. If I extract the text from page 1, and inspect the value of `@runs` at this point: https://github.com/yob/pdf-reader/blob/8557768313c71de59298c5da0dac1404cf50afbb/lib/pdf/reader/page_layout.rb#L20 It looks like this: ```ruby...

Performance of `prepare_regular_token`?

> I can poke at this, but I'd love to know others have and it's been thoroughly explored and maxed out already. I definitely haven't given this method much attention,...

doesn't read Correct Data

Thanks for this report. I've been able to reproduce it on the latest master branch. There are some definite weaknesses in our glyph positioning logic, and it seems like this is...

Unable to extract text from pdf/a (with flat decode)

I get the same results when trying to extract text using pdf-reader. I also tried extracting text with pdftotext (which uses libpoppler), and Firefox (which uses pdf.js). Neither of them...

creating page layout uses a lot of memory

I can imagine ways that the logic here can result in situations with an unreasonably large array. 100,000 x 100,000 is VERY large, and I'm trying to imagine just how...
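To put the 100,000 x 100,000 figure in perspective, here is a rough back-of-the-envelope sketch. It assumes a dense grid with a fixed cost of one byte per cell, which is only an illustration and not a statement about how `PageLayout` actually allocates memory. `dense_grid_bytes` is a hypothetical helper, not part of pdf-reader.

```ruby
# Hypothetical helper, not part of pdf-reader: estimates the memory needed
# for a dense rows x cols grid, assuming a fixed number of bytes per cell.
def dense_grid_bytes(rows, cols, bytes_per_cell: 1)
  rows * cols * bytes_per_cell
end

rows = 100_000
cols = 100_000

bytes = dense_grid_bytes(rows, cols)
gib   = bytes / (1024.0**3)

# 100,000 x 100,000 is 10 billion cells, roughly 9.3 GiB at 1 byte per cell.
puts "#{rows * cols} cells, about #{gib.round(1)} GiB at 1 byte per cell"
```

Real Ruby strings and arrays carry per-object overhead, so under this assumption the true cost of a dense grid would be higher still.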

creating page layout uses a lot of memory

Cool, that seems like a good workaround for now. If you want to upstream something, I'd be happy to accept a patch that changes PageLayout to optionally drop characters...

Getting Hyperlink Text

I'm not super familiar with the annotation options. However, my guess is the 8 `Line` annotations won't have any text associated with them. I also suspect that the text for...

Strange behaviour parsing PDF File

Sorry I didn't get around to looking into this in 2017 😞 I just had a proper look and confirmed this issue is still happening in v2.8.0, and that...

extracted text does not match text of pdf

I suspect this is an issue with our text layout algorithms in the `PageLayout` class. Unfortunately I'm short on time at the moment, but I'll happily accept patches if you...

Ignore pdf footer while reading

Thanks for reporting this issue. pdf-reader has never intentionally skipped content on a page, and nothing between 2.0.0 and 2.4.0 has changed that. I guess it's possible one of the...