pdf-reader
Page#text does not return all the text
For some reason PDF::Reader::Page#text does not return all the text in a PDF file I'm scanning, although I can get the missing text by looking at the runs directly. Here is the file: https://hacktivista.org/tmp/2700968.pdf
The text I'm unable to get through #text is "LECTURA ACTUAL 15-MAY-2023".
For the time being I just monkey-patched the class to add an :unformatted option. I'll leave it here:
require 'pdf/reader'

module PDF
  class Reader
    # PDF::Reader::Page monkey patches.
    class Page
      alias_method :_text, :text
      remove_method :text

      # @param [Hash] opts Adds :unformatted option.
      def text(opts = {})
        return runs.map(&:text).join(' ') if opts[:unformatted]

        _text(opts)
      end
    end
  end
end
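The same override could probably also be written with Module#prepend instead of alias_method/remove_method. Below is a minimal sketch of that pattern, not something taken from the library. To keep it runnable without the gem or the PDF, it uses two hypothetical stand-ins, Run and FakePage, in place of the real text runs and PDF::Reader::Page. The module name UnformattedText and the placeholder string are also made up for illustration.

```ruby
# Hypothetical stand-in for a text run: only the #text reader is modelled.
Run = Struct.new(:text)

# Hypothetical stand-in for PDF::Reader::Page, so the sketch needs no gem.
class FakePage
  attr_reader :runs

  def initialize(runs, formatted)
    @runs = runs
    @formatted = formatted
  end

  # Pretends to be a layout pass whose output is missing some runs.
  def text(opts = {})
    @formatted
  end
end

# Prepended module: its #text is looked up first, and `super` falls back
# to the class's own #text with the same arguments.
module UnformattedText
  def text(opts = {})
    return runs.map(&:text).join(' ') if opts[:unformatted]

    super
  end
end

FakePage.prepend(UnformattedText)

# 'placeholder formatted text' is invented sample data, not real output.
page = FakePage.new(
  [Run.new('LECTURA ACTUAL'), Run.new('15-MAY-2023')],
  'placeholder formatted text'
)

puts page.text(unformatted: true) # joined run text
puts page.text                    # original behaviour via super
```

Assuming the real Page#text keeps its single optional hash argument, prepending UnformattedText onto PDF::Reader::Page should behave like the patch above while leaving the original method in place. I have only checked this against the stand-in class.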
I had the same issue as well and am looking forward to seeing a fix merged into the library. In the meantime, thanks @hacktivista for this monkey patch.
Having the same issue!