pdf-reader Page#text does not return all the text

Open 3ynm opened this issue 1 year ago • 3 comments

For some reason, PDF::Reader::Page#text does not return all the text in a PDF file I'm scanning, although I can get the missing text by looking at the runs directly. Here is the file: https://hacktivista.org/tmp/2700968.pdf

The text I'm unable to get through #text is "LECTURA ACTUAL 15-MAY-2023".

Jun 16 '23 04:06 3ynm

For the time being I just monkey-patched the class to add an :unformatted option. I'll leave it here:

require 'pdf/reader'

module PDF
  class Reader
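    # PDF::Reader::Page monkey patches.
    class Page
      # Keep the original implementation reachable as #_text before replacing #text.
      alias_method :_text, :text
      remove_method :text

      # @param [Hash] opts Adds :unformatted option.
      def text(opts = {})
        # Skip the layout step: join the text of every run with single spaces.
        return runs.map(&:text).join(' ') if opts[:unformatted]

        # Otherwise fall back to the original, layout-based #text.
        _text(opts)
      end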
    end
  end
end
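
The same override could also be written with Module#prepend, which avoids the alias_method/remove_method pair and reaches the original method through super. This is only a sketch: FakePage, Run and UnformattedText are hypothetical names, and FakePage is a stand-in that imitates the symptom (its #text drops the last run) so the example runs without the gem or the PDF above. It is not the real PDF::Reader::Page.

```ruby
# Hypothetical stand-in for a text run: only #text is needed here.
Run = Struct.new(:text)

# Hypothetical stand-in for PDF::Reader::Page (NOT the real class).
class FakePage
  def initialize(runs)
    @runs = runs
  end

  # Stand-in for Page#runs: one object per text fragment.
  def runs(_opts = {})
    @runs
  end

  # Stand-in for the layout-based Page#text. It drops the last run
  # on purpose to imitate the missing-text symptom from this issue.
  def text(_opts = {})
    @runs[0...-1].map(&:text).join("\n")
  end
end

# Prepending places this module ahead of the class in method lookup,
# so `super` calls the class's own #text with the same arguments.
module UnformattedText
  def text(opts = {})
    return runs.map(&:text).join(' ') if opts[:unformatted]

    super
  end
end

FakePage.prepend(UnformattedText)

page = FakePage.new([Run.new('TOTAL'), Run.new('LECTURA ACTUAL 15-MAY-2023')])
formatted = page.text                      # goes through the original #text
unformatted = page.text(unformatted: true) # joins the runs directly
```

With the real gem loaded, the equivalent step would presumably be PDF::Reader::Page.prepend(UnformattedText); that is an assumption based on the patch above, not something tested against the PDF in this issue.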

Jun 16 '23 18:06 3ynm

I had the same issue as well and look forward to seeing a fix merged into the library. In the meantime, thanks @hacktivista for this monkey patch.

Oct 24 '23 13:10 pbernery

Having the same issue!

Apr 22 '24 17:04 mochetts
