Page.text fails when font size changes on a single line

Open coezbek opened this issue 4 years ago • 1 comments

When reading text from a document that uses different font sizes on the same line of text, I have seen extraction fail in two ways: extra spaces are inserted, or characters are overwritten. I am wondering is this something that pdf-reader is intended to do accurately?

Example file: "hello_world_caps.pdf"

Example spec (fails):

  describe "#text" do
    ...

    it "can deal with different height characters on the same line" do
      @browser = PDF::Reader.new(pdf_spec_file("hello_world_caps"))
      @page    = @browser.page(1)

      # Expected "HELLO WORLD", but the call actually returns "HELLWORLD"
      expect(@page.text).to eql("HELLO WORLD")
    end
  end

coezbek, Oct 22 '21 21:10

Thanks for a great sample file that demonstrates the issue.

> I am wondering is this something that pdf-reader is intended to do accurately?

I would classify it as a known issue that I'd like to handle better than we currently do. Probably the algorithm in PageLayout needs a significant overhaul, which is a bummer.
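
For readers wondering how a layout step can swallow characters, here is a toy model of the failure mode. It is not pdf-reader's actual PageLayout code: the `Run` struct, both helper methods, and every coordinate are hypothetical, chosen only so that a line with two glyph sizes is mapped onto a grid with a single fixed column width. The second helper is one possible alternative (joining runs by horizontal gap), shown for contrast and not as the planned fix.

```ruby
# Toy model only: NOT pdf-reader's actual PageLayout code.
# All names and numbers here are hypothetical.
Run = Struct.new(:x, :width, :text)

# Large glyphs are 10 units wide, small ones 6 units; the word gap is 4 units.
RUNS = [
  Run.new(0.0, 10.0, "H"),
  Run.new(10.0, 24.0, "ELLO"),
  Run.new(38.0, 10.0, "W"),
  Run.new(48.0, 24.0, "ORLD")
].freeze

# Place each run on a character grid with one fixed column width per line.
def grid_layout(runs, col_width)
  line = []
  runs.each do |run|
    col = (run.x / col_width).round
    run.text.each_char.with_index { |ch, i| line[col + i] = ch }
  end
  line.map { |ch| ch || " " }.join
end

# Join runs in x order, adding a space only where the horizontal gap is large.
def gap_layout(runs, min_gap)
  out = String.new
  prev_end = nil
  runs.sort_by(&:x).each do |run|
    out << " " if prev_end && (run.x - prev_end) > min_gap
    out << run.text
    prev_end = run.x + run.width
  end
  out
end

puts grid_layout(RUNS, 10.0) # prints "HELLWORLD": "W" overwrites the "O"
puts gap_layout(RUNS, 2.0)   # prints "HELLO WORLD"
```

With these made-up numbers the small glyphs are narrower than one grid column, so the run starting with "W" rounds to a column that "ELLO" has already filled. Whether the real sample file fails for the same reason is an assumption, not something verified here.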

yob, Oct 22 '21 22:10