
Page#text does not return extra whitespaces between words

Open rubemz opened this issue 12 years ago • 2 comments

There is another change in 1.3.0 that affected our test suite.

It looks like even strings that were intentionally created with two (or more) spaces between words come back from Page#text with only a single space between those words.

For example, some date strings contain double spaces because of the format mask (%l: hour of the day, 12-hour clock, blank-padded, 1..12). Since Page#text returns no more than a single space between words, the test breaks.
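
To make the format-mask case concrete, here is a minimal Ruby sketch. The timestamp and the format string are made up for illustration and are not taken from the actual test suite; the only point is that `%l` pads single-digit hours with a blank, which produces two consecutive spaces.

```ruby
# Hypothetical timestamp: 9:05 AM on 1 January 2013.
time = Time.new(2013, 1, 1, 9, 5)

# %l is the hour on a 12-hour clock, blank-padded to two characters,
# so a single-digit hour is preceded by an extra space.
label = time.strftime("Created at %l:%M %p")

puts label.inspect  # => "Created at  9:05 AM"
```

A test that compares this string against the output of Page#text fails if the two spaces before the hour are collapsed into one.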

Is it intended behavior for Page#text to return only a single space between words, even though the original string (and the rendered one) has more than one?

rubemz, Jan 01 '13 17:01

pdf-reader isn't intentionally limiting whitespace between words to a single space.

It's attempting to lay out text of varying sizes and styles on a canvas that only supports fixed-width text, and it's likely to get things a bit wrong sometimes. There's almost certainly room for improvement. In this case, it thinks the gap between your words is "about" equal to a single space in the current font, so it only leaves a single space.
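
As a rough illustration of that kind of rounding, here is a hypothetical sketch. It is not the code in PDF::Reader::PageLayout; the helper name `spaces_for_gap` and the widths are assumptions chosen only to show how a gap that is slightly wider than one space can still end up as a single space.

```ruby
# Hypothetical helper, not pdf-reader's actual implementation.
# Estimates how many fixed-width spaces to emit for a horizontal gap,
# given the gap width and the width of one space in the current font.
def spaces_for_gap(gap_width, space_width)
  # Without a usable space width, fall back to a single space.
  return 1 if space_width <= 0

  # Round to the nearest whole number of spaces, but never fewer than one.
  count = (gap_width.to_f / space_width).round
  [count, 1].max
end

puts spaces_for_gap(3.9, 2.8)  # gap is about 1.4 spaces wide
puts spaces_for_gap(5.6, 2.8)  # gap is exactly 2 spaces wide
```

With these made-up numbers the first call yields 1 and the second yields 2, so a double space whose measured width falls short of roughly 1.5 estimated space widths would be collapsed.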

There are two areas you could look into to see if they help in your case:

  1. remove the "unless utf8_chars == SPACE" check from line 111 of lib/pdf/reader/page_text_receiver.rb
  2. See if you can improve the layout logic in PDF::Reader::PageLayout to help

If you find a change that improves the layout code for most people, I'll be happy to merge it.

yob, Jan 05 '13 16:01

  1. remove the "unless utf8_chars == SPACE" check from line 111 of lib/pdf/reader/page_text_receiver.rb

It works.

zachary-wq, Apr 24 '14 03:04