Page#text does not return extra whitespaces between words
There is another change in 1.3.0 that affected our test suite.
It looks like even strings that were intentionally created with two (or more) spaces between words come back from Page#text with a single space between the words.
For example, some date strings contain double spaces because of the format mask (%l: hour of the day, 12-hour clock, blank-padded ( 1..12)). Since Page#text does not return more than a single space between words, the test breaks.
Is it desired behavior for Page#text to limit its output to a single space between words, even though the original string (and the rendered one) has more than one?
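To make the double space concrete, here is a minimal sketch of how a blank-padded %l produces it. The format mask and the date are illustrative assumptions, not the ones from the actual test suite.

```ruby
# %l blank-pads single-digit hours, so the space that separates the date
# from the time is followed by a padding space.
time = Time.new(2013, 1, 5, 9, 30, 0)

# Assumed format mask, chosen only for illustration.
formatted = time.strftime("%m/%d/%Y %l:%M %p")
puts formatted.inspect  # "01/05/2013  9:30 AM"

# True when the string contains a run of two or more spaces.
has_double_space = formatted.match?(/ {2,}/)
puts has_double_space   # true
```

With a two-digit hour (10 through 12) the padding disappears, so the same mask yields a single space for those times.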
pdf-reader isn't intentionally limiting whitespace between words to a single space.
It's attempting to lay out text of varying sizes and styles onto a canvas that only supports fixed-width text, and it's likely to get things a bit wrong sometimes. There's almost certainly room for improvement. In this case, it thinks the space between your words is "about" equal to a single space in the current font, so it only leaves a single space.
There are two areas you could look into to see if they help in your case:
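As a rough illustration of why a gap can collapse to one space, here is a toy model that rounds a horizontal gap to a whole number of space characters. This is not pdf-reader's actual layout code; spaces_for_gap and the widths used are hypothetical.

```ruby
# Hypothetical helper, not part of pdf-reader: convert a horizontal gap
# (in points) into a count of fixed-width space characters.
def spaces_for_gap(gap_width, space_width)
  count = (gap_width / space_width.to_f).round
  # Always keep at least one space so adjacent words never merge.
  [count, 1].max
end

# A gap slightly wider than one space glyph still rounds to one space.
puts spaces_for_gap(3.1, 2.8)  # 1

# A gap roughly twice the glyph width rounds to two spaces.
puts spaces_for_gap(5.6, 2.8)  # 2
```

Under this model, any gap measured as less than about one and a half space widths ends up as a single space in the output.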
- remove the "unless utf8_chars == SPACE" check from line 111 of lib/pdf/reader/page_text_receiver.rb
- See if you can improve the layout logic in PDF::Reader::PageLayout to help
If you find a change that improves the layout code for most people, I'll be happy to merge it.
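If patching the library is not an option, a test suite could instead compare text with runs of spaces collapsed. This is only a sketch of a test-side workaround; normalize_spaces is a hypothetical helper, not part of pdf-reader, and the extracted string is an assumed example of what Page#text might return.

```ruby
# Hypothetical test helper, not part of pdf-reader: collapse runs of
# spaces to one space and trim the ends before comparing.
def normalize_spaces(text)
  text.squeeze(" ").strip
end

expected  = "01/05/2013  9:30 AM"  # string as originally formatted
extracted = "01/05/2013 9:30 AM"   # assumed Page#text output

same = normalize_spaces(expected) == normalize_spaces(extracted)
puts same  # true
```

The trade-off is that the test no longer verifies the exact spacing, only the words and their order.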
Removing the "unless utf8_chars == SPACE" check from line 111 of lib/pdf/reader/page_text_receiver.rb works.