pdf-reader Extra \n characters

Hi,

I'm extracting text (I'm not the author of the pdf) from http://boe.es/borme/dias/2011/08/23/pdfs/BORME-B-2011-160-28.pdf

On the first page, in the first line (not counting the titles), the character '\n' appears twice, and I think it should appear only once. Here is the output:

artículo 378.7 del Reglamento del\n\n               Registro Mercantil)\n\n

I mean the characters '\n' in the middle of the string.

ruby 1.9.3p0 pdf-reader 1.3.3

Thanks and nice job!

Apr 10 '13 11:04 dpzaba

Hola David, thanks for the report.

Are we looking at the same PDF? The one you linked to does not have "artículo 378.7" anywhere in the document.

Apr 10 '13 11:04 yob

Hi James,

I'm sorry, the PDF is http://boe.es/borme/dias/2011/08/23/pdfs/BORME-B-2011-160-28.pdf

Thanks again.

Apr 10 '13 11:04 dpzaba

OK, I can reproduce it here. It's not ideal - the layout algorithms in lib/pdf/reader/page_layout.rb could definitely be improved.

One idea might be to detect "blocks" of text that appear to be close together vertically and render them as one.
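
A rough sketch of that idea, not pdf-reader's actual layout code: it assumes each line is a hash with a hypothetical :y baseline and :text, and that a hypothetical max_gap threshold decides what counts as "close together". Both names are made up for illustration.

```ruby
# Hypothetical sketch: group lines into blocks by vertical proximity.
# The :y / :text keys and max_gap are assumptions, not pdf-reader's API.
def group_into_blocks(lines, max_gap)
  # PDF coordinates grow upwards, so sort from the top of the page down.
  sorted = lines.sort_by { |line| -line[:y] }
  # Start a new block whenever the gap to the previous line exceeds max_gap.
  sorted.slice_when { |above, below| (above[:y] - below[:y]) > max_gap }.to_a
end

lines = [
  { y: 700.0, text: "artículo 378.7 del Reglamento del" },
  { y: 688.0, text: "Registro Mercantil)" },
  { y: 640.0, text: "next paragraph" }
]

blocks = group_into_blocks(lines, 15.0)

# Lines inside a block get a single newline; blocks are separated by a blank line.
text = blocks.map { |block| block.map { |l| l[:text] }.join("\n") }.join("\n\n")
```

With these made-up coordinates the first two lines land in one block, so they are separated by a single newline instead of two.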

I'm pressed for time at the moment so probably can't look into it much for now, but I'd happily accept any pull requests for review.

Apr 10 '13 11:04 yob

Hi James,

I think the problem is the line:

artículo 378.7 del Reglamento del\n\n               Registro Mer...

Should be:

#only one \n
artículo 378.7 del Reglamento del\n               Registro Mer...

** Be careful!! I'm a beginner in Ruby. I was reading the code in lib/pdf/reader/page_layout.rb, line 35:

interesting_rows(page).map(&:rstrip).join("\n")

I think the problem is that the method interesting_rows receives an invalid element (an empty element) in the page parameter, which is then joined with "\n". Right? Maybe something is wrong with TextRun? (I need to understand this better.)
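
To illustrate that hypothesis, here is a minimal sketch. It assumes the rows are plain strings and that one of them is empty. The rows array is made up for illustration and is not taken from pdf-reader's internals, and the reject step is only one possible workaround, not what the library does.

```ruby
# Hypothetical rows standing in for whatever the layout step produces.
rows = [
  "artículo 378.7 del Reglamento del   ",
  "",
  "               Registro Mercantil)"
]

# Same shape as the quoted line: strip trailing whitespace, join with newlines.
# The empty row in the middle produces two consecutive "\n" characters.
joined = rows.map(&:rstrip).join("\n")

# One possible workaround (an assumption): drop blank rows before joining.
compact = rows.map(&:rstrip).reject(&:empty?).join("\n")
```

Note that dropping every blank row would also remove intentional blank lines between paragraphs, so this only demonstrates where the extra "\n" could come from.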

Apr 10 '13 16:04 dpzaba