pdf-reader

Extra \n characters

Open dpzaba opened this issue 12 years ago • 4 comments

Hi,

I'm extracting text (I'm not the author of the pdf) from http://boe.es/borme/dias/2011/08/23/pdfs/BORME-B-2011-160-28.pdf

On the first page, in the first line (not counting the titles), the character '\n' appears twice, and I think it should appear only once. Here is the output:

artículo 378.7 del Reglamento del\n\n               Registro Mercantil)\n\n

I mean the characters '\n' in the middle of the string.

ruby 1.9.3p0, pdf-reader 1.3.3

Thanks and nice job!

dpzaba, Apr 10 '13 11:04

Hola David, thanks for the report.

Are we looking at the same PDF? The one you linked to does not have "artículo 378.7" anywhere in the document.

yob, Apr 10 '13 11:04

Hi James,

I'm sorry, the PDF is http://boe.es/borme/dias/2011/08/23/pdfs/BORME-B-2011-160-28.pdf

Thanks again.

dpzaba, Apr 10 '13 11:04

OK, I can reproduce it here. It's not ideal - the layout algorithms in lib/pdf/reader/page_layout.rb could definitely be improved.

One idea might be to detect "blocks" of text that appear to be close together vertically and render them as one.
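A rough sketch of that block idea, not taken from pdf-reader itself: `Line`, `group_into_blocks`, `render`, the coordinates, and the `max_gap` threshold are all hypothetical names and values chosen for illustration. It assumes each line carries a vertical position and starts a new block only when the gap to the previous line exceeds the threshold. `slice_when` needs Ruby 2.2 or later.

```ruby
# Hypothetical stand-in for a positioned line of text; not a pdf-reader class.
Line = Struct.new(:y, :text)

# Group lines into blocks: a new block starts whenever the vertical gap to the
# previous line exceeds max_gap (an assumed threshold, in PDF points).
def group_into_blocks(lines, max_gap)
  # PDF y coordinates grow upward, so sort descending to read top to bottom.
  sorted = lines.sort_by { |line| -line.y }
  sorted.slice_when { |upper, lower| (upper.y - lower.y) > max_gap }.to_a
end

# Lines inside a block are joined with one "\n"; blocks are separated by a
# blank line.
def render(lines, max_gap)
  group_into_blocks(lines, max_gap)
    .map { |block| block.map { |line| line.text.rstrip }.join("\n") }
    .join("\n\n")
end

# Made-up positions: the first two lines are 12 points apart, the third is
# 48 points further down.
lines = [
  Line.new(700.0, "artículo 378.7 del Reglamento del"),
  Line.new(688.0, "               Registro Mercantil)"),
  Line.new(640.0, "Next paragraph")
]

puts render(lines, 15.0)
```

With these made-up values the first two lines fall in one block and are separated by a single newline. A real threshold would probably need to scale with font size rather than be a fixed number.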

I'm pressed for time at the moment so probably can't look into it much for now, but I'd happily accept any pull requests for review.

yob, Apr 10 '13 11:04

Hi James,

I think the problem is the line:

artículo 378.7 del Reglamento del\n\n               Registro Mer...

Should be:

#only one \n
artículo 378.7 del Reglamento del\n               Registro Mer...

Be careful: I'm a beginner in Ruby. I was reading the code in lib/pdf/reader/page_layout.rb, at line 35:

interesting_rows(page).map(&:rstrip).join("\n")

I think the problem is that the method interesting_rows receives, or returns, an invalid (empty) element for the page, which is then joined with "\n". Right? Maybe something is wrong with TextRun? (I need to understand this better.)
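To illustrate the suspicion, here is a minimal sketch using a hand-written `rows` array as a stand-in for whatever interesting_rows(page) returns. The blank middle row is an assumption, not something confirmed from the library. It shows how a single empty element is enough to produce "\n\n" after the join.

```ruby
# Hypothetical rows; the empty middle element is assumed for illustration.
rows = [
  "artículo 378.7 del Reglamento del   ",
  "",
  "               Registro Mercantil)"
]

# Same shape as the expression at line 35: strip trailing spaces, then join.
joined = rows.map(&:rstrip).join("\n")
# The empty row sits between two separators, so the result contains "\n\n".
puts joined.inspect

# Dropping empty rows before joining leaves a single "\n" between the lines.
compact = rows.map(&:rstrip).reject(&:empty?).join("\n")
puts compact.inspect
```

Rejecting every empty row would also remove blank lines that are intentional, such as gaps between paragraphs, so this only demonstrates the mechanism and is not a proposed fix.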

dpzaba, Apr 10 '13 16:04