Extra spaces between letters in a single word

Open pickhardt opened this issue 1 year ago • 2 comments

I noticed this gem has problems parsing some PDFs where the text is not necessarily clean.

For instance, this file: https://www.jstor.org/stable/3684663

Some parts of it get output like: "a b o u t a r e g r e s s i o n t o o r i g i n a l c h a o s"

However, it doesn't seem like it's inherently a problem with the file, because Python's PyPDF2 interprets it correctly as "about a regression to original chaos"

Do you think there is some step that this reader is missing? Or, alternatively, is there some option I should set when using PDF::Reader to get it to read these PDFs better?
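To make the symptom easier to spot across many files, here is a minimal sketch of a check for it. `letter_spaced?` is a hypothetical helper written for illustration, not part of pdf-reader, and the threshold of six letters is an arbitrary assumption. It only detects the problem in already-extracted text (for example, a string returned by `page.text`); it does not repair it, since the original word boundaries are lost once every letter is separated.

```ruby
# Hypothetical helper (not part of pdf-reader): returns true when the text
# contains a run of at least `min_letters` single letters, each separated
# from the next by exactly one space.
def letter_spaced?(text, min_letters: 6)
  # (min_letters - 1) repetitions of "letter + space", then one final letter.
  pattern = /\b(?:[[:alpha:]] ){#{min_letters - 1},}[[:alpha:]]\b/
  text.match?(pattern)
end

garbled = "a b o u t a r e g r e s s i o n t o o r i g i n a l c h a o s"
clean   = "about a regression to original chaos"

puts letter_spaced?(garbled) # the garbled sample is flagged
puts letter_spaced?(clean)   # the clean sentence is not
```

A single-letter word such as "a" in normal prose does not trigger the check, because the run has to continue for several letters in a row.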

pickhardt avatar Mar 27 '23 00:03 pickhardt

I too am experiencing this issue.

shmolf avatar Apr 15 '24 19:04 shmolf

same here.

I worked around it with gsub. It works when the run-together word is in PascalCase, but not when it is all lowercase:

TheFirstWord = The First Word
gsub(/([a-z])([A-Z])/, '\1 \2')
thefirstword = thefirstword ???
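For anyone who wants to try this, the gsub workaround can be sketched as a small method. The name `split_pascal_case` is made up for this example; the regular expression is the one quoted in this comment. It inserts a space wherever a lowercase ASCII letter is directly followed by an uppercase one, so it cannot help when the run-together text has no capital letters.

```ruby
# Hypothetical wrapper around the gsub from this comment: insert a space
# between a lowercase letter and the uppercase letter that follows it.
def split_pascal_case(text)
  text.gsub(/([a-z])([A-Z])/, '\1 \2')
end

puts split_pascal_case("TheFirstWord") # => The First Word
puts split_pascal_case("thefirstword") # => thefirstword (no capitals, so unchanged)
```

Splitting all-lowercase text would need something beyond a regular expression, such as a dictionary-based word segmenter, which is outside what this sketch covers.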

iprog21 avatar May 31 '24 08:05 iprog21