Reading contents of a PDF

Open santiagomed opened this issue 1 year ago • 9 comments

Is there an example on how to simply read the contents of a PDF successfully? I tried looking into read.rs but it seems to be outdated so I can't run it. Any way to read a PDF?

santiagomed avatar Sep 15 '23 18:09 santiagomed

What content do you want? There is a lot in there.

  • Content stream? You can get that from the page object.
  • Text? See the pdf_render and pdf_text crates.

You can use the pdf crate in two versions:

  • from crates.io, in which case use the examples that match it: https://github.com/pdf-rs/pdf/tree/a6e2abc96b23b64aa1051966bb000aabf1275d9f
  • master, with the latest fixes.

The pdf_render and pdf_text crates only work with the latest master.

s3bk avatar Sep 17 '23 01:09 s3bk

What are the pdf_render and pdf_text crates? crates.io doesn't know anything about them.

vjau avatar Dec 08 '23 15:12 vjau

They are not on crates.io because they do not meet my stability requirements for publishing there. pdf_render … renders pdfs. pdf_text extracts text.

s3bk avatar Dec 08 '23 16:12 s3bk

pdf-extract crate exists, but depends on lopdf, not pdf. This video benchmarks it against poppler, a C library.

I'd be curious to see a C/Rust comparison but with poppler against pdf_text.

alexis779 avatar May 07 '24 19:05 alexis779

Any chance for an easy example that just converts a PDF file to a String?

I need to search through the valid UTF-8 text of a PDF and not panic if the PDF is formatted in any unexpected way.

Documentation found regarding this seems so scarce..

Gisbert12843 avatar Jul 19 '24 19:07 Gisbert12843

If pdf_text does not do what you need, then no, there is no easy example. This is not an easy problem. I have been working on this for multiple years now and have thrown many algorithms at it, and it is still not perfect. pdf_render renders the PDF and allows you to capture the drawn strings. That's as good as it gets.

s3bk avatar Jul 19 '24 20:07 s3bk

Ahh thank you for clarifying that!

Unrelated to this project, I was working with lopdf on that task. Everything worked until a PDF file did not follow regular encoding, aka is corrupted or Chinese xd

Sadly, lopdf just panics in every such case instead of returning an error. Weird behaviour from my pov.

Gisbert12843 avatar Jul 19 '24 21:07 Gisbert12843

Oh sure. If everything is in standard encoding, it is easy. And yes, I tried to not panic in the pdf crate. pdf_render might panic, but that would be a bug and needs fixing.

pdf_render is used in production with "random" PDFs. And it's not great for a server to crash from a user supplied PDF.

s3bk avatar Jul 19 '24 21:07 s3bk
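
For readers who are stuck with an extractor that may panic on malformed input, one stopgap is to contain the panic at the call site so that a user-supplied PDF cannot take the whole process down. The following is only a minimal sketch using the standard library's `std::panic::catch_unwind`. The names `extract_without_crashing` and `hypothetical_extract` are made up for illustration and are not part of pdf, pdf_render, pdf_text, lopdf, or pdf-extract; the stand-in extractor just checks for a `%PDF` header. It also assumes the default `panic = "unwind"` setting, since nothing can be caught under `panic = "abort"`.

```rust
use std::panic::{self, UnwindSafe};

/// Runs `extract` and turns a panic into an `Err` instead of letting it
/// unwind into the caller. The closure is where a real extractor call would go.
fn extract_without_crashing<F>(extract: F) -> Result<String, String>
where
    F: FnOnce() -> Result<String, String> + UnwindSafe,
{
    match panic::catch_unwind(extract) {
        // The extractor returned normally: pass its own Ok/Err through.
        Ok(result) => result,
        // The extractor panicked: recover the message if it is a string.
        Err(payload) => {
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            Err(format!("extractor panicked: {}", msg))
        }
    }
}

/// Hypothetical stand-in for a third-party extractor (an assumption, not a
/// real API): it panics on input that lacks a `%PDF` header.
fn hypothetical_extract(bytes: &[u8]) -> Result<String, String> {
    if !bytes.starts_with(b"%PDF") {
        panic!("unexpected header");
    }
    String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())
}

fn main() {
    let good = b"%PDF-1.7 hello".to_vec();
    let bad = b"not a pdf".to_vec();

    println!("{:?}", extract_without_crashing(|| hypothetical_extract(&good)));
    println!("{:?}", extract_without_crashing(|| hypothetical_extract(&bad)));
}
```

The default panic hook still prints the panic message to stderr, and any state the extractor left half-updated is not repaired, so this contains a crash but does not replace a library that returns errors properly.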

@santiagomed I wanted to do the same thing as you did. Thank you for flagging this issue!

@alexis779 Thank you, pdf-extract works like a charm!

acro5piano avatar Jul 31 '24 09:07 acro5piano