lopdf Implement decoding of Unicode characters

Open JohnAZoidberg opened this issue 4 years ago • 8 comments

Possibly a duplicate of #86, if decoding and encoding need to be implemented together.

I've got a PDF with the text 打尼爾 and

println!("{:?}", doc.extract_text(&vec![1]).unwrap());

yields:

"?Identity-H Unimplemented??Identity-H Unimplemented??Identity-H Unimplemented?\n"

Nov 10 '20 06:11 JohnAZoidberg

Hello, any updates here?

Mar 28 '21 18:03 KoStard

Hey, for a personal project I needed text extraction from OCR'd PDFs that use Identity-H encoding and a ToUnicode CMap. I implemented the basic functionality in my fork of lopdf (https://github.com/enzingerm/lopdf/tree/unicode_cmap). It works for the PDFs I work with, but I'm quite sure it won't work for other kinds of PDFs, given the complexity of the standard and how basic my implementation is. Maybe someone wants to give it a try. Feedback is appreciated :)

Sep 27 '21 08:09 enzingerm
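
For readers unfamiliar with the idea mentioned above, here is a rough sketch of what decoding an Identity-H string with a ToUnicode table involves: each glyph is a 2-byte big-endian code that is looked up in a code-to-Unicode map. This is not the code from the fork and not lopdf's API. The function `decode_identity_h`, the `HashMap` standing in for an already-parsed CMap, and the glyph codes are all hypothetical, chosen only for illustration.

```rust
use std::collections::HashMap;

/// Decode bytes shown with an Identity-H font.
/// Each glyph is a 2-byte big-endian code, looked up in `to_unicode`.
/// `to_unicode` is a hypothetical, already-parsed ToUnicode table
/// (an assumption for this sketch, not a lopdf type).
fn decode_identity_h(bytes: &[u8], to_unicode: &HashMap<u16, String>) -> String {
    let mut out = String::new();
    for pair in bytes.chunks(2) {
        // A trailing odd byte cannot form a code; emit U+FFFD for it.
        if pair.len() < 2 {
            out.push('\u{FFFD}');
            continue;
        }
        let code = u16::from_be_bytes([pair[0], pair[1]]);
        match to_unicode.get(&code) {
            Some(text) => out.push_str(text),
            // Unmapped code: emit U+FFFD instead of failing.
            None => out.push('\u{FFFD}'),
        }
    }
    out
}

fn main() {
    // Made-up glyph codes; real ones would come from the PDF's
    // ToUnicode CMap (its bfchar / bfrange entries).
    let mut map = HashMap::new();
    map.insert(0x0001u16, "打".to_string());
    map.insert(0x0002, "尼".to_string());
    map.insert(0x0003, "爾".to_string());

    let content = [0x00, 0x01, 0x00, 0x02, 0x00, 0x03];
    println!("{}", decode_identity_h(&content, &map));
}
```

A real implementation would also have to parse the CMap stream itself and handle range mappings and codes that map to several Unicode characters, which is presumably where most of the complexity mentioned above comes from.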
