
Implement decoding of Unicode characters

Open JohnAZoidberg opened this issue 4 years ago • 8 comments

Possible duplicate of #86 if decoding and encoding would need to be implemented together.

I've got a PDF with the text 打尼爾 and

println!("{:?}", doc.extract_text(&vec![1]).unwrap());

yields:

"?Identity-H Unimplemented??Identity-H Unimplemented??Identity-H Unimplemented?\n"

JohnAZoidberg, Nov 10 '20 06:11

Hello, any updates here?

KoStard, Mar 28 '21 18:03

Hey, for a personal project I needed text extraction from OCR'd PDFs that use Identity-H encoding and a ToUnicode CMap. I implemented the basic functionality in my fork of lopdf. It can be found here: https://github.com/enzingerm/lopdf/tree/unicode_cmap

It works for the PDFs I work with, but I'm quite sure it won't work for other kinds of PDFs, given the complexity of the standard and how basic my implementation is. Maybe someone wants to give it a try. Feedback is appreciated :)

enzingerm, Sep 27 '21 08:09
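
For readers unfamiliar with the approach mentioned above (Identity-H encoded text resolved through a ToUnicode CMap), here is a minimal sketch of the idea. It is not taken from lopdf or from the linked fork. The function names `parse_bfchar`, `utf16be_hex_to_string` and `decode_identity_h` are hypothetical, the sample CMap is made up for illustration, and the sketch only handles 2-byte codes in `beginbfchar` sections (no `bfrange`, no codespace handling), so real PDFs will need considerably more.

```rust
use std::collections::HashMap;

/// Convert a hex string of UTF-16BE code units (e.g. "6253") into a String.
/// Hypothetical helper; returns None on malformed input.
fn utf16be_hex_to_string(hex: &str) -> Option<String> {
    if hex.is_empty() || hex.len() % 4 != 0 || !hex.is_ascii() {
        return None;
    }
    let units: Option<Vec<u16>> = hex
        .as_bytes()
        .chunks(4)
        .map(|c| std::str::from_utf8(c).ok().and_then(|s| u16::from_str_radix(s, 16).ok()))
        .collect();
    String::from_utf16(&units?).ok()
}

/// Build a lookup table from the `beginbfchar` ... `endbfchar` sections of a
/// ToUnicode CMap. Assumption: one `<code> <unicode>` entry per line, 2-byte codes.
fn parse_bfchar(cmap: &str) -> HashMap<u16, String> {
    let mut table = HashMap::new();
    let mut in_section = false;
    for line in cmap.lines() {
        let line = line.trim();
        if line.ends_with("beginbfchar") {
            in_section = true;
            continue;
        }
        if line == "endbfchar" {
            in_section = false;
            continue;
        }
        if !in_section {
            continue;
        }
        // An entry looks like: <0001> <6253>
        let hex: Vec<&str> = line
            .split(|c: char| c == '<' || c == '>')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .collect();
        if hex.len() != 2 {
            continue;
        }
        if let (Ok(code), Some(text)) =
            (u16::from_str_radix(hex[0], 16), utf16be_hex_to_string(hex[1]))
        {
            table.insert(code, text);
        }
    }
    table
}

/// Decode a string operand: with Identity-H every character code is 2 bytes,
/// big-endian. Codes missing from the table become U+FFFD.
fn decode_identity_h(bytes: &[u8], table: &HashMap<u16, String>) -> String {
    bytes
        .chunks(2)
        .map(|pair| {
            if pair.len() == 2 {
                let code = u16::from_be_bytes([pair[0], pair[1]]);
                table.get(&code).map(String::as_str).unwrap_or("\u{FFFD}")
            } else {
                "\u{FFFD}"
            }
        })
        .collect()
}

fn main() {
    // Made-up CMap fragment; the codes 0001..0003 are arbitrary.
    let cmap = "3 beginbfchar\n<0001> <6253>\n<0002> <5C3C>\n<0003> <723E>\nendbfchar";
    let table = parse_bfchar(cmap);
    let text = decode_identity_h(&[0x00, 0x01, 0x00, 0x02, 0x00, 0x03], &table);
    println!("{}", text);
}
```

The table is keyed by the raw 2-byte code rather than by a glyph name because Identity-H carries no character semantics of its own; without the ToUnicode CMap there is nothing to map the codes back to text.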