lopdf
Implement decoding of Unicode characters
Possible duplicate of #86 if decoding and encoding would need to be implemented together.
I've got a PDF with the text 打尼爾
and
println!("{:?}", doc.extract_text(&vec![1]).unwrap());
yields:
"?Identity-H Unimplemented??Identity-H Unimplemented??Identity-H Unimplemented?\n"
Hello, any updates here?
Hey, for a personal project I needed text extraction from OCR'd PDFs that use Identity-H encoding and a ToUnicode CMap. I implemented the basic functionality in my fork of lopdf. It can be found here: https://github.com/enzingerm/lopdf/tree/unicode_cmap It works for the PDFs I work with, but I'm quite sure it won't work for other kinds of PDFs, given the complexity of the standard and how basic my implementation is. Maybe someone wants to give it a try. Feedback is appreciated :)
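To illustrate the mechanism being discussed: with Identity-H, each character in a content-stream string is a 2-byte big-endian code, and the font's ToUnicode CMap says which Unicode text each code stands for. The following is only a minimal sketch of that lookup, not the code from the fork above and not lopdf's API. The function names `parse_bfchar`, `utf16be_hex` and `decode_identity_h` are hypothetical, and it assumes an already decompressed CMap that uses only `bfchar` entries with one entry per line (`bfrange` sections and other code widths are ignored).

```rust
use std::collections::HashMap;

/// Decode a hex string such as "6253" (UTF-16BE code units) into a String.
fn utf16be_hex(hex: &str) -> Option<String> {
    if hex.is_empty() || hex.len() % 4 != 0 || !hex.is_ascii() {
        return None;
    }
    let units: Option<Vec<u16>> = hex
        .as_bytes()
        .chunks(4)
        .map(|c| std::str::from_utf8(c).ok().and_then(|s| u16::from_str_radix(s, 16).ok()))
        .collect();
    String::from_utf16(&units?).ok()
}

/// Collect the `beginbfchar ... endbfchar` entries of a ToUnicode CMap into a
/// map from 2-byte character code to Unicode text.
/// Assumption: one `<code> <unicode>` entry per line; `bfrange` is not handled.
fn parse_bfchar(cmap: &str) -> HashMap<u16, String> {
    let mut map = HashMap::new();
    let mut in_section = false;
    for line in cmap.lines() {
        let line = line.trim();
        if line.ends_with("beginbfchar") {
            in_section = true;
        } else if line == "endbfchar" {
            in_section = false;
        } else if in_section {
            // An entry looks like `<0001> <6253>`; keep only the hex parts.
            let hex: Vec<&str> = line
                .split(|c| c == '<' || c == '>')
                .map(str::trim)
                .filter(|s| !s.is_empty())
                .collect();
            if hex.len() != 2 {
                continue; // skip malformed lines
            }
            if let (Ok(code), Some(text)) = (u16::from_str_radix(hex[0], 16), utf16be_hex(hex[1])) {
                map.insert(code, text);
            }
        }
    }
    map
}

/// Decode the raw bytes of an Identity-H string: read 2-byte big-endian codes
/// and look each one up, substituting U+FFFD for codes missing from the map.
/// A trailing odd byte, if any, is dropped.
fn decode_identity_h(bytes: &[u8], to_unicode: &HashMap<u16, String>) -> String {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .map(|code| to_unicode.get(&code).map(String::as_str).unwrap_or("\u{FFFD}"))
        .collect()
}
```

Given a CMap with the (made-up) entries `<0001> <6253>`, `<0002> <5C3C>` and `<0003> <723E>`, calling `decode_identity_h` on the bytes `00 01 00 02 00 03` would produce 打尼爾. A real implementation would also need to handle `bfrange`, multi-entry lines, and codespace ranges of other widths.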