lopdf
Wrong letters in pdf
Hi,
The following code reports the wrong letters (each one is shifted by 3 for some reason).
The culprit: https://www.dropbox.com/scl/fi/6a8zuy70s05pntvxm0vae/test.pdf?rlkey=ylju1wbavr8rff10jp621u6bo&dl=0
It reports DEF instead of ABC.
use lopdf::Document;

#[cfg(any(feature = "pom_parser", feature = "nom_parser"))] // same result with "pom"
fn main() {
    // Load the PDF; give up silently if it cannot be parsed.
    let mut doc = match Document::load("/path/to/test.pdf") {
        Ok(v) => v,
        Err(_) => return,
    };
    doc.decompress();
    // get_pages() maps page numbers to object ids; extract_text() takes page numbers.
    for (page_number, _) in doc.get_pages().iter() {
        match doc.extract_text(&[*page_number]) {
            Ok(text) => println!("{}", text),
            Err(e) => println!("Nope {}", e),
        }
    }
}
Any idea why? Is anything wrong with my code? Thanks.
Your example PDF contains DEF as its text content but shows ABC when opened in a PDF viewer. This seems to be done by using the rg/RG operators to draw ABC manually instead of using text streams or something similar (I am no PDF expert).
This behavior is caused by a bug/missing feature of lopdf.
I am fairly sure that the rg/RG operators have nothing to do with this, as they only seem to set the color for whatever comes next in the content stream.
(See the PDF 2.0 spec, Annex A: Operator Summary.)
The relevant part of the PDF that is responsible for rendering the A is only:
/Fo0S0 12.00000 Tf
<44> Tj
The first line sets the font (which is defined in the resource dictionary of the page), `<44>` defines the glyph that is to be rendered, and `Tj` tells the reader to render that glyph.
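To make the symptom concrete, here is a minimal sketch of what happens when the operand of `Tj` is read as plain text without consulting the font. The helper names `hex_operand_bytes` and `naive_text` are hypothetical and are not part of lopdf; whether lopdf does exactly this internally is an assumption, but it would explain the output in the original report.

```rust
/// Extracts the bytes of the first hex string operand (`<...>`) in a
/// content-stream fragment such as "<44> Tj".
/// Hypothetical helper for illustration, not lopdf API.
fn hex_operand_bytes(fragment: &str) -> Option<Vec<u8>> {
    let start = fragment.find('<')? + 1;
    let end = start + fragment[start..].find('>')?;
    // Whitespace inside a PDF hex string is ignored.
    let mut digits: Vec<char> = fragment[start..end]
        .chars()
        .filter(|c| !c.is_whitespace())
        .collect();
    // An odd number of digits is treated as if a trailing 0 were present.
    if digits.len() % 2 == 1 {
        digits.push('0');
    }
    digits
        .chunks(2)
        .map(|pair| {
            let s: String = pair.iter().collect();
            u8::from_str_radix(&s, 16).ok()
        })
        .collect()
}

/// Interprets the operand bytes directly as Latin-1 text, ignoring the
/// font's Encoding and ToUnicode entries (the assumed faulty behaviour).
fn naive_text(fragment: &str) -> Option<String> {
    let bytes = hex_operand_bytes(fragment)?;
    Some(bytes.iter().map(|&b| b as char).collect())
}
```

Under this reading, `naive_text("<44> Tj")` yields `D`, because 0x44 is the ASCII code for D, even though the font draws that glyph as an A.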
What lopdf would now need to do, which isn't implemented yet, is:
- look up where the font is defined (in this case `9 0 R`)
- either parse the Encoding (at `7 0 R`) or the ToUnicode cmap (at `8 0 R`)
At least in this case, both things contain the information that is needed to properly map `<44>` to the correct character.
As we can see in line 20516 of the PDF, the glyph `<44>` is indeed mapped to the Unicode `<0041>`, which is an `A`.
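The ToUnicode route can be sketched as follows. This is a rough illustration rather than a proposal for lopdf's implementation, and it rests on several assumptions: the function names are hypothetical, the font uses single-byte codes, and the mappings are stored as `bfchar` entries (a real CMap may also use `bfrange`). In `SAMPLE_CMAP`, only the `<44> <0041>` pair comes from the file discussed here; the other two entries are assumed for the example.

```rust
use std::collections::HashMap;

/// Hypothetical ToUnicode fragment; only `<44> <0041>` is taken from the report.
const SAMPLE_CMAP: &str = "1 begincodespacerange
<00> <FF>
endcodespacerange
3 beginbfchar
<44> <0041>
<45> <0042>
<46> <0043>
endbfchar";

/// Decodes a hex string of UTF-16BE code units, e.g. "0041" -> "A".
fn utf16be_hex(hex: &str) -> Option<String> {
    if hex.is_empty() || hex.len() % 4 != 0 || !hex.is_ascii() {
        return None;
    }
    let units: Option<Vec<u16>> = hex
        .as_bytes()
        .chunks(4)
        .map(|chunk| {
            let s = std::str::from_utf8(chunk).ok()?;
            u16::from_str_radix(s, 16).ok()
        })
        .collect();
    String::from_utf16(&units?).ok()
}

/// Collects the `<src> <dst>` pairs found between `beginbfchar` and `endbfchar`.
fn parse_bfchar(cmap: &str) -> HashMap<u32, String> {
    let mut map = HashMap::new();
    let mut in_block = false;
    for line in cmap.lines() {
        let line = line.trim();
        if line.ends_with("beginbfchar") {
            in_block = true;
        } else if line == "endbfchar" {
            in_block = false;
        } else if in_block {
            // Split "<44> <0041>" into its two hex payloads.
            let hex: Vec<&str> = line
                .split(|c: char| c == '<' || c == '>')
                .map(str::trim)
                .filter(|s| !s.is_empty())
                .collect();
            if let [src, dst] = hex.as_slice() {
                if let (Ok(code), Some(text)) = (u32::from_str_radix(src, 16), utf16be_hex(dst)) {
                    map.insert(code, text);
                }
            }
        }
    }
    map
}

/// Maps single-byte character codes to text; unmapped codes become U+FFFD.
fn decode(codes: &[u8], to_unicode: &HashMap<u32, String>) -> String {
    codes
        .iter()
        .map(|&c| to_unicode.get(&u32::from(c)).map(String::as_str).unwrap_or("\u{FFFD}"))
        .collect()
}
```

With the sample map, `decode(&[0x44, 0x45, 0x46], &parse_bfchar(SAMPLE_CMAP))` gives `ABC`, which is what the viewer shows, instead of the `DEF` obtained by reading the bytes directly.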
A solution to #125 may lay some groundwork for this issue to be solved.