lopdf

Wrong letters in pdf

Open ThomasCartier opened this issue 1 year ago • 2 comments

Hi,

The following code reports the wrong letters (each one comes out shifted by 3, for some reason).

The culprit: https://www.dropbox.com/scl/fi/6a8zuy70s05pntvxm0vae/test.pdf?rlkey=ylju1wbavr8rff10jp621u6bo&dl=0

It reports DEF instead of ABC.

    use lopdf::Document;

    fn main() {
        #[cfg(any(feature = "pom_parser", feature = "nom_parser"))] // same result with "pom"
        let doc_res = Document::load("/path/to/test.pdf");

        let mut doc = match doc_res {
            Ok(v) => v,
            Err(_) => return,
        };

        doc.decompress();
        // get_pages() maps page numbers to object ids; extract_text() takes page numbers.
        for (page_number, _object_id) in doc.get_pages() {
            match doc.extract_text(&[page_number]) {
                Ok(text) => println!("{}", text),
                Err(e) => println!("Nope {}", e),
            }
        }
    }

Any idea why? Is anything wrong with my code? Thanks.

ThomasCartier, Oct 29 '23 09:10

Your example PDF file contains DEF as its text content but shows ABC when opened in a PDF viewer. This seems to be done by using the rg/RG operators to draw ABC manually while avoiding text streams, or something like that (I am no PDF expert).

Angr1st, Jun 19 '24 20:06

This behavior is caused by a bug/missing feature of lopdf.

I am relatively sure that the rg/RG operators have nothing to do with this, as they seem to only set the color for whatever comes next in the content stream. (See the PDF 2.0 spec, Annex A: Operator Summary.)

The only part of the PDF that is responsible for rendering the A is:

    /Fo0S0 12.00000 Tf
    <44> Tj

The first line selects the font /Fo0S0 at size 12 (the font is defined in the resource dictionary of the page), <44> is a hex string holding the single character code 0x44, which selects the glyph to be rendered, and Tj tells the reader to render that glyph.
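To make the two operators concrete, here is a minimal sketch in plain Rust (standard library only, deliberately not using lopdf's own content parser). The names `Op`, `decode_hex` and `parse_ops` are made up for illustration, and the tokenizer only understands the two-line snippet quoted above. It also shows that reading the raw code 0x44 as if it were ASCII gives D, which matches the reported output.

```rust
/// Illustrative model of the two operators in the snippet (hypothetical type).
#[derive(Debug, PartialEq)]
enum Op {
    /// `/Name size Tf`: select a font resource and a font size.
    SetFont { name: String, size: f32 },
    /// `<hex> Tj`: show a string given as raw character codes.
    ShowText { codes: Vec<u8> },
}

/// Decode the body of a PDF hex string, e.g. "44" -> [0x44].
/// An odd trailing digit is padded with 0, as the PDF spec prescribes.
/// Assumption: no whitespace inside the string.
fn decode_hex(body: &str) -> Option<Vec<u8>> {
    let digits = body
        .chars()
        .map(|c| c.to_digit(16))
        .collect::<Option<Vec<u32>>>()?;
    Some(
        digits
            .chunks(2)
            .map(|pair| ((pair[0] << 4) | pair.get(1).copied().unwrap_or(0)) as u8)
            .collect(),
    )
}

/// Whitespace-based tokenizer that only recognises Tf and Tj.
/// Everything else is treated as an operand of the next operator.
fn parse_ops(stream: &str) -> Vec<Op> {
    let mut ops = Vec::new();
    let mut operands: Vec<&str> = Vec::new();
    for tok in stream.split_whitespace() {
        match tok {
            "Tf" => {
                if let [name, size] = operands.as_slice() {
                    if let (Some(name), Ok(size)) =
                        (name.strip_prefix('/'), size.parse::<f32>())
                    {
                        ops.push(Op::SetFont { name: name.to_string(), size });
                    }
                }
                operands.clear();
            }
            "Tj" => {
                if let [s] = operands.as_slice() {
                    let body = s.strip_prefix('<').and_then(|s| s.strip_suffix('>'));
                    if let Some(codes) = body.and_then(decode_hex) {
                        ops.push(Op::ShowText { codes });
                    }
                }
                operands.clear();
            }
            other => operands.push(other),
        }
    }
    ops
}

fn main() {
    let ops = parse_ops("/Fo0S0 12.00000 Tf\n<44> Tj");
    for op in &ops {
        println!("{:?}", op);
    }
    // Reading the codes as ASCII ignores the font's encoding entirely.
    if let Some(Op::ShowText { codes }) = ops.last() {
        let naive: String = codes.iter().map(|&b| b as char).collect();
        println!("naive text: {}", naive);
    }
}
```

The point of the sketch is that the operand of Tj is a character code in the selected font, not text; turning it into a character needs the font's encoding information.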

What lopdf would now need to do, which isn't implemented yet, is:

  • look up where the font is defined (in this case 9 0 R)
  • parse either the Encoding (at 7 0 R) or the ToUnicode CMap (at 8 0 R)

At least in this case, both contain the information needed to properly map <44> to the correct character. As we can see in line 20516 of the PDF, the glyph <44> is indeed mapped to the Unicode <0041>, which is an A.
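As a rough illustration of the ToUnicode route, the following standalone Rust sketch (standard library only, not lopdf code) parses the `beginbfchar` ... `endbfchar` section of a CMap and applies it to single-byte character codes. The functions `parse_bfchar` and `decode` are hypothetical, and the CMap text in `main` is an assumption: only the `<44> <0041>` entry is confirmed above, while the entries for `<45>` and `<46>` are guessed from the reported DEF/ABC mismatch. It ignores `bfrange` sections, multi-byte codes and destinations longer than one code point.

```rust
use std::collections::HashMap;

/// Collect `<src> <dst>` pairs found between `beginbfchar` and `endbfchar`.
/// Sketch only: `bfrange` and multi-code-point destinations are not handled.
fn parse_bfchar(cmap: &str) -> HashMap<u32, char> {
    let mut map = HashMap::new();
    let mut in_block = false;
    let mut pending: Option<u32> = None; // source code waiting for its destination
    for tok in cmap.split_whitespace() {
        match tok {
            "beginbfchar" => {
                in_block = true;
                pending = None;
            }
            "endbfchar" => in_block = false,
            _ if in_block => {
                let Some(hex) = tok.strip_prefix('<').and_then(|t| t.strip_suffix('>')) else {
                    continue;
                };
                let Ok(value) = u32::from_str_radix(hex, 16) else {
                    continue;
                };
                match pending.take() {
                    None => pending = Some(value),
                    Some(src) => {
                        if let Some(ch) = char::from_u32(value) {
                            map.insert(src, ch);
                        }
                    }
                }
            }
            _ => {}
        }
    }
    map
}

/// Map single-byte character codes to text; unmapped codes become U+FFFD.
fn decode(codes: &[u8], map: &HashMap<u32, char>) -> String {
    codes
        .iter()
        .map(|&c| map.get(&u32::from(c)).copied().unwrap_or('\u{FFFD}'))
        .collect()
}

fn main() {
    // Assumed CMap excerpt; only the first pair is confirmed by the PDF.
    let cmap = "3 beginbfchar\n<44> <0041>\n<45> <0042>\n<46> <0043>\nendbfchar";
    let map = parse_bfchar(cmap);
    println!("{}", decode(&[0x44, 0x45, 0x46], &map));
}
```

A real implementation would first resolve the page's font resource to find the right CMap stream, and would fall back to the Encoding entry when no ToUnicode CMap is present.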

A solution to #125 may lay some groundwork for this issue to be solved.

Heinenen, Aug 09 '24 00:08