
How to analyze parsing issues with an existing PDF?

Open ronnybremer opened this issue 2 months ago • 30 comments

I am currently using this crate to create new PDFs, which works very well. A new task requires me to use an existing PDF and add text to it (not a form, a simple PDF). So I adapted the code to parse the PDF, get the page ops and add to them. Unfortunately, the entire page was just black after saving it. So I removed my code that adds ops to the page and simply do this:

    let mut warnings: Vec<PdfWarnMsg> = Vec::new();
    let mut pdf = PdfDocument::parse(
        original_pdf,
        &PdfParseOptions {
            fail_on_error: true,
        },
        &mut warnings,
    )
    .map_err(|err| anyhow!(err))?;
    error!("warnings: {:?}", warnings);
    // remove all pages but the first
    pdf.pages = pdf.pages.iter().take(1).map(|page| page.clone()).collect();
    let mut page = pdf.pages[0].clone();
    pdf.pages = vec![];
    // add content to page - currently disabled

    // return the result
    Ok(pdf
        .with_pages(vec![page])
        .save(&PdfSaveOptions::default(), &mut Vec::new()))

expecting the saved PDF to be the first page of the original PDF. However, the page is still fully black: Image

In the above code the following parser warnings are returned:

warnings: [
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Type0 font C2_0 missing DescendantFonts" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: FKZKQC+UniversLTStd-Bold" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: FKZKQC+UniversLTStd" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: FKZKQC+UniversLTStd-Cn" },
PdfWarnMsg { page: 1, op_id: 0, severity: Warning, msg: "Unknown base font: FKZKQC+UniversLTStd-Bold" },
PdfWarnMsg { page: 1, op_id: 0, severity: Warning, msg: "Unknown base font: FKZKQC+UniversLTStd" },
PdfWarnMsg { page: 0, op_id: 0, severity: Error, msg: "Info: unhandled operator 'K'" },
PdfWarnMsg { page: 0, op_id: 978, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 0, op_id: 986, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 0, op_id: 1034, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 0, op_id: 1044, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 0, op_id: 1054, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 0, op_id: 1065, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 0, op_id: 1076, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 0, op_id: 1093, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 1, op_id: 0, severity: Error, msg: "Info: unhandled operator 'K'" },
PdfWarnMsg { page: 1, op_id: 28, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 41, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 1, op_id: 43, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 52, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 60, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 1, op_id: 62, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 70, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 78, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 1, op_id: 80, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 88, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 102, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 1, op_id: 104, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 112, severity: Error, msg: "Info: unhandled operator 'k'" }
]

My first impression is that an outline around the page is missing its width, causing it to be infinite and therefore covering the entire page. What would be the best way to troubleshoot this? Is this more of a lopdf issue?

I can provide the PDF in question, but not in a public space.

Thank you for your advice.

ronnybremer avatar Nov 03 '25 08:11 ronnybremer

PdfWarnMsg { page: 1, op_id: 80, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 0, op_id: 0, severity: Error, msg: "Info: unhandled operator 'K'" },

This is the PDF CMYK operator:

  • c m y k K sets the color for subsequent stroking operations (outlines).
  • c m y k k sets the color for subsequent non-stroking operations (fills).

Right now, I think the CMYK operations were ignored. I don't exactly remember the reason as to why.

It is likely that your PDF has some form of background rectangle, which is encoded in CMYK space. printpdf currently ignores CMYK operations, and therefore it falls back to the default fill color, which is not white, but black.
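To make the two operators concrete, here is a minimal sketch of how the four operands of `k` / `K` could be mapped to RGB. It uses the naive, profile-free formula, and the function name `cmyk_to_rgb` is made up for this example; it is not how printpdf handles (or will handle) CMYK.

```rust
/// Naive device CMYK -> RGB conversion (no ICC profile involved).
/// All components are in the range 0.0..=1.0, as in PDF content streams.
fn cmyk_to_rgb(c: f32, m: f32, y: f32, k: f32) -> (f32, f32, f32) {
    let r = (1.0 - c) * (1.0 - k);
    let g = (1.0 - m) * (1.0 - k);
    let b = (1.0 - y) * (1.0 - k);
    (r, g, b)
}

fn main() {
    // `0 0 0 0 k` selects a white fill.
    println!("{:?}", cmyk_to_rgb(0.0, 0.0, 0.0, 0.0)); // (1.0, 1.0, 1.0)
    // `0 0 0 1 k` selects a black fill.
    println!("{:?}", cmyk_to_rgb(0.0, 0.0, 0.0, 1.0)); // (0.0, 0.0, 0.0)
}
```

If a `0 0 0 0 k` operator is dropped during parsing, the fill that follows keeps the default color (black), which would turn a white background rectangle into a black page.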

This should be fixed in printpdf, lopdf is not the cause.

fschutt avatar Nov 03 '25 10:11 fschutt

Thank you for the info @fschutt. I would not know enough to fix it myself, but I will try to convert the PDF to RGB, if that's even possible. Indeed, there are two little triangles on the first page, which I assume are the ones causing the issue. I'll keep you updated.

ronnybremer avatar Nov 03 '25 11:11 ronnybremer

All right. Converting the PDF from CMYK to RGB with the command

gs -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -dCompatibilityLevel=1.4 -dColorConversionStrategy=/sRGB -dProcessColorModel=/DeviceRGB -sOutputFile=test.pdf test.pdf.original

worked. The PDF now shows the first page upon save.

Only issue is, all text is gone. I do assume it's because of the missing fonts. I tried to add the fonts to the PdfDocument asset list after parsing, but that hasn't helped so far. Maybe adding them to the OS will help, I will continue to do some tests. FYI: I added the fonts directly to the resources.fonts.map by using the same FontId as printed in the warnings. Not sure if that is the correct approach or if the text got already lost during parsing.

I am open to any advice. Thanks!

ronnybremer avatar Nov 03 '25 12:11 ronnybremer

I forgot to add the warnings I received from parse, which have changed since the conversion to RGB:

warnings: [
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Found font data for R13 (3464 bytes)" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read font data" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully loaded font at index 0" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read HEAD table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "MAXP table: 227 glyphs" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Font has 227 glyphs" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read LOCA table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read GLYF table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Failed to create allsorts Font: an expected data value was missing" },
PdfWarnMsg { page: 0, op_id: 0, severity: Error, msg: "Failed to parse font data for R13" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: ECNDYC+UniversLTStd-Cn" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: DRPAUO+UniversLTStd" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: CBRIKF+UniversLTStd-Bold" },
PdfWarnMsg { page: 1, op_id: 0, severity: Warning, msg: "Unknown base font: DRPAUO+UniversLTStd" },
PdfWarnMsg { page: 1, op_id: 0, severity: Warning, msg: "Unknown base font: CBRIKF+UniversLTStd-Bold" }]

ronnybremer avatar Nov 03 '25 12:11 ronnybremer

PdfWarnMsg { 
    page: 0, 
    op_id: 0, 
    severity: Warning, 
    msg: "Failed to create allsorts Font: an expected data value was missing" 
}, 
PdfWarnMsg { 
    page: 0, 
    op_id: 0, 
    severity: Error, 
    msg: "Failed to parse font data for R13" 
},

Mmh, that seems to be the problem. In allsorts, this is a ParseError::MissingValue, but I cannot trace where it's actually thrown.

  1. Try using Inkscape to import the PDF, I just want to confirm that it was the CMYK that was the "black page problem"
  2. Is there now anything on the page (do other Ops get exported properly)? So it's only the fonts missing, right?
  3. I would need to see what's up with the "R13" font. printpdf parses it here:

https://github.com/fschutt/printpdf/blob/ba84a83e9fd2064041c0e31ab90a87b1e01f2874/src/font.rs#L1005-L1013

Clone the repository locally and just std::fs::write(font_bytes) the font to a file. I need to see why the R13 font can't be parsed.
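As a reminder of the call shape, `std::fs::write` takes a path and the contents. The sketch below is standalone: the helper name `dump_font`, the file name, and the placeholder bytes are assumptions for illustration, and in printpdf the slice would be the decoded font stream at the linked location.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write raw font bytes to `path` so they can be inspected with external tools.
fn dump_font(path: &Path, font_bytes: &[u8]) -> io::Result<()> {
    fs::write(path, font_bytes)
}

fn main() -> io::Result<()> {
    // Placeholder bytes; the real data would come from the PDF's font stream.
    let font_bytes: &[u8] = &[0x00, 0x01, 0x00, 0x00];
    let path = std::env::temp_dir().join("r13-font.bin");
    dump_font(&path, font_bytes)?;
    println!("wrote {} bytes to {}", font_bytes.len(), path.display());
    Ok(())
}
```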

printpdf needs to have access to the fonts to get things like the reverse glyph mapping (mapping glyphs back to characters), etc. (TODO: technically we could fall back to the PDF's glyph ID table, but still). Decoding the fonts from the PDF back is rather crucial for the functionality. The font data seems to be found, but there's an error parsing the actual font.

Sorry for the bad experience, those parsing code paths aren't very well tested.

fschutt avatar Nov 03 '25 13:11 fschutt

I shall do that and give you feedback. To your questions:

  1. Inkscape opens it perfectly, but substitutes the fonts as I don't have them installed on my Mac. According to the document properties, no color profile was initially defined, so I assume the CMYK ops came from the initial export done in InDesign.
  2. Yes, all elements but the text are visible and rendered correctly on the saved PDF.
  3. Working on it.

And btw, I don't have a bad experience :) PDFs are harder than they first appear and getting this workflow done is a challenge. But without your crate it would be impossible.

ronnybremer avatar Nov 03 '25 13:11 ronnybremer

Ok, got the font saved. I'll upload it here.

Interestingly, though, the cloned repo panics:

thread 'tokio-runtime-worker' panicked at /home/me/printpdf/src/font.rs:1056:71:
called `Result::unwrap()` on an `Err` value: MissingValue

r13-font.zip

ronnybremer avatar Nov 03 '25 14:11 ronnybremer

~~Hmm. Looks like the GIT version has an issue with my previous code (generating a new PDF from scratch). Haven't changed a bit, PDF comes out at 111k, but opening it gives "document contains only empty pages". Thought I let you know, I'll dig through the changes from 8.2 to main.~~

Please disregard. The debugging I did caused this.

ronnybremer avatar Nov 04 '25 10:11 ronnybremer

I took the saved font file and tried to parse it like this:

    let mut warnings: Vec<PdfWarnMsg> = Vec::new();
    let tst_font_bytes = include_bytes!("/tmp/r13-font");
    let tst_font =
        printpdf::ParsedFont::from_bytes(tst_font_bytes, 0, &mut warnings).unwrap();
    error!("warnings: {:?}", warnings);

strangely, it worked, here are the warnings:

warnings: [
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read font data" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully loaded font at index 0" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read HEAD table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "MAXP table: 44 glyphs" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Font has 44 glyphs" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read LOCA table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully created allsorts Font" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully parsed cmap subtable" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read HMTX data" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read HHEA table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Font metrics: units_per_em=2048" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read GLYF table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Failed to convert glyph 1 to OwnedGlyph from GLYF font" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully decoded 43 glyphs" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Font space width: 569" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Font parsing completed successfully" }]

A single conversion error for a glyph, but otherwise the font gets parsed. I'll dig a bit more into the code to see why the MissingValue error is thrown.

ronnybremer avatar Nov 04 '25 10:11 ronnybremer

Nope, that must have been a fluke in my testing. I could have hit a code path which overwrote the /tmp/r13-font file. Digging further.

ronnybremer avatar Nov 04 '25 12:11 ronnybremer

Added these debug statements to src/font.rs just before line 1056:

use allsorts_subset_browser::font_data::FontData::OpenType;
use allsorts_subset_browser::font_data::FontData::Woff;
use allsorts_subset_browser::font_data::FontData::Woff2;
match font_file {
  OpenType(_) => println!("font_file: OpenType"),
  Woff(_) => println!("font_file: Woff"),
  Woff2(_) => println!("font_file: Woff2"),
}
println!("provider: {:?}", provider.table_tags());

When parsing my own TTF it prints:

font_file: OpenType
provider: Some([1330851634, 1668112752, 1668707360, 1718642541, 1735162214, 1751474532, 1751672161, 1752003704, 1819239265, 1835104368, 1851878757, 1886352244, 1886545264])

During the PDF parsing process it prints:

font_file: OpenType
provider: Some([1668707360, 1718642541, 1735162214, 1751474532, 1751672161, 1752003704, 1819239265, 1835104368, 1886545264])

So obviously a few numbers are missing there in the table tags, don't know what they mean yet.
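Each number is a four-byte table tag read as a big-endian integer, so it can be turned back into its ASCII name. A minimal sketch using only the two lists printed above (the helper names `tag_to_string` and `missing_tags` are made up for this example):

```rust
/// Decode a big-endian u32 table tag into its four-character ASCII name.
fn tag_to_string(tag: u32) -> String {
    tag.to_be_bytes().iter().map(|&b| b as char).collect()
}

/// Names of the tags that are in `full` but not in `subset`.
fn missing_tags(full: &[u32], subset: &[u32]) -> Vec<String> {
    full.iter()
        .filter(|tag| !subset.contains(tag))
        .map(|&tag| tag_to_string(tag))
        .collect()
}

fn main() {
    // Tags printed when parsing the standalone TTF.
    let standalone: [u32; 13] = [
        1330851634, 1668112752, 1668707360, 1718642541, 1735162214, 1751474532,
        1751672161, 1752003704, 1819239265, 1835104368, 1851878757, 1886352244,
        1886545264,
    ];
    // Tags printed during the PDF parsing process.
    let embedded: [u32; 9] = [
        1668707360, 1718642541, 1735162214, 1751474532, 1751672161, 1752003704,
        1819239265, 1835104368, 1886545264,
    ];
    for tag in standalone.iter() {
        let status = if embedded.contains(tag) { "present" } else { "missing" };
        println!("{:<4} {}", tag_to_string(*tag), status);
    }
    println!("missing: {:?}", missing_tags(&standalone, &embedded));
}
```

Decoded this way, the tags absent from the embedded font are `OS/2`, `cmap`, `name` and `post`.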

ronnybremer avatar Nov 04 '25 12:11 ronnybremer

(AI slop)

> So obviously a few numbers are missing there in the table tags, don't know what they mean yet.

| Integer | 4-Byte Hex | ASCII Tag | Table Name | Status in PDF |
|---|---|---|---|---|
| 1330851634 | 4F532F32 | OS/2 | OS/2 and Windows Metrics | Missing |
| 1668112752 | 636D6170 | cmap | Character to Glyph Mapping | Missing |
| 1668707360 | 63767420 | cvt | Control Value Table | Present |
| 1718642541 | 6670676D | fpgm | Font Program | Present |
| 1735162214 | 676C7966 | glyf | Glyph Data | Present |
| 1751474532 | 68656164 | head | Font Header | Present |
| 1751672161 | 68686561 | hhea | Horizontal Header | Present |
| 1752003704 | 686D7478 | hmtx | Horizontal Metrics | Present |
| 1819239265 | 6C6F6361 | loca | Index to Location | Present |
| 1835104368 | 6D617870 | maxp | Maximum Profile | Present |
| 1851878757 | 6E616D65 | name | Naming Table | Missing |
| 1886352244 | 706F7374 | post | PostScript | Missing |
| 1886545264 | 70726570 | prep | Control Value Program | Present |

Of the tables the allsorts parser depends on most, only cmap is actually absent from the embedded font:

  • head (Font Header): Present. It contains global information about the font, like the version number, creation date, and the unitsPerEm value which is essential for scaling glyphs.
  • maxp (Maximum Profile): Present. Contains the number of glyphs in the font, so the parser knows how many glyphs to expect.
  • cmap (Character to Glyph Mapping): Missing. This table maps character codes (like the letter 'A') to glyph indices. It's essential for figuring out which glyph to draw for a given character.
  • glyf (Glyph Data): Present. This table contains the actual vector outlines for each glyph.

This matches the warnings above, where the HEAD, MAXP, LOCA and GLYF tables are read successfully before font creation fails. The parser probably cannot "create" the font object in memory because one of the absent tables (OS/2, cmap, name, post) is required and is not in the byte stream it received from printpdf.

However: The fact that parsing the saved file works, but parsing the stream from the PDF fails, points to one of two possibilities:

  1. Incomplete Font Subsetting: The PDF creator did not embed the full font. It performed "subsetting," which is common to reduce file size. However, it may have created an incomplete or non-compliant subset of the font, stripping out tables that printpdf / allsorts considers mandatory. A more lenient parser (like the one in Inkscape or Adobe Reader) might be able to reconstruct the missing data or work around it.

  2. A Bug in printpdf's Font Extraction: This is also a strong possibility. The code in printpdf that is responsible for finding and extracting the font data stream from the PDF's internal object structure might be flawed. It could be misinterpreting the stream's length or location, resulting in a truncated byte stream being passed to ParsedFont::from_bytes. This truncated stream would lack the tables located towards the end of the original font file, which perfectly explains your observations.

Given that a standalone test with the full font file works, a bug in how printpdf is extracting the font stream from the PDF is the most likely culprit. It is not reading the entire font program from the PDF into the font_bytes slice.

fschutt avatar Nov 04 '25 13:11 fschutt

Definitely subsetting. I ran the PDF through an analyzer:

Font count: 4 Byte size: 11,766 (39.24% of the file)

| object | name | type | encoding | embedded | subset | unicode | bytes |
|---|---|---|---|---|---|---|---|
| 17 0 | FKZKQC+UniversLTStd-Bold | Type 1C | WinAnsi | yes | yes | yes | 3,913 |
| 18 0 | FKZKQC+UniversLTStd | Type 1C | WinAnsi | yes | yes | yes | 3,642 |
| 24 0 | FKZKQC+Webdings | CID TrueType | Identity-H | yes | yes | yes | 2,377 |
| 25 0 | FKZKQC+UniversLTStd-Cn | Type 1C | WinAnsi | yes | yes | yes | 1,834 |

ronnybremer avatar Nov 04 '25 14:11 ronnybremer

Comparing the hexdump of the saved font /tmp/r13-font with the output of the embedded font object in the analyzer, it appears to be the Webdings font having the issue. As you mentioned above, a few parts are missing indeed, for example the font object in the PDF starts with:

74 72 75 65 00 0b 00 80 00 03 00 30 4f 53 2f 32 true.... ...0OS/2
00 00 00 00 00 00 00 bc 00 00 00 56 63 76 74 20 ........ ...Vcvt

while the parsed font starts with:

00 01 00 00 00 09 00 80 00 03 00 10 63 76 74 20 ............cvt

but looking further down I would say ~80% of the hex data is identical.
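The two dumps differ right in the sfnt header, which can be read with a few lines of code. The sketch below parses only the version tag and the table count from the first 12 bytes; the struct and function names are made up for this example, and the byte arrays are the first 16 bytes of each dump quoted above.

```rust
/// The start of an sfnt (TrueType/OpenType) file: a 4-byte version tag
/// followed by big-endian u16 fields. Only the first two fields are kept here.
#[derive(Debug, PartialEq)]
struct SfntHeader {
    version: [u8; 4],
    num_tables: u16,
}

/// Read the header, or return None if fewer than 12 bytes are available.
fn parse_sfnt_header(bytes: &[u8]) -> Option<SfntHeader> {
    if bytes.len() < 12 {
        return None;
    }
    let version = [bytes[0], bytes[1], bytes[2], bytes[3]];
    let num_tables = u16::from_be_bytes([bytes[4], bytes[5]]);
    Some(SfntHeader { version, num_tables })
}

fn main() {
    // Font object as shown by the PDF analyzer.
    let in_pdf: [u8; 16] = [
        0x74, 0x72, 0x75, 0x65, 0x00, 0x0b, 0x00, 0x80,
        0x00, 0x03, 0x00, 0x30, 0x4f, 0x53, 0x2f, 0x32,
    ];
    // Font bytes as saved from inside printpdf.
    let parsed: [u8; 16] = [
        0x00, 0x01, 0x00, 0x00, 0x00, 0x09, 0x00, 0x80,
        0x00, 0x03, 0x00, 0x10, 0x63, 0x76, 0x74, 0x20,
    ];
    println!("{:?}", parse_sfnt_header(&in_pdf));
    println!("{:?}", parse_sfnt_header(&parsed));
}
```

Read this way, the first dump uses the `true` version tag and declares 11 tables, while the second uses version `00 01 00 00` and declares 9 tables, so the two files are different subsets rather than one being a truncated copy of the other.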

ronnybremer avatar Nov 04 '25 14:11 ronnybremer

It looks more and more like gs made some mistakes in converting the PDF to RGB. The font table in the analyzer of the converted file looks like this, a bit different from the original above:

Font count: 4 Byte size: 10,273 (46.04% of the file)

| object | name | type | encoding | embedded | subset | unicode | bytes |
|---|---|---|---|---|---|---|---|
| 8 0 | CBRIKF+UniversLTStd-Bold | Type 1C | WinAnsi | yes | yes | no | 3,713 |
| 10 0 | DRPAUO+UniversLTStd | Type 1C | WinAnsi | yes | yes | no | 3,451 |
| 13 0 | YMADNI+Webdings | CID TrueType | Identity-H | yes | yes | yes | 1,479 |
| 16 0 | ECNDYC+UniversLTStd-Cn | Type 1C | WinAnsi | yes | yes | no | 1,630 |

And when looking at the hexdump of the embedded Webdings font, they look identical. Wow. I will try some other way to convert the PDF to RGB.

ronnybremer avatar Nov 04 '25 14:11 ronnybremer

Yes, printpdf does subset the fonts when the PDF was created with printpdf. Also the fonts included with printpdf are subsetted, but they're usually not included in the PDF. I sadly don't have that much time to debug this right now.

fschutt avatar Nov 04 '25 15:11 fschutt

You have already helped a lot @fschutt! Thank you for that. I will continue my digging and keep you updated here on any findings.

ronnybremer avatar Nov 04 '25 16:11 ronnybremer

I managed to get a bit further. The CMYK issue is fixed and the rectangle is parsed correctly. No more black pages. I also fixed the font reference loading, so it now tries to decode the embedded fonts. However, that is where I am stuck. The fonts are embedded as Type1C, so I would assume they are in CFF format. Unfortunately, the allsorts font library expects those fonts to start with the magic tag `OTTO`, but the embedded fonts don't. According to some research they should be normal OTF fonts and the hexdump seems to confirm that. Even the CFF header is present and intact, just the magic tag is missing.
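One way to tell these cases apart before handing bytes to a parser is to look at the first four bytes. The sketch below is a heuristic only: the enum and function names are made up, and the bare-CFF check simply looks for a plausible CFF header (major version 1, header size of at least 4, offset size between 1 and 4), which is an assumption about what "looks like CFF" and not something allsorts or printpdf does.

```rust
#[derive(Debug, PartialEq)]
enum FontContainer {
    /// sfnt wrapper with CFF outlines, magic "OTTO".
    OpenTypeCff,
    /// sfnt wrapper with TrueType outlines, magic 0x00010000 or "true".
    TrueType,
    /// Looks like a standalone CFF table without any sfnt wrapper.
    BareCff,
    Unknown,
}

/// Guess the container format from the leading bytes.
fn sniff(bytes: &[u8]) -> FontContainer {
    match bytes {
        [b'O', b'T', b'T', b'O', ..] => FontContainer::OpenTypeCff,
        [0, 1, 0, 0, ..] | [b't', b'r', b'u', b'e', ..] => FontContainer::TrueType,
        // CFF header layout: major, minor, hdrSize, offSize.
        [1, _, hdr_size, off_size, ..]
            if *hdr_size >= 4 && (1..=4).contains(off_size) =>
        {
            FontContainer::BareCff
        }
        _ => FontContainer::Unknown,
    }
}

fn main() {
    // Header reported later in this thread: major 1, minor 0, hdr_size 4, off_size 2.
    println!("{:?}", sniff(&[1, 0, 4, 2, 0, 1]));
    println!("{:?}", sniff(b"OTTO\x00\x0b"));
    println!("{:?}", sniff(b"true\x00\x0b"));
}
```

A `BareCff` result would mean the stream has no table directory at all, so it is not an OTF file with a damaged first tag.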

Trying to embed the fonts again in Inkscape also doesn't work, as they are embedded the same way.

ronnybremer avatar Nov 05 '25 17:11 ronnybremer

Hmm, if it's certain that the font is correctly read from the Dictionary (use a PDF debugger tool to extract the font file), then we can implement some kind of "retry" mechanism, i.e. "if the font fails to parse, try auto-fixing the first bytes and see if it parses again". I just want to confirm that this isn't a bug in printpdf, but a bug in how the font got embedded in the file.

fschutt avatar Nov 06 '25 08:11 fschutt

More research and testing done, but I am finding myself in unknown territory here. The embedded fonts are Type1C, so CFF. I can confirm that the stream is correctly decompressed and read. Running it through the CFF parser from allsorts works fine.

        let scope = ReadScope::new(font_bytes);
        let font_file = match scope.read::<CFF<'_>>() {
            Ok(ff) => {
                ff
            }
            Err(e) => {
                return None;
            }
        };

        println!("CFF header: {:?}", font_file.header);
        font_file.name_index.iter().enumerate().for_each(|(i, n)| println!("name_idx {}: {}", i, String::from_utf8_lossy(n)));
        font_file.fonts.iter().for_each(|font| {
            match &font.charset {
                allsorts_subset_browser::cff::Charset::ISOAdobe => println!("ISOAdobe"),
                allsorts_subset_browser::cff::Charset::Expert => println!("Expert"),
                allsorts_subset_browser::cff::Charset::ExpertSubset => println!("ExpertSubset"),
                allsorts_subset_browser::cff::Charset::Custom(custom_charset) => println!("Custom"),
            }
            match &font.data {
                CFFVariant::CID(ciddata) => println!("CID data"),
                CFFVariant::Type1(type1_data) => {
                    println!("Type1 data");
                    match &type1_data.encoding {
                        Encoding::Standard => println!("standard encoding"),
                        Encoding::Expert => println!("expert encoding"),
                        Encoding::Custom(custom_encoding) => println!("custom encoding"),
                    }
                    println!("data privdict: {:#?}", type1_data.private_dict);
                },
            }
            println!("font: {:#?}", font.top_dict);
        });

Result:

CFF header: Header { major: 1, minor: 0, hdr_size: 4, off_size: 2 }
name_idx 0: FKZKQC+UniversLTStd
Custom
Type1 data
standard encoding
data privdict: Dict {
    dict: [
        (
            BlueValues,
            [
                Integer(
                    -19,
                ),
                Integer(
                    19,
                ),
                Integer(
                    720,
                ),
                Integer(
                    21,
                ),
                Integer(
                    -239,
                ),
                Integer(
                    17,
                ),
                Integer(
                    173,
                ),
                Integer(
                    16,
                ),
            ],
        ),
        (
            OtherBlues,
            [
                Integer(
                    274,
                ),
                Integer(
                    9,
                ),
                Integer(
                    137,
                ),
                Integer(
                    1,
                ),
                Integer(
                    -630,
                ),
                Integer(
                    19,
                ),
            ],
        ),
        (
            StdHW,
            [
                Integer(
                    86,
                ),
            ],
        ),
        (
            StdVW,
            [
                Integer(
                    100,
                ),
            ],
        ),
        (
            DefaultWidthX,
            [
                Integer(
                    278,
                ),
            ],
        ),
        (
            NominalWidthX,
            [
                Integer(
                    607,
                ),
            ],
        ),
    ],
    default: PhantomData<allsorts_subset_browser::cff::PrivateDictDefault>,
}
font: Dict {
    dict: [
        (
            Notice,
            [
                Integer(
                    391,
                ),
            ],
        ),
        (
            Weight,
            [
                Integer(
                    388,
                ),
            ],
        ),
        (
            PostScript,
            [
                Integer(
                    392,
                ),
            ],
        ),
        (
            FontBBox,
            [
                Integer(
                    -168,
                ),
                Integer(
                    -250,
                ),
                Integer(
                    992,
                ),
                Integer(
                    947,
                ),
            ],
        ),
        (
            Charset,
            [
                Offset(
                    338,
                ),
            ],
        ),
        (
            CharStrings,
            [
                Offset(
                    393,
                ),
            ],
        ),
        (
            Private,
            [
                Offset(
                    32,
                ),
                Offset(
                    3972,
                ),
            ],
        ),
    ],
    default: PhantomData<allsorts_subset_browser::cff::TopDictDefault>,
}

But there doesn't seem to be any function in the allsorts crate to convert this into an OpenType font, which could then provide the tables printpdf needs to work. It should be possible, though; for example, there is a somewhat older Java-based converter from CFF to OpenType: FontVerter on GitHub.

So the data is available after parsing the CFF, but basically it needs to be reconstructed into a full font. Could the allsorts crate provide this? Or am I completely off track here?

ronnybremer avatar Nov 06 '25 14:11 ronnybremer

> Hmm, if it's certain that the font is correctly read from the Dictionary (use a PDF debugger tool to extract the font file), then we can implement some kind of "retry" mechanism, i.e. "if the font fails to parse, try auto-fixing the first bytes and see if it parses again". I just want to confirm that this isn't a bug in printpdf, but a bug in how the font got embedded in the file.

Unfortunately, adding the OTTO tag in front of the embedded font doesn't work, as CFF is a different layout than OpenType. Tried that and failed.

ronnybremer avatar Nov 06 '25 14:11 ronnybremer

Thank you @fschutt for accepting the PR. I have found a workaround for the moment, which allows me to correctly produce the PDF I need.

  1. parse the PDF
  2. embedded CFF fonts are found but not parsed
  3. add those fonts to the PDF with add_font
  4. iterate over all pages and operations, replacing the font_id in text write and font size operations
  5. also replacing the known unicode characters with the correct ones
  6. save the PDF

Without any chance of correctly parsing the embedded fonts I see no other solution. It does allow me to continue with my development, as the PDFs I use as a base are static and known.
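Step 4 of the workaround can be sketched roughly as below. The `Op` enum here is a hypothetical stand-in defined only for this example; it is not printpdf's `Op` type, and the font ids are placeholders. Step 5 (replacing characters) is left out.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified page operation; NOT printpdf's real `Op` type.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    SetFont { font_id: String, size: f32 },
    WriteText { font_id: String, text: String },
    Other,
}

/// Replace every font id that appears in `mapping` on all pages.
fn remap_fonts(pages: &mut [Vec<Op>], mapping: &HashMap<String, String>) {
    for ops in pages.iter_mut() {
        for op in ops.iter_mut() {
            match op {
                Op::SetFont { font_id, .. } | Op::WriteText { font_id, .. } => {
                    if let Some(new_id) = mapping.get(font_id.as_str()) {
                        *font_id = new_id.clone();
                    }
                }
                Op::Other => {}
            }
        }
    }
}

fn main() {
    let mut pages = vec![vec![
        Op::SetFont { font_id: "embedded-subset".into(), size: 12.0 },
        Op::WriteText { font_id: "embedded-subset".into(), text: "Hello".into() },
        Op::Other,
    ]];
    // Map the unparsed embedded font to a full font added separately.
    let mut mapping = HashMap::new();
    mapping.insert("embedded-subset".to_string(), "full-font".to_string());
    remap_fonts(&mut pages, &mapping);
    println!("{:?}", pages);
}
```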

ronnybremer avatar Nov 07 '25 15:11 ronnybremer

If allsorts parses it, it should be possible to create a TTF font using a subset::whole_font - https://docs.rs/allsorts/latest/allsorts/subset/fn.whole_font.html

fschutt avatar Nov 07 '25 22:11 fschutt

Amazing how you can find functions like that in the allsorts crate, even though its documentation is quite limited. I will look into it, maybe that's the missing piece. Could you please also take a look at #249? Thank you!

ronnybremer avatar Nov 08 '25 10:11 ronnybremer

According to my current discussion with @wezm, I would need to alter my approach. It has become apparent that the font cannot be fully reconstructed from the embedded CFF info (which is logical, as it's subsetted font info to begin with). So I might need to do a bit more work:

introduce a new structure EmbeddedCFFFont which represents a subset of the information of BuiltinOrParsedFont, but enough to embed it again in the resulting PDF and use it for char mapping during parsing. During serialize embed the CFF font again, unaltered, keeping the name.

The user of the crate can not add any more text with that font anyway, as it doesn't contain all glyphs. They might get away with some edge cases, where the added text falls into the subsetted cmap. If more text needs to be added, the full font needs to be added to the PDF, just as I do at the moment as a workaround. It is, however, treated as a separate font. Alternatively, a function to replace a subsetted font with a full font could be supplied, essentially replacing the font in the lookup tables.

Parsed text will be carried over into serialize, having the same positions, heights, widths, kerning etc, and will be written with the parsed cmap from the CFF font. As no additional characters have been introduced, each mapping should succeed.
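A rough sketch of what such a structure could hold, under the assumptions of this proposal. Everything here is hypothetical: the field names, the `code_to_char` map (standing in for whatever mapping the parsed CFF charset and encoding provide), and both methods are made up for illustration and exist in neither printpdf nor allsorts.

```rust
use std::collections::BTreeMap;

/// Hypothetical structure following the proposal above.
struct EmbeddedCffFont {
    /// Font name as found in the PDF, kept unchanged.
    name: String,
    /// Raw CFF program, to be re-embedded unaltered on save.
    cff_bytes: Vec<u8>,
    /// Character code -> Unicode character, for the subset's glyphs only.
    code_to_char: BTreeMap<u16, char>,
}

impl EmbeddedCffFont {
    /// Decode parsed character codes; None if any code is not in the subset.
    fn decode(&self, codes: &[u16]) -> Option<String> {
        codes.iter().map(|c| self.code_to_char.get(c).copied()).collect()
    }

    /// True if every character of `text` is covered by the subset.
    fn can_encode(&self, text: &str) -> bool {
        text.chars().all(|ch| self.code_to_char.values().any(|&v| v == ch))
    }
}

fn main() {
    let mut code_to_char = BTreeMap::new();
    code_to_char.insert(1, 'H');
    code_to_char.insert(2, 'i');
    let font = EmbeddedCffFont {
        name: "FKZKQC+UniversLTStd".to_string(),
        cff_bytes: vec![1, 0, 4, 2], // placeholder, not a real font
        code_to_char,
    };
    println!("{} ({} bytes)", font.name, font.cff_bytes.len());
    println!("{:?}", font.decode(&[1, 2]));
    println!("{}", font.can_encode("Ho"));
}
```

The `can_encode` check corresponds to the edge case described above: new text only works with the subset font if all of its characters already have glyphs.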

At least that's how it looks to me at the moment. What do you think?

ronnybremer avatar Nov 11 '25 10:11 ronnybremer

> At least that's how it looks to me at the moment. What do you think?

Well, I know that the font is subset, but that's not the issue: The font doesn't parse and that's either:

  1. A problem with the PDF or the program that made the PDF
  2. A problem with the byte extraction, i.e. not all of the font bytes are extracted from the PDF
  3. A problem with the allsorts parser

What printpdf should do is take the subsetted font, parse it and reconstruct the actual text content. Then, when serializing again, it does the reverse (taking the text, mapping it again to glyphs and building the subsetted font, which in this case would contain 100% of the original glyphs).

In emergency cases, it would be possible to fallback to the GID map that glyph-encoded PDFs should embed. However, I'm simply not relying on that to be present, so I re-construct it from the font right now (I think).

fschutt avatar Nov 11 '25 23:11 fschutt

You can rule out items 2 and 3 as follows:

  • Use mutool extract on the PDF to extract the fonts. This will allow you to compare what you extracted to a known good extraction. mutool is part of https://mupdf.com/core. It's the mupdf-tools package on Arch Linux.
  • Parse the font with allsorts-tools: allsorts dump path/to/font. If the font extracted from the PDF is a CFF font, you can pass --cff to allsorts dump to treat the file as a standalone CFF table instead of full OpenType font.

wezm avatar Nov 11 '25 23:11 wezm

> If the font extracted from the PDF is a CFF font, you can pass --cff to allsorts dump to treat the file as a standalone CFF table instead of full OpenType font.

Ah, but that might be the issue. printpdf doesn't handle these cases, it always expects a full font.

fschutt avatar Nov 11 '25 23:11 fschutt

> • Use mutool extract on the PDF to extract the fonts. This will allow you to compare what you extracted to a known good extraction. mutool is part of https://mupdf.com/core. It's the mupdf-tools package on Arch Linux.

mutool finds exactly the same fonts as printpdf does:

warning: ICC support is not available
extracting font-0019.cff
extracting font-0022.cff
extracting font-0034.ttf
extracting font-0039.cff

Running `file` on those:

font-0019.cff: data
font-0022.cff: data
font-0034.ttf: TrueType Font data, 11 tables, 1st "OS/2", 12 names, Macintosh, \251 2006 Microsoft Corporation. All Rights Reserved.FKZKQC+WebdingsFKZKQC+WebdingsFKZKQC+Webdin
font-0039.cff: data

> • Parse the font with allsorts-tools: allsorts dump path/to/font. If the font extracted from the PDF is a CFF font, you can pass --cff to allsorts dump to treat the file as a standalone CFF table instead of full OpenType font.

The result of this command, run on the font data I saved for debugging inside printpdf after reading the embedded font stream:

- CFF:
 - version: 1.0
 - name: FKZKQC+UniversLTStd-Cn
 - num glyphs: 23
 - charset: Custom
 - variant: Type 1

 - Top DICT
  - Notice: Copyright 1987, 1991, 1994, 1998, 2002 Adobe Systems Incorporated. All Rights Reserved. Univers is a trademark of Heidelberger Druckmaschinen AG, exclusively licensed through Linotype Library GmbH, and may be registered in certain jurisdictions.
  - Weight: Regular
  - PostScript: [Integer(392)]
  - FontBBox: [Integer(-166), Integer(-250), Integer(1000), Integer(989)]
  - Charset: [Offset(335)]
  - CharStrings: [Offset(375)]
  - Private: [Offset(59), Offset(1926)]

 - encoding: Standard

 - Private DICT
  - BlueValues: [Integer(-15), Integer(15), Integer(722), Integer(15), Integer(-232), Integer(10), Integer(179), Integer(11)]
  - OtherBlues: [Integer(422), Integer(15), Integer(-628), Integer(15)]
  - FamilyBlues: [Integer(-19), Integer(19), Integer(720), Integer(21), Integer(-239), Integer(17), Integer(173), Integer(16)]
  - FamilyOtherBlues: [Integer(274), Integer(9), Integer(137), Integer(1), Integer(-630), Integer(19)]
  - StdHW: [Integer(62)]
  - StdVW: [Integer(82)]
  - StemSnapH: [Integer(62), Integer(18)]
  - StemSnapV: [Integer(82), Integer(10)]
  - DefaultWidthX: [Integer(444)]
  - NominalWidthX: [Integer(551)]
 - Local subrs: 0 (0 bytes)
 - Global subrs: 0 (0 bytes)

This info I also received when decoding the embedded font data during my first attempts. So I do assume the font stream is correctly decoded by printpdf and allsorts can parse the CFF data correctly.

ronnybremer avatar Nov 12 '25 09:11 ronnybremer

> If the font extracted from the PDF is a CFF font, you can pass --cff to allsorts dump to treat the file as a standalone CFF table instead of full OpenType font.

> Ah, but that might be the issue. printpdf doesn't handle these cases, it always expects a full font.

This is also where I am right now. I guess support for that has to be introduced into printpdf, as the relevant data is present in the PDF.

ronnybremer avatar Nov 12 '25 09:11 ronnybremer