How to analyze parsing issues with an existing PDF?
I am currently using this crate to create new PDFs, which works very well. A new task requires me to take an existing PDF and add text to it (not a form, a simple PDF). So I adapted the code to parse the PDF, get the page ops and append to them. Unfortunately, the entire page was just black after saving. So I removed my code that adds ops to the page and now simply do this:
```rust
let mut warnings: Vec<PdfWarnMsg> = Vec::new();
let mut pdf = PdfDocument::parse(
    original_pdf,
    &PdfParseOptions {
        fail_on_error: true,
    },
    &mut warnings,
)
.map_err(|err| anyhow!(err))?;
error!("warnings: {:?}", warnings);

// remove all pages but the first
pdf.pages = pdf.pages.iter().take(1).cloned().collect();
let mut page = pdf.pages[0].clone();
pdf.pages = vec![];

// add content to page - currently disabled

// return the result
Ok(pdf
    .with_pages(vec![page])
    .save(&PdfSaveOptions::default(), &mut Vec::new()))
```
expecting the saved PDF to be the first page of the original PDF. However, the page is still fully black.
In the above code the following parser warnings are returned:
```
warnings: [
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Type0 font C2_0 missing DescendantFonts" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: FKZKQC+UniversLTStd-Bold" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: FKZKQC+UniversLTStd" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: FKZKQC+UniversLTStd-Cn" },
PdfWarnMsg { page: 1, op_id: 0, severity: Warning, msg: "Unknown base font: FKZKQC+UniversLTStd-Bold" },
PdfWarnMsg { page: 1, op_id: 0, severity: Warning, msg: "Unknown base font: FKZKQC+UniversLTStd" },
PdfWarnMsg { page: 0, op_id: 0, severity: Error, msg: "Info: unhandled operator 'K'" },
PdfWarnMsg { page: 0, op_id: 978, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 0, op_id: 986, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 0, op_id: 1034, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 0, op_id: 1044, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 0, op_id: 1054, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 0, op_id: 1065, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 0, op_id: 1076, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 0, op_id: 1093, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 1, op_id: 0, severity: Error, msg: "Info: unhandled operator 'K'" },
PdfWarnMsg { page: 1, op_id: 28, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 41, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 1, op_id: 43, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 52, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 60, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 1, op_id: 62, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 70, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 78, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 1, op_id: 80, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 88, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 102, severity: Error, msg: "Warning: 'EMC' with no current_layer" },
PdfWarnMsg { page: 1, op_id: 104, severity: Error, msg: "Info: unhandled operator 'k'" },
PdfWarnMsg { page: 1, op_id: 112, severity: Error, msg: "Info: unhandled operator 'k'" }
]
```
My first impression is that an outline around the page is missing its width, causing it to be infinite and therefore covering the entire page. What would be the best way to troubleshoot this? Or is this more a lopdf issue?
I can provide the PDF in question, but not in a public space.
Thank you for your advice.
> PdfWarnMsg { page: 1, op_id: 80, severity: Error, msg: "Info: unhandled operator 'k'" },
> PdfWarnMsg { page: 0, op_id: 0, severity: Error, msg: "Info: unhandled operator 'K'" },
These are the PDF CMYK color operators:

- `c m y k K` sets the color for subsequent stroking operations (outlines).
- `c m y k k` sets the color for subsequent non-stroking operations (fills).
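As a quick sketch of what dispatching these operators could look like (the function name and output strings here are illustrative, not printpdf's actual API):

```rust
// Hypothetical sketch of dispatching the two CMYK operators. The operands
// precede the operator in a PDF content stream, e.g. "0 0 0 1 k".
fn handle_color_op(op: &str, operands: &[f32]) -> String {
    match (op, operands) {
        // "c m y k k" sets the non-stroking (fill) color
        ("k", [c, m, y, k]) => format!("fill color: CMYK({c}, {m}, {y}, {k})"),
        // "c m y k K" sets the stroking (outline) color
        ("K", [c, m, y, k]) => format!("stroke color: CMYK({c}, {m}, {y}, {k})"),
        _ => format!("unhandled operator '{op}'"),
    }
}
```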
Right now, I think the CMYK operations are simply ignored. I don't exactly remember why.

It is likely that your PDF has some form of background rectangle which is encoded in CMYK space. printpdf currently ignores CMYK operations, and therefore it falls back to the default fill color, which is not white, but black.

This should be fixed in printpdf; lopdf is not the cause.
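For context, the usual naive (non-color-managed) CMYK to RGB conversion that a fix could use as a first approximation looks like this; note that it is a simplification that ignores ICC profiles:

```rust
// Naive CMYK -> RGB conversion, all components in 0.0..=1.0.
// CMYK (0, 0, 0, 0) is white, while a default-initialized RGB (0, 0, 0)
// is black, which matches the "black page" symptom described above.
fn cmyk_to_rgb(c: f32, m: f32, y: f32, k: f32) -> (f32, f32, f32) {
    (
        (1.0 - c) * (1.0 - k), // red
        (1.0 - m) * (1.0 - k), // green
        (1.0 - y) * (1.0 - k), // blue
    )
}
```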
Thank you for the info @fschutt. I would not know enough to fix it myself, but I will try to convert the PDF to RGB, if that's even possible. Indeed, there are two little triangles on the first page, which I assume are the ones causing the issue. I'll keep you updated.
Alright. Converting the PDF from CMYK to RGB with the command

```shell
gs -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -dCompatibilityLevel=1.4 -dColorConversionStrategy=/sRGB -dProcessColorModel=/DeviceRGB -sOutputFile=test.pdf test.pdf.original
```
worked. The PDF now shows the first page upon save.
The only issue is that all text is gone. I assume it's because of the missing fonts. I tried to add the fonts to the PdfDocument asset list after parsing, but that hasn't helped so far. Maybe adding them to the OS will help; I will continue to do some tests.

FYI: I added the fonts directly to `resources.fonts.map` using the same FontId as printed in the warnings. Not sure if that is the correct approach or if the text was already lost during parsing.
I am open to any advice. Thanks!
I forgot to add the warnings I received from parse, which have changed since the conversion to RGB:
```
warnings: [
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Found font data for R13 (3464 bytes)" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read font data" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully loaded font at index 0" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read HEAD table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "MAXP table: 227 glyphs" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Font has 227 glyphs" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read LOCA table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read GLYF table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Failed to create allsorts Font: an expected data value was missing" },
PdfWarnMsg { page: 0, op_id: 0, severity: Error, msg: "Failed to parse font data for R13" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: ECNDYC+UniversLTStd-Cn" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: DRPAUO+UniversLTStd" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Unknown base font: CBRIKF+UniversLTStd-Bold" },
PdfWarnMsg { page: 1, op_id: 0, severity: Warning, msg: "Unknown base font: DRPAUO+UniversLTStd" },
PdfWarnMsg { page: 1, op_id: 0, severity: Warning, msg: "Unknown base font: CBRIKF+UniversLTStd-Bold" }
]
```
> PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Failed to create allsorts Font: an expected data value was missing" },
> PdfWarnMsg { page: 0, op_id: 0, severity: Error, msg: "Failed to parse font data for R13" },
Mmh, that seems to be the problem. In allsorts, this is a ParseError::MissingValue, but I cannot trace where it's actually thrown.
- Try using Inkscape to import the PDF, I just want to confirm that it was the CMYK that was the "black page problem"
- Is there now anything on the page (do other Ops get exported properly)? So it's only the fonts missing, right?
- I would need to see what's up with the "R13" font.
printpdf parses it here:
https://github.com/fschutt/printpdf/blob/ba84a83e9fd2064041c0e31ab90a87b1e01f2874/src/font.rs#L1005-L1013
Clone the repository locally and just `std::fs::write` the font bytes to a file. I need to see why the R13 font can't be parsed.
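For reference, `std::fs::write` takes a path plus the contents, so the debug dump could look like this (the path and insertion point are arbitrary):

```rust
// Hypothetical debug dump: write the raw bytes the parser is about to hand
// to allsorts out to a file for inspection.
fn dump_font_bytes(path: &std::path::Path, font_bytes: &[u8]) -> std::io::Result<()> {
    std::fs::write(path, font_bytes)
}
```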
printpdf needs access to the fonts to get things like the reverse glyph mapping (mapping glyphs back to characters), etc. (TODO: technically we could fall back to the PDF's glyph ID table, but still.) Decoding the fonts back out of the PDF is rather crucial for the functionality. The font data seems to be found, but there's an error parsing the actual font.
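The "reverse glyph mapping" mentioned here is just the inversion of the font's character-to-glyph table. A minimal sketch, assuming a simple one-to-one cmap:

```rust
use std::collections::BTreeMap;

// Invert a char -> glyph-id mapping into glyph-id -> char, which is what is
// needed to turn glyph ids found in the content stream back into text.
fn reverse_cmap(cmap: &BTreeMap<char, u16>) -> BTreeMap<u16, char> {
    cmap.iter().map(|(&ch, &gid)| (gid, ch)).collect()
}
```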
Sorry for the bad experience, those parsing code paths aren't very well tested.
I shall do that and give you feedback. To your questions:
- Inkscape opens it perfectly, but substitutes the fonts as I don't have them installed on my Mac. According to the document properties, no color profile was initially defined, so I assume the CMYK ops came from the initial export done in InDesign.
- Yes, all elements but the text are visible and rendered correctly on the saved PDF.
- Working on it.
And btw, I don't have a bad experience :) PDFs are harder than they look, and getting this workflow done is a challenge. But without your crate it would be impossible.
Ok, got the font saved. I'll upload it here.
Interestingly, though, the cloned repo panics:
```
thread 'tokio-runtime-worker' panicked at /home/me/printpdf/src/font.rs:1056:71:
called `Result::unwrap()` on an `Err` value: MissingValue
```
~~Hmm. Looks like the GIT version has an issue with my previous code (generating a new PDF from scratch). Haven't changed a bit, PDF comes out at 111k, but opening it gives "document contains only empty pages". Thought I let you know, I'll dig through the changes from 8.2 to main.~~
Please disregard. The debugging I did caused this.
I took the saved font file and tried to parse it like this:
```rust
let mut warnings: Vec<PdfWarnMsg> = Vec::new();
let tst_font_bytes = include_bytes!("/tmp/r13-font");
let tst_font = printpdf::ParsedFont::from_bytes(tst_font_bytes, 0, &mut warnings).unwrap();
error!("warnings: {:?}", warnings);
```
Strangely, it worked. Here are the warnings:
```
warnings: [
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read font data" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully loaded font at index 0" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read HEAD table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "MAXP table: 44 glyphs" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Font has 44 glyphs" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read LOCA table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully created allsorts Font" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully parsed cmap subtable" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read HMTX data" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read HHEA table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Font metrics: units_per_em=2048" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully read GLYF table" },
PdfWarnMsg { page: 0, op_id: 0, severity: Warning, msg: "Failed to convert glyph 1 to OwnedGlyph from GLYF font" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Successfully decoded 43 glyphs" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Font space width: 569" },
PdfWarnMsg { page: 0, op_id: 0, severity: Info, msg: "Font parsing completed successfully" }
]
```
A single conversion error for one glyph, but otherwise the font gets parsed. I'll dig a bit more into the code to see where this MissingValue error is thrown.
Nope, that must have been a fluke in my testing. I could have hit a code path which overwrote the /tmp/r13-font file. Digging further.
Added these debug statements to src/font.rs just before line 1056:
```rust
use allsorts_subset_browser::font_data::FontData::{OpenType, Woff, Woff2};

match font_file {
    OpenType(_) => println!("font_file: OpenType"),
    Woff(_) => println!("font_file: Woff"),
    Woff2(_) => println!("font_file: Woff2"),
}
println!("provider: {:?}", provider.table_tags());
```
When parsing my own TTF it prints:
```
font_file: OpenType
provider: Some([1330851634, 1668112752, 1668707360, 1718642541, 1735162214, 1751474532, 1751672161, 1752003704, 1819239265, 1835104368, 1851878757, 1886352244, 1886545264])
```
During the PDF parsing process it prints:
```
font_file: OpenType
provider: Some([1668707360, 1718642541, 1735162214, 1751474532, 1751672161, 1752003704, 1819239265, 1835104368, 1886545264])
```
So obviously a few numbers are missing from the table tags; I don't know yet what they mean.
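These numbers are OpenType table tags: four ASCII bytes packed big-endian into a `u32`. They can be decoded with a small helper like this:

```rust
// Decode an OpenType table tag: four ASCII bytes packed big-endian in a u32.
fn tag_to_string(tag: u32) -> String {
    tag.to_be_bytes().iter().map(|&b| b as char).collect()
}
```

For example, `1668707360` decodes to `"cvt "` and `1330851634` to `"OS/2"`; decoding both provider lists this way shows exactly which tables the embedded subset lacks.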
(AI slop)
So obviously a few numbers are missing there in the table tags, don't know what they mean yet.
| Integer | 4-Byte Hex | ASCII Tag | Table Name | Status in PDF |
|---|---|---|---|---|
| 1330851634 | 4F532F32 | OS/2 | OS/2 and Windows Metrics | Missing |
| 1668112752 | 636d6170 | cmap | Character to Glyph Mapping | Missing |
| 1668707360 | 63767420 | cvt | Control Value Table | Present |
| 1718642541 | 6670676d | fpgm | Font Program | Present |
| 1735162214 | 676c7966 | glyf | Glyph Data | Present |
| 1751474532 | 68656164 | head | Font Header | Present |
| 1751672161 | 68686561 | hhea | Horizontal Header | Present |
| 1752003704 | 686d7478 | hmtx | Horizontal Metrics | Present |
| 1819239265 | 6c6f6361 | loca | Index to Location | Present |
| 1835104368 | 6d617870 | maxp | Maximum Profile | Present |
| 1851878757 | 6e616d65 | name | Naming Table | Missing |
| 1886352244 | 706f7374 | post | PostScript | Missing |
| 1886545264 | 70726570 | prep | Control Value Program | Present |

So the tables missing from the PDF-embedded byte stream are:

- OS/2 (OS/2 and Windows Metrics): global metrics and style flags.
- cmap (Character to Glyph Mapping): maps character codes (like the letter 'A') to glyph indices. It's essential for figuring out which glyph to draw for a given character, and the most likely table allsorts is missing when it refuses to construct its Font.
- name (Naming Table): the font's name records.
- post (PostScript): glyph names and other PostScript-related data.

The outline and metric tables (head, maxp, loca, glyf, hhea, hmtx, plus cvt/fpgm/prep for hinting) are all present, so the glyph data itself survived.
The parser cannot "create" the font object in memory because the foundational data it needs is absent from the byte stream it received from printpdf.
However: The fact that parsing the saved file works, but parsing the stream from the PDF fails, points to one of two possibilities:
- Incomplete Font Subsetting: The PDF creator did not embed the full font. It performed "subsetting", which is common to reduce file size. However, it may have created an incomplete or non-compliant subset of the font, stripping out tables that printpdf/allsorts considers mandatory. A more lenient parser (like the one in Inkscape or Adobe Reader) might be able to reconstruct the missing data or work around it.
- A Bug in printpdf's Font Extraction: This is also a strong possibility. The code in printpdf that is responsible for finding and extracting the font data stream from the PDF's internal object structure might be flawed. It could be misinterpreting the stream's length or location, resulting in a truncated byte stream being passed to `ParsedFont::from_bytes`. This truncated stream would lack the tables located towards the end of the original font file, which would explain the observations.
Given that a standalone test with the full font file works, a bug in how printpdf is extracting the font stream from the PDF is the most likely culprit. It is not reading the entire font program from the PDF into the font_bytes slice.
Definitely subsetting. I ran the PDF through an analyzer:
Font count: 4, byte size: 11,766 (39.24% of the file)
| object | name | type | encoding | embedded | subset | unicode | bytes |
|---|---|---|---|---|---|---|---|
| 17 0 | FKZKQC+UniversLTStd-Bold | Type 1C | WinAnsi | yes | yes | yes | 3,913 |
| 18 0 | FKZKQC+UniversLTStd | Type 1C | WinAnsi | yes | yes | yes | 3,642 |
| 24 0 | FKZKQC+Webdings | CID TrueType | Identity-H | yes | yes | yes | 2,377 |
| 25 0 | FKZKQC+UniversLTStd-Cn | Type 1C | WinAnsi | yes | yes | yes | 1,834 |
Comparing the hexdump of the saved font /tmp/r13-font with the output of the embedded font object in the analyzer, it appears to be the Webdings font that has the issue. As you mentioned above, a few parts are indeed missing; for example, the font object in the PDF starts with:
```
74 72 75 65 00 0b 00 80 00 03 00 30 4f 53 2f 32  true.......0OS/2
00 00 00 00 00 00 00 bc 00 00 00 56 63 76 74 20  ...........Vcvt
```
while the parsed font starts with:
```
00 01 00 00 00 09 00 80 00 03 00 10 63 76 74 20  ............cvt
```
but looking further down I would say ~80% of the hex data is identical.
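Those differing first bytes are the sfnt header: a 4-byte version tag (`true` is Apple's TrueType tag, `00 01 00 00` is the standard TrueType version) followed by a big-endian `u16` table count (`00 0b` = 11 tables in the original vs `00 09` = 9 in the re-saved font). A small sketch to read it:

```rust
// Read the sfnt version tag and table count from the first 6 bytes of a font.
fn read_sfnt_header(data: &[u8]) -> Option<(u32, u16)> {
    let version = u32::from_be_bytes(data.get(0..4)?.try_into().ok()?);
    let num_tables = u16::from_be_bytes(data.get(4..6)?.try_into().ok()?);
    Some((version, num_tables))
}
```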
It looks more and more like gs made some mistakes when converting the PDF to RGB. The font table in the analyzer for the converted file looks like this, a bit different from the original above:

Font count: 4, byte size: 10,273 (46.04% of the file)
| object | name | type | encoding | embedded | subset | unicode | bytes |
|---|---|---|---|---|---|---|---|
| 8 0 | CBRIKF+UniversLTStd-Bold | Type 1C | WinAnsi | yes | yes | no | 3,713 |
| 10 0 | DRPAUO+UniversLTStd | Type 1C | WinAnsi | yes | yes | no | 3,451 |
| 13 0 | YMADNI+Webdings | CID TrueType | Identity-H | yes | yes | yes | 1,479 |
| 16 0 | ECNDYC+UniversLTStd-Cn | Type 1C | WinAnsi | yes | yes | no | 1,630 |
And when looking at the hexdump of the embedded Webdings font, they look identical. Wow. I will try some other way to convert the PDF to RGB.
Yes, printpdf does subset the fonts when the PDF was created with printpdf. The fonts shipped with printpdf are also subsetted, but they're usually not included in the PDF. Sadly I don't have much time to debug this right now.
You have already helped a lot @fschutt! Thank you for that. I will continue my digging and keep you updated here on any findings.
I managed to get a bit further. The CMYK issue is fixed and the rectangle is parsed correctly. No more black pages.
Fixed the font reference loading, and it now tries to decode the embedded fonts. However, there I am stuck. The fonts are embedded as Type1C, so I would assume they are in CFF format. Unfortunately, the allsorts font library expects those fonts to start with the magic tag `OTTO`, but the embedded fonts don't. According to some research they should be normal OTF fonts, and the hexdump seems to confirm that. Even the CFF header is present and intact; just the magic tag is missing.

Re-embedding the fonts with Inkscape also doesn't help, as they end up embedded the same way.
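A minimal check for the situation described here (a hypothetical helper; a real fix would need to wrap the bare CFF table in an sfnt container rather than just detect it):

```rust
// A bare CFF table (as embedded for /Subtype /Type1C) starts with its own
// header: major (1 for CFF 1.0), minor, hdrSize, offSize. A full OpenType/CFF
// file instead starts with the sfnt tag "OTTO".
fn looks_like_bare_cff(data: &[u8]) -> bool {
    !data.starts_with(b"OTTO") && data.first() == Some(&1)
}
```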
Hmm, if it's certain that the font is correctly read from the Dictionary (use a PDF debugger tool to extract the font file), then we can implement some kind of "retry" mechanism, i.e. "if the font fails to parse, try auto-fixing the first bytes and see if it parses again". I just want to confirm that this isn't a bug in printpdf, but a bug in how the font got embedded in the file.
More research and testing done, but I am finding myself in unknown territory here.

The embedded fonts are Type1C, so CFF. I can confirm that the stream is correctly deflated and read. Running it through the CFF parser from allsorts works fine:
```rust
let scope = ReadScope::new(font_bytes);
let font_file = match scope.read::<CFF<'_>>() {
    Ok(ff) => ff,
    Err(_e) => return None,
};
println!("CFF header: {:?}", font_file.header);
font_file
    .name_index
    .iter()
    .enumerate()
    .for_each(|(i, n)| println!("name_idx {}: {}", i, String::from_utf8_lossy(n)));
font_file.fonts.iter().for_each(|font| {
    match &font.charset {
        allsorts_subset_browser::cff::Charset::ISOAdobe => println!("ISOAdobe"),
        allsorts_subset_browser::cff::Charset::Expert => println!("Expert"),
        allsorts_subset_browser::cff::Charset::ExpertSubset => println!("ExpertSubset"),
        allsorts_subset_browser::cff::Charset::Custom(_) => println!("Custom"),
    }
    match &font.data {
        CFFVariant::CID(_) => println!("CID data"),
        CFFVariant::Type1(type1_data) => {
            println!("Type1 data");
            match &type1_data.encoding {
                Encoding::Standard => println!("standard encoding"),
                Encoding::Expert => println!("expert encoding"),
                Encoding::Custom(_) => println!("custom encoding"),
            }
            println!("data privdict: {:#?}", type1_data.private_dict);
        }
    }
    println!("font: {:#?}", font.top_dict);
});
```
Result:
```
CFF header: Header { major: 1, minor: 0, hdr_size: 4, off_size: 2 }
name_idx 0: FKZKQC+UniversLTStd
Custom
Type1 data
standard encoding
data privdict: Dict {
    dict: [
        (BlueValues, [Integer(-19), Integer(19), Integer(720), Integer(21), Integer(-239), Integer(17), Integer(173), Integer(16)]),
        (OtherBlues, [Integer(274), Integer(9), Integer(137), Integer(1), Integer(-630), Integer(19)]),
        (StdHW, [Integer(86)]),
        (StdVW, [Integer(100)]),
        (DefaultWidthX, [Integer(278)]),
        (NominalWidthX, [Integer(607)]),
    ],
    default: PhantomData<allsorts_subset_browser::cff::PrivateDictDefault>,
}
font: Dict {
    dict: [
        (Notice, [Integer(391)]),
        (Weight, [Integer(388)]),
        (PostScript, [Integer(392)]),
        (FontBBox, [Integer(-168), Integer(-250), Integer(992), Integer(947)]),
        (Charset, [Offset(338)]),
        (CharStrings, [Offset(393)]),
        (Private, [Offset(32), Offset(3972)]),
    ],
    default: PhantomData<allsorts_subset_browser::cff::TopDictDefault>,
}
```
But there doesn't seem to be any function in the allsorts crate to convert this into an OpenType font, which could then provide the needed tables for printpdf to work.
It should be possible, though; for example, there is a somewhat older Java-based converter from CFF to OpenType on GitHub: FontVerter.
So the data is available after parsing the CFF, but basically it needs to be reconstructed into a full font. Could the allsorts crate provide this? Or am I completely off track here?
> Hmm, if it's certain that the font is correctly read from the Dictionary (use a PDF debugger tool to extract the font file), then we can implement some kind of "retry" mechanism, i.e. "if the font fails to parse, try auto-fixing the first bytes and see if it parses again". I just want to confirm that this isn't a bug in printpdf, but a bug in how the font got embedded in the file.
Unfortunately, adding the OTTO tag in front of the embedded font doesn't work, as CFF is a different layout than OpenType. Tried that and failed.
Thank you @fschutt for accepting the PR. I have found a workaround for the moment, which allows me to correctly produce the PDF I need.
- parse the PDF
- embedded CFF fonts are found but not parsed
- add those fonts to the PDF with `add_font`
- iterate over all pages and operations, replacing the `font_id` in text write and font size operations, and also replacing the known unicode characters with the correct ones
- save the PDF
Without any chance of correctly parsing the embedded fonts, I see no other solution. It does allow me to continue with my development, as the PDFs I use as a base are static and known.
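The rewrite pass from the list above could be sketched roughly like this (the `Op` enum here is a stand-in for illustration, not printpdf's real operation type):

```rust
// Hypothetical stand-in for page operations; printpdf's real Op type differs.
#[derive(Debug, PartialEq)]
enum Op {
    SetFontSize { font_id: String, size_pt: f32 },
    WriteText { font_id: String, text: String },
}

// Replace references to the unparsed embedded font with the id of a full
// font that was added separately via add_font.
fn rewrite_font_ids(ops: &mut [Op], from: &str, to: &str) {
    for op in ops.iter_mut() {
        match op {
            Op::SetFontSize { font_id, .. } | Op::WriteText { font_id, .. }
                if font_id.as_str() == from =>
            {
                *font_id = to.to_string();
            }
            _ => {}
        }
    }
}
```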
If allsorts parses it, it should be possible to create a TTF font using a subset::whole_font - https://docs.rs/allsorts/latest/allsorts/subset/fn.whole_font.html
Amazing how you can find functions like that in the allsorts crate, even though its documentation is quite limited. I will look into it; maybe that's the missing piece. Could you please also take a look at #249? Thank you!
According to my current discussion with @wezm, I would need to alter my approach. It becomes apparent that the font cannot be fully reconstructed from the embedded CFF info (which is highly logical, as it is subsetted font info to begin with). So I might need to do a bit more work:

Introduce a new structure EmbeddedCFFFont which represents a subset of the information of BuiltinOrParsedFont, but enough to embed it again in the resulting PDF and to use it for char mapping during parsing. During serialize, embed the CFF font again, unaltered, keeping the name.

The user of the crate cannot add any more text with that font anyway, as it doesn't contain all glyphs. They might get away with some edge cases where the added text falls into the subsetted cmap.
If more text needs to be added, the full font needs to be added to the PDF, just as I do at the moment as a workaround. It is, however, treated as a separate font. Alternatively, a function to replace a subsetted font with a full font could be supplied, essentially replacing the font in the lookup tables.
Parsed text will be carried over into serialize, having the same positions, heights, widths, kerning etc, and will be written with the parsed cmap from the CFF font. As no additional characters have been introduced, each mapping should succeed.
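The proposed structure might look roughly like this (the field names are my guesses at what would be needed, not an actual printpdf type):

```rust
use std::collections::BTreeMap;

// Hypothetical shape for the proposed EmbeddedCFFFont: just enough to re-embed
// the font unaltered on save and to map glyph ids back to text during parsing.
struct EmbeddedCFFFont {
    /// PostScript name as embedded, e.g. "FKZKQC+UniversLTStd"
    name: String,
    /// Raw CFF bytes, kept unaltered for re-embedding during serialize
    cff_bytes: Vec<u8>,
    /// Glyph-id -> unicode mapping recovered from the subsetted cmap/ToUnicode
    gid_to_unicode: BTreeMap<u16, char>,
}
```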
At least that's how it looks to me at the moment. What do you think?
> At least that's how it looks to me at the moment. What do you think?
Well, I know that the font is subset, but that's not the issue: The font doesn't parse and that's either:
- A problem with the PDF or the program that made the PDF
- A problem with the byte extraction, that not the entire font bytes are extracted from the PDF
- A problem with the allsorts parser
What printpdf should do is take the subsetted font, parse it and reconstruct the actual text content. Then when de-serializing again, it does the reverse (taking the text, mapping it again to glyphs and building the subsetted font, which in this case would contain 100% of the original glyphs).
In emergency cases, it would be possible to fallback to the GID map that glyph-encoded PDFs should embed. However, I'm simply not relying on that to be present, so I re-construct it from the font right now (I think).
You can rule out items 2 and 3 as follows:
- Use `mutool extract` on the PDF to extract the fonts. This will allow you to compare what you extracted to a known good extraction. `mutool` is part of https://mupdf.com/core. It's the `mupdf-tools` package on Arch Linux.
- Parse the font with allsorts-tools: `allsorts dump path/to/font`. If the font extracted from the PDF is a CFF font, you can pass `--cff` to `allsorts dump` to treat the file as a standalone `CFF` table instead of a full OpenType font.
> If the font extracted from the PDF is a CFF font, you can pass `--cff` to `allsorts dump` to treat the file as a standalone `CFF` table instead of a full OpenType font.
Ah, but that might be the issue. printpdf doesn't handle these cases, it always expects a full font.
> Use `mutool extract` on the PDF to extract the fonts. This will allow you to compare what you extracted to a known good extraction. `mutool` is part of https://mupdf.com/core. It's the `mupdf-tools` package on Arch Linux.
mutool finds exactly the same fonts as printpdf does:
```
warning: ICC support is not available
extracting font-0019.cff
extracting font-0022.cff
extracting font-0034.ttf
extracting font-0039.cff
```
Running `file` on those:
```
font-0019.cff: data
font-0022.cff: data
font-0034.ttf: TrueType Font data, 11 tables, 1st "OS/2", 12 names, Macintosh, \251 2006 Microsoft Corporation. All Rights Reserved.FKZKQC+WebdingsFKZKQC+WebdingsFKZKQC+Webdin
font-0039.cff: data
```
> Parse the font with allsorts-tools: `allsorts dump path/to/font`. If the font extracted from the PDF is a CFF font, you can pass `--cff` to `allsorts dump` to treat the file as a standalone `CFF` table instead of a full OpenType font.
The result from this command (run on the font data I saved for debug reasons inside of printpdf after reading the embedded font stream):
```
- CFF:
  - version: 1.0
  - name: FKZKQC+UniversLTStd-Cn
  - num glyphs: 23
  - charset: Custom
  - variant: Type 1
  - Top DICT
    - Notice: Copyright 1987, 1991, 1994, 1998, 2002 Adobe Systems Incorporated. All Rights Reserved. Univers is a trademark of Heidelberger Druckmaschinen AG, exclusively licensed through Linotype Library GmbH, and may be registered in certain jurisdictions.
    - Weight: Regular
    - PostScript: [Integer(392)]
    - FontBBox: [Integer(-166), Integer(-250), Integer(1000), Integer(989)]
    - Charset: [Offset(335)]
    - CharStrings: [Offset(375)]
    - Private: [Offset(59), Offset(1926)]
  - encoding: Standard
  - Private DICT
    - BlueValues: [Integer(-15), Integer(15), Integer(722), Integer(15), Integer(-232), Integer(10), Integer(179), Integer(11)]
    - OtherBlues: [Integer(422), Integer(15), Integer(-628), Integer(15)]
    - FamilyBlues: [Integer(-19), Integer(19), Integer(720), Integer(21), Integer(-239), Integer(17), Integer(173), Integer(16)]
    - FamilyOtherBlues: [Integer(274), Integer(9), Integer(137), Integer(1), Integer(-630), Integer(19)]
    - StdHW: [Integer(62)]
    - StdVW: [Integer(82)]
    - StemSnapH: [Integer(62), Integer(18)]
    - StemSnapV: [Integer(82), Integer(10)]
    - DefaultWidthX: [Integer(444)]
    - NominalWidthX: [Integer(551)]
  - Local subrs: 0 (0 bytes)
  - Global subrs: 0 (0 bytes)
```
This is also the info I received when decoding the embedded font data during my first attempts. So I assume the font stream is correctly decoded by printpdf and that allsorts can parse the CFF data correctly.
> If the font extracted from the PDF is a CFF font, you can pass `--cff` to `allsorts dump` to treat the file as a standalone `CFF` table instead of a full OpenType font.

> Ah, but that might be the issue. printpdf doesn't handle these cases, it always expects a full font.
This is also where I am right now. I guess support for that has to be introduced into printpdf, as the relevant data is present in the PDF.