[Draft] Handle embedded subset fonts during parsing and later serialization
Many PDFs embed their fonts only as subsets, which leads to parsing errors. This PR implements parsing of embedded TrueType and CFF (Type1C) subset fonts, including text extraction and serialization of those fonts back into the saved PDF.
Some of the code is duplicated and still needs to be refactored into functions.
I would love to get your input, @fschutt, before going any further. For my test PDFs it correctly parses the embedded fonts, parses the text (including non-standard characters), and the saved PDF looks pretty OK. Looking forward to a review from your end.
I am aware that additional work is needed to correctly parse all types of embedded fonts in the future; however, a first step has been taken. I also cleaned up a few minor things on the side, like no longer adding empty tags to the document metadata.
Furthermore, the WASM part is a bit of an uncertainty for me: I do not know whether, and how, the embedded subset fonts should be serialized for WASM use. For now, I left them out of that part.
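For context on the naming convention involved (general PDF background, not code from this PR): subset fonts are conventionally embedded under a BaseFont name consisting of a six-uppercase-letter tag and a plus sign, e.g. `ABCDEF+Helvetica`. A minimal, hypothetical sketch of stripping that tag:

```rust
/// Return the font name with a subset tag ("ABCDEF+") stripped, if one is present.
/// Hypothetical helper for illustration only; not the API added in this PR.
fn strip_subset_tag(base_font: &str) -> &str {
    match base_font.split_once('+') {
        Some((tag, rest)) if tag.len() == 6 && tag.chars().all(|c| c.is_ascii_uppercase()) => rest,
        _ => base_font,
    }
}

#[test]
fn strips_subset_prefix() {
    assert_eq!(strip_subset_tag("ABCDEF+Helvetica"), "Helvetica");
    assert_eq!(strip_subset_tag("Helvetica"), "Helvetica");
}
```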
No, it's better to wait with PRs for now. I am currently working on full HTML layout; I will likely break your code, and that will make rebasing worse.
Thanks for the contribution, but I added the serialization here so that the functions benefit more than just printpdf.
All right, makes total sense. I will wait until you are done with the HTML layout. I am happy to rebase it then, just give me a ping when you are done.
@ronnybremer Technically I'm "done" with the migration now, but I still have some issues with the HTML layout. It's starting to look solid, though; I hope to be done tomorrow (html_full example):
The biggest change is that I refactored the PDF Ops surrounding text; you now only have:
```rust
/// `Tf` operator: Set font and size
/// This maps 1:1 to PDF and should be used instead of SetFontSize/SetFontSizeBuiltinFont
SetFont { font: PdfFontHandle, size: Pt },

/// `Tj`/`TJ` operators: Show text at current position
/// Font must be set first with SetFont. This maps 1:1 to PDF.
ShowText { items: Vec<TextItem> },
```
and:
```rust
/// Represents a positioned glyph with optional CID mapping
#[derive(Debug, Clone, PartialEq, PartialOrd, Deserialize, Serialize)]
pub struct Codepoint {
    /// Glyph ID in the font
    pub gid: u16,
    /// Horizontal offset in thousandths of an em
    pub offset: f32,
    /// Optional CID for CID-keyed fonts (used for ToUnicode mapping)
    pub cid: Option<String>,
}

/// Represents a text segment (decoded as a UTF-8 String) or a spacing adjustment
#[derive(Debug, Clone, PartialEq, PartialOrd, Deserialize, Serialize)]
#[serde(untagged)]
pub enum TextItem {
    /// A segment of text
    Text(String),
    /// A spacing adjustment, in thousandths of an em
    Offset(f32),
    /// Positioned glyph IDs with horizontal offsets and optional CID mapping
    /// This avoids the need to convert GIDs to strings during parsing
    GlyphIds(Vec<Codepoint>),
}
```
This way the parser can take the glyph IDs straight out of a PDF and serialize them back without ever having to decode them to a String (which makes your use case of "decoding a PDF, adding some lines and saving it back" easier). Decoding the glyph IDs back to strings during PDF reading was technically a hack.
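To illustrate (my sketch, not code from the branch; I am assuming the variants above live in the existing `Op` enum), a parser could round-trip a `TJ` array like this:

```rust
// Assumes a font has already been selected with Op::SetFont for the current text run.
let op = Op::ShowText {
    items: vec![
        // Glyph IDs copied verbatim from the content stream, no String decoding needed
        TextItem::GlyphIds(vec![
            Codepoint { gid: 36, offset: 0.0, cid: None },
            Codepoint { gid: 72, offset: -15.0, cid: None },
        ]),
        // A kerning adjustment in thousandths of an em, as found in a TJ array
        TextItem::Offset(-120.0),
        // Newly added content can still be given as decoded text
        TextItem::Text("added line".to_string()),
    ],
};
```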
If you could make the functions for parsing a ParsedFont free-standing functions, that would be a lot better. You will now have to use azul_layout::ParsedFont, though; I removed printpdf::ParsedFont, as it was a duplicate. I'll take a look tomorrow, just wanted to give an update.
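For what it's worth, a free-standing entry point might look roughly like the sketch below; the function name, the signature, and the format detection are my assumptions, only `azul_layout::ParsedFont` is taken from the comment above:

```rust
use azul_layout::ParsedFont;

/// Container format of an embedded FontFile / FontFile2 / FontFile3 stream.
enum EmbeddedFontKind {
    TrueType,
    Cff,
    OpenTypeCff,
    Unknown,
}

/// Classify the embedded font program by its leading bytes (hypothetical helper).
fn classify_embedded_font(bytes: &[u8]) -> EmbeddedFontKind {
    if bytes.starts_with(&[0x00, 0x01, 0x00, 0x00]) || bytes.starts_with(b"true") {
        EmbeddedFontKind::TrueType
    } else if bytes.starts_with(b"OTTO") {
        EmbeddedFontKind::OpenTypeCff
    } else if bytes.first() == Some(&1) {
        // Bare CFF (Type1C) data starts with a header whose major version byte is 1
        EmbeddedFontKind::Cff
    } else {
        EmbeddedFontKind::Unknown
    }
}

/// Hypothetical free-standing parser; the actual table parsing would delegate to
/// the code that currently lives on azul_layout::ParsedFont.
pub fn parse_embedded_font(bytes: &[u8], font_index: usize) -> Option<ParsedFont> {
    match classify_embedded_font(bytes) {
        EmbeddedFontKind::Unknown => None,
        _ => todo!("delegate to ParsedFont parsing for font index {font_index}"),
    }
}
```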
Thank you for the heads up @fschutt. I will look into this next week.