PdfPig
Extract fonts from PDFs
Hi,
First of all, thank you for this library. It's great. I have a few PDFs whose fonts are incorrectly encoded, so I am writing an app to extract text from them. For that, I need to be able to get the fonts out of them (they are TTF). Is there currently any way to do that?
I'm not sure I understand your question. You want to extract some text from a PDF, but the fonts were wrongly encoded?
Well, you can remove the embedded part of the font (/FontDescriptor), so that a parser extracting the font info is forced to either look the font up on the system or replace it with a default one. That would allow you to extract the text.
But I don't think our API would allow you to do any of these things.
Example PDF: book__66.pdf
This is what happens when you copy the text:
This is how the font is mapped:
This is how it's supposed to be mapped:
My first solution was to create a map for each character. It mostly worked. BUT! The problem is that the encoding of each font is randomized; it changes depending on the PDF file. So I have to create a map for each font in every PDF file!
I am trying to extract all of the fonts from the PDF file and find the right encoding based on the shape of each glyph (I have already done this part). Is there a better way of solving this?
Here is a map for the example font cndklb+naliregular.zip
Same font, used in two PDF files with two different encodings!
I don't know what encoding the program is seeing, but if the font is embedded, you can see different glyphs mapped to different character codes.
A character code is not the same as a char. For example, let's say the font has a glyph for the letter S. When a font is embedded, the font writer may give the letter S an id (character code) which may or may not be the same value as its ASCII (byte) representation.
That's why you may see character code 20 mapped to the letter M in one PDF and to A in another.
I would have to inspect the file closely to see if that's what is happening.
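To make the character code point concrete, here is a minimal sketch. The two tables and the `Decode` helper are made up for illustration; they are not taken from the example PDFs and are not part of PdfPig. It only shows why one table per font per file is needed when the same codes carry different meanings. It assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical tables: character code -> intended text, one per PDF file.
// The values are invented; a real table would come from inspecting the font.
var mapForFileA = new Dictionary<int, string> { [20] = "M", [21] = "A" };
var mapForFileB = new Dictionary<int, string> { [20] = "A", [21] = "S" };

// Translate raw character codes using one file's table.
// Codes missing from the table become U+FFFD (replacement character).
static string Decode(IEnumerable<int> codes, IReadOnlyDictionary<int, string> map)
{
    var text = new StringBuilder();
    foreach (var code in codes)
    {
        text.Append(map.TryGetValue(code, out var mapped) ? mapped : "\uFFFD");
    }
    return text.ToString();
}

// The same two codes decode differently depending on the file's table.
var sampleCodes = new[] { 20, 21 };
Console.WriteLine(Decode(sampleCodes, mapForFileA)); // MA
Console.WriteLine(Decode(sampleCodes, mapForFileB)); // AS
```

If the tables can be derived automatically (for example from glyph shapes, as described earlier in the thread), the decoding step itself stays this simple.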
Please do, I'd appreciate it if there is a more straightforward method for solving it
Yeah, the font is embedded. I thought the library was able to extract the text, but I was wrong:
Maybe @EliotJones can shed some light on this.
Here is something strange:
I have come across this font: CNDKMD+NaliRegular+2.zip
It has a different encoding depending on which library/app you use!
For example, for this glyph, https://fontdrop.info/ and GlyphTypeface.CharacterToGlyphMap show U+F0F9, while FontForge and PdfPig show U+02D8!
Do you have any idea what might be going on? How can I get consistent results?
This is the code I am using to get all of the glyphs of a font:
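var uri = "...";
var families = System.Windows.Media.Fonts.GetFontFamilies(uri);
foreach (var family in families)
{
    foreach (var typeface in family.GetTypefaces())
    {
        // Skip typefaces that have no backing glyph typeface.
        if (!typeface.TryGetGlyphTypeface(out var glyph))
        {
            continue;
        }

        // Maps Unicode code points (key) to glyph indices (value).
        var characterMap = glyph.CharacterToGlyphMap;
        foreach (KeyValuePair<int, ushort> kvp in characterMap)
        {
            // The character this font claims the glyph represents.
            var sourceChar = char.ConvertFromUtf32(kvp.Key);
        }
    }
}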
Hi, I found out some of the problems (the comments I have hidden) were caused by mutool which I was using to extract the font.
My only problem now is extracting the fonts programmatically. If PdfPig could give me a stream for each font, that'd be great.
Hi @encrypt0r I will try to take a look into exposing the font files when I get back to PdfPig.
Most of the font classes for dealing with PDF fonts are still internal in PdfPig unfortunately because I don't want to define the public API fully yet while I still need to make large internal changes.
You can access the loaded fonts for a file as follows:
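// Requires: using System.Reflection; using UglyToad.PdfPig;
using (var document = PdfDocument.Open(GetFilename(), ParsingOptions.LenientParsingOff))
{
    // Load a page before inspecting the loaded fonts.
    var page = document.GetPage(1);

    // Private field on PdfDocument holding the caching providers.
    var cachingProvider =
        typeof(PdfDocument).GetField("cachingProviders", BindingFlags.NonPublic | BindingFlags.Instance)
            .GetValue(document);

    // The resource container (store) exposed by the caching providers.
    var resourceStore = cachingProvider.GetType().GetProperty("ResourceContainer")
        .GetGetMethod().Invoke(cachingProvider, null);

    // Private field on the resource store holding the fonts loaded so far.
    var fontsDictionary = resourceStore.GetType()
        .GetField("loadedFonts", BindingFlags.NonPublic | BindingFlags.Instance)
        .GetValue(resourceStore);

    // Test assertion; not needed for reading the fonts.
    Assert.Equal(PageSize.Letter, page.Size);
}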
The types of the values in the font dictionary are also internal, but you might be able to get the data you need through reflection.
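As a rough sketch of what "through reflection" could look like, the helper below lists each entry of a dictionary-like object together with the runtime type and instance properties of its value. `DescribeEntries` is a hypothetical helper, not a PdfPig API, and it assumes only that the object is enumerable and that its entries expose `Key` and `Value` properties (as `KeyValuePair` does). The sample dictionary is a stand-in; which members the internal font types actually have is not stated in this thread, so the output is meant as a way to discover them. It assumes C# 9 or later.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical helper: describe every entry of a dictionary-like object
// without knowing its (possibly internal) key and value types.
static List<string> DescribeEntries(object dictionary)
{
    var lines = new List<string>();
    if (dictionary is not IEnumerable entries)
    {
        return lines;
    }

    const BindingFlags flags =
        BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance;

    foreach (var entry in entries)
    {
        // Each entry is expected to look like KeyValuePair<TKey, TValue>.
        var entryType = entry.GetType();
        var key = entryType.GetProperty("Key")?.GetValue(entry);
        var value = entryType.GetProperty("Value")?.GetValue(entry);
        if (value == null)
        {
            continue;
        }

        var valueType = value.GetType();
        lines.Add($"{key}: {valueType.FullName}");

        // List the instance properties so you can see what might be readable.
        foreach (var property in valueType.GetProperties(flags))
        {
            lines.Add($"  {property.PropertyType.Name} {property.Name}");
        }
    }

    return lines;
}

// Stand-in for the reflected font dictionary; a plain string is the value here.
var sample = new Dictionary<string, object> { ["F1"] = "placeholder font" };
foreach (var line in DescribeEntries(sample))
{
    Console.WriteLine(line);
}
```

Passing the `fontsDictionary` object obtained above instead of `sample` should print the concrete font types, which is a starting point for finding a member that holds the embedded font bytes.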