Extract fonts from PDFs

Open mhmd-azeez opened this issue 4 years ago • 11 comments

Hi,

First of all, thank you for this library. It's great. I have a few PDFs whose fonts are incorrectly encoded, so I am writing an app to extract text from them. For that, I need to be able to get the fonts out of them (they are TTF). Is there currently any way to do that?

mhmd-azeez avatar Sep 12 '20 11:09 mhmd-azeez

I'm not sure I understand your question. You want to extract text from a PDF, but the fonts were wrongly encoded?

Well, you can remove the embedded part of the font (/FontDescriptor), so that when a parser extracts the font info it is forced to either look the font up on the system or replace it with a default one. That would allow you to extract the text.

But I don't think our API would allow you to do any of these things.

InusualZ avatar Sep 12 '20 14:09 InusualZ

Example PDF: book__66.pdf

This is what happens when you copy the text:

image

This is how the font is mapped:

image

This is how it's supposed to be mapped:

image

My first solution was to create a map for each character. It mostly worked. BUT! The problem is that the encoding of each font is randomized; it changes depending on the PDF file. So I have to create a map for each font in every PDF file!

I am trying to extract all of the fonts from the PDF file and find the right encoding based on the shape of each glyph (I have already done this). Is there a better way of solving this?
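
To make the shape-based idea concrete, here is a rough sketch of how such a lookup could work. It is not necessarily how my app does it. All the data is made up, `BuildCodeMap` is a hypothetical helper, and it assumes each glyph outline can be reduced to a string that matches exactly between fonts (real outlines would probably need normalising first).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reference table, built once from a correctly encoded copy of the font:
// outline fingerprint -> the character that glyph represents.
var referenceByOutline = new Dictionary<string, string>
{
    ["M0,0 L5,10 L10,0"] = "A",
    ["M0,0 L0,10 L8,5 Z"] = "D",
};

// Hypothetical data from one PDF's embedded subset: character code -> outline fingerprint.
var embeddedOutlines = new Dictionary<int, string>
{
    [20] = "M0,0 L0,10 L8,5 Z",
    [21] = "M0,0 L5,10 L10,0",
    [22] = "M1,1 L2,2", // has no match in the reference table
};

// Builds a per-PDF map from character code to the real character by matching outlines.
// Codes whose outline is unknown are left out so the caller can handle them separately.
Dictionary<int, string> BuildCodeMap(
    IReadOnlyDictionary<int, string> outlinesByCode,
    IReadOnlyDictionary<string, string> charsByOutline)
{
    var map = new Dictionary<int, string>();
    foreach (var pair in outlinesByCode)
    {
        if (charsByOutline.TryGetValue(pair.Value, out var ch))
        {
            map[pair.Key] = ch;
        }
    }
    return map;
}

var codeMap = BuildCodeMap(embeddedOutlines, referenceByOutline);
foreach (var pair in codeMap)
{
    Console.WriteLine($"{pair.Key} -> {pair.Value}");
}
```

The reference table only has to be built once per font, while the code map is rebuilt for every PDF, which is what the randomized encodings force anyway.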

mhmd-azeez avatar Sep 12 '20 14:09 mhmd-azeez

Here is a map for the example font cndklb+naliregular.zip

mhmd-azeez avatar Sep 12 '20 14:09 mhmd-azeez

Same font, used in two PDF files with two different encodings!

image

mhmd-azeez avatar Sep 12 '20 14:09 mhmd-azeez

I don't know what encoding the program is seeing, but if the font is embedded, you can see different glyphs mapped to different character codes.

A character code is not the same as a char. For example, let's say the font has a glyph for the letter S. When a font is embedded, the font writer may give the letter S an id (character code) which may or may not be the same value as its ASCII (byte) representation.

That's why you may be seeing character code 20 mapped to the letter M in one PDF and to A in another.
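
As a toy illustration of that point (the tables below are invented, not taken from your files, and `Decode` is just a hypothetical helper), the same two character codes decode to different text depending on which font's table is used:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical code -> glyph tables for the same font embedded in two different PDFs.
var fontInPdfA = new Dictionary<int, string> { [20] = "M", [21] = "S" };
var fontInPdfB = new Dictionary<int, string> { [20] = "A", [21] = "M" };

// Translates a run of character codes through one font's table.
// Codes the table does not know are rendered as "?".
string Decode(IEnumerable<int> codes, IReadOnlyDictionary<int, string> codeToGlyph) =>
    string.Concat(codes.Select(c => codeToGlyph.TryGetValue(c, out var g) ? g : "?"));

var codes = new[] { 20, 21 };
Console.WriteLine(Decode(codes, fontInPdfA)); // prints "MS"
Console.WriteLine(Decode(codes, fontInPdfB)); // prints "AM"
```

So a map built for one PDF cannot be assumed to work for another, even when the font name is the same.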

I would have to inspect the file closely to see if that's what is happening.

InusualZ avatar Sep 12 '20 15:09 InusualZ

Please do, I'd appreciate it if there is a more straightforward method for solving it

mhmd-azeez avatar Sep 12 '20 15:09 mhmd-azeez

Yeah, the font is embedded. I thought the library was able to extract the text, but I was wrong: image

Maybe @EliotJones can shed some light on this.

InusualZ avatar Sep 12 '20 23:09 InusualZ

Here is something strange:

I have come across this font: CNDKMD+NaliRegular+2.zip

It has a different encoding depending on which library/app you use!

For example, for this glyph, https://fontdrop.info/ and GlyphTypeface.CharacterToGlyphMap show U+F0F9, while FontForge and PdfPig show U+02D8!

image

Do you have any idea what might be going on? How can I get consistent results?

mhmd-azeez avatar Sep 13 '20 00:09 mhmd-azeez

This is the code I am using to get all of the glyphs of a font:

// Requires a reference to PresentationCore (WPF).
var uri = "...";
var families = System.Windows.Media.Fonts.GetFontFamilies(uri);

foreach (var family in families)
{
    foreach (var typeface in family.GetTypefaces())
    {
        // Skip typefaces that have no backing glyph typeface.
        if (!typeface.TryGetGlyphTypeface(out var glyph))
        {
            continue;
        }

        // Maps Unicode code points to glyph indices in the font.
        var characterMap = glyph.CharacterToGlyphMap;

        foreach (KeyValuePair<int, ushort> kvp in characterMap)
        {
            var sourceChar = char.ConvertFromUtf32(kvp.Key);
        }
    }
}

mhmd-azeez avatar Sep 13 '20 00:09 mhmd-azeez

Hi, I found out some of the problems (the comments I have hidden) were caused by mutool which I was using to extract the font.

My only problem now is extracting the fonts programmatically. If PdfPig can give me a stream for each font, that'd be great

mhmd-azeez avatar Sep 14 '20 09:09 mhmd-azeez

Hi @encrypt0r I will try to take a look into exposing the font files when I get back to PdfPig.

Most of the font classes for dealing with PDF fonts are still internal in PdfPig unfortunately because I don't want to define the public API fully yet while I still need to make large internal changes.

You can access the loaded fonts for a file as follows:

 using (var document = PdfDocument.Open(GetFilename(), ParsingOptions.LenientParsingOff))
 {
     var page = document.GetPage(1);

     var cachingProvider =
         typeof(PdfDocument).GetField("cachingProviders", BindingFlags.NonPublic | BindingFlags.Instance)
             .GetValue(document);

     var resourceStore = cachingProvider.GetType().GetProperty("ResourceContainer")
         .GetGetMethod().Invoke(cachingProvider, null);

     var fontsDictionary = resourceStore.GetType()
         .GetField("loadedFonts", BindingFlags.NonPublic | BindingFlags.Instance)
         .GetValue(resourceStore);


     Assert.Equal(PageSize.Letter, page.Size);
 }

The types of the values in the font dictionary are also internal, but you might be able to get the data you need through reflection.
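
As a rough sketch of that last step: once you hold the dictionary as a plain `object`, you can walk it without compile-time access to the internal types. This assumes the field really holds something implementing the non-generic `IDictionary` (a `Dictionary<,>` does), which I have not checked here. `DescribeEntries` is a hypothetical helper, and the stand-in dictionary below only exists to make the snippet runnable on its own.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Lists each key together with the runtime type of its value.
// Returns an empty list if the object is not a non-generic IDictionary.
List<string> DescribeEntries(object dictionary)
{
    var lines = new List<string>();
    var entries = dictionary as IDictionary;
    if (entries == null)
    {
        return lines;
    }

    foreach (DictionaryEntry entry in entries)
    {
        var typeName = entry.Value?.GetType().FullName ?? "null";
        lines.Add($"{entry.Key} -> {typeName}");
    }
    return lines;
}

// Stand-in for the value read from the "loadedFonts" field (assumption: dictionary-shaped).
object standIn = new Dictionary<string, object> { ["F1"] = "placeholder font object" };

foreach (var line in DescribeEntries(standIn))
{
    Console.WriteLine(line); // prints "F1 -> System.String"
}
```

From the runtime type names you can then decide which members to read with `GetProperty` or `GetField`, in the same way as in the snippet above.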

EliotJones avatar Sep 14 '20 13:09 EliotJones