Unicode Fonts

Open pierre1451 opened this issue 2 years ago • 18 comments

Hello Andre, thanks for making and sharing PDFGen. It's clean, small, without dependencies. Great. I see you looked into adding fonts to the PDF. Any update / beta / idea on loading and using fonts supporting unicode characters? Pierre

pierre1451 avatar Jun 07 '22 16:06 pierre1451

Hi Pierre. Unfortunately I haven't had a look at this in any real depth. I think it requires quite a lot of font-decoding code to do things like determine character bounding boxes, which I haven't looked at. Unfortunately the built-in PDF fonts are super limited in terms of character set. Do you have a specific requirement?

AndreRenaud avatar Jun 07 '22 21:06 AndreRenaud

Here's the story: we have a web app, with a backend in PHP and C/C++, that needs to produce PDF reports and certificates. Nothing fancy, but with some user-level customization, images, tables, and all kinds of languages. Usually a few pages. We used TCPDF, got stuck, moved to wkhtmltopdf, and got stuck again (project stopped / CSS issues). I'm looking at headless Chrome: it works, but it's enormous (268 MB) and feels like overkill. I'm a low-level developer and I like the idea of a simple API. All in all, PDFGen is really close. The ideal improvement would be to load a font (Windows for me) and 'pdf_add_text' a Unicode string. The other capability, but for later, might be transparency on images: I had to convert an RGBA PNG to RGB to get the image to render. It's very useful for watermarks and signatures.

pierre1451 avatar Jun 08 '22 03:06 pierre1451

Yeah, that sounds reasonable. At the moment I don't have loads of time to look at this feature. If you want to have a go at it, I'm able to assist, but I doubt I'll have time to write it myself in the near future.

AndreRenaud avatar Jun 08 '22 07:06 AndreRenaud

As a note on this (possibly to my own future self for implementation): there are some details on how this works in this Stack Overflow answer: https://stackoverflow.com/questions/3488042/how-can-i-extract-embedded-fonts-from-a-pdf-as-valid-font-files

We could probably use a hugely cut-down version of STB Truetype (https://github.com/nothings/stb/blob/master/stb_truetype.h), with all of the rendering removed, to just extract the font metadata (basically we just need to work out the glyph widths).

Another option would be to initially just ignore widths, so that if you're using a custom font you can't do things like word wrapping. This would be a bit poor, but at least it would let you render single lines of text in a custom font. That would mean we wouldn't need STB Truetype at all (I think). It's possible that this implementation is fairly small.
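Glyph widths matter here because PDF measures glyphs in thousandths of the font size, while TrueType stores advance widths in font design units (units per em). A minimal sketch of that conversion, and of measuring a line as word wrapping would require, assuming the advance widths have already been read from the font by some other means (for example a cut-down stb_truetype). The names `font_metrics`, `pdf_glyph_width` and `text_width_pt` are hypothetical and are not part of PDFGen:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-font metrics, not a PDFGen structure. The advance
 * widths are in font design units, one entry per single-byte character. */
struct font_metrics {
    int units_per_em;   /* design units per em, e.g. 2048 in many TrueType fonts */
    const int *advance; /* 256 advance widths, indexed by character code */
};

/* Convert one advance width to PDF glyph units (1/1000 of the font size). */
static int pdf_glyph_width(const struct font_metrics *m, unsigned char c)
{
    assert(m != NULL && m->units_per_em > 0);
    return m->advance[c] * 1000 / m->units_per_em;
}

/* Width in points of a single-byte string drawn at the given font size. */
static double text_width_pt(const struct font_metrics *m, const char *text,
                            double size)
{
    long total = 0;
    const unsigned char *p;

    for (p = (const unsigned char *)text; *p != '\0'; p++)
        total += pdf_glyph_width(m, *p);
    return (double)total * size / 1000.0;
}
```

With widths like these available, a word-wrapping routine only has to compare `text_width_pt()` of the candidate line against the available width. Without them, only single unwrapped lines are practical, which is the trade-off described above.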

AndreRenaud avatar Jun 08 '22 23:06 AndreRenaud

Hi Andre. Doing my homework on PDF: got the spec, v1.7 (2006), got the 'hello world' working, and I understand better what you did (objects / offsets). Next for me: see what a 'Hello World' with an embedded ttf font looks like.

pierre1451 avatar Jun 09 '22 03:06 pierre1451

Sounds great. If you put together any example stuff, please push it up to a branch/repo on Github and we can discuss it there. After looking at the details in the spec, I think this might be less work than I'd initially worried. I'll try and have a poke around next week if I can find some time.

AndreRenaud avatar Jun 09 '22 04:06 AndreRenaud

Hi Andre, I see you're running with this improvement! Going through STB now, good find: I looked into something like that some time ago but fell back to Windows GDI to print text in a bitmap. The ramp-up to embedded fonts in PDF is steep (for me): My ABC in Arial Narrow is reasonably small, but I have to understand the ttf format now, and how the relevant glyphs are extracted. I'd like to be more helpful, but you're running too fast! I'm offline this weekend but let's connect next week, my email is [email protected]

pierre1451 avatar Jun 10 '22 14:06 pierre1451

I had a poke around. It looks like we'd need to extract the font metadata regardless, which essentially means we would need STB TrueType. I dropped all the rendering aspects of it, and it comes in at around 1000 lines, which seems acceptable to me. There is also some question about how UTF-8 text strings get encoded inside the PDF document. It might need to become UTF16, but I'm not clear on it yet. If you've put together a 'hello world' pdf, with embedded fonts, please send it through here, or to me at [email protected].
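On the UTF-8 versus UTF-16 question: PDF text strings can be written as UTF-16BE with a leading byte order mark (FE FF), often as a hex string. The sketch below shows only that conversion step. It is not PDFGen code, the function names are hypothetical, and whether page text in an embedded font would use this form or 2-byte glyph indices instead is exactly the open question in this thread:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence into *cp. Returns the number of bytes consumed,
 * or 0 on malformed input. Minimal: overlong forms are not rejected. */
static int utf8_decode(const unsigned char *s, uint32_t *cp)
{
    int n, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) /* also stops at a premature NUL */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}

/* Write a NUL-terminated UTF-8 string as a PDF hex string, <FEFF...>, in
 * UTF-16BE. Returns 0 on success, -1 on malformed input or a short buffer. */
static int utf8_to_pdf_hex_utf16(const char *utf8, char *out, size_t out_len)
{
    const unsigned char *s = (const unsigned char *)utf8;
    size_t pos;
    int w;

    assert(utf8 != NULL && out != NULL);
    w = snprintf(out, out_len, "<FEFF");
    if (w < 0 || (size_t)w >= out_len)
        return -1;
    pos = (size_t)w;
    while (*s != '\0') {
        uint32_t cp;
        int n = utf8_decode(s, &cp);

        if (n == 0 || cp > 0x10FFFF)
            return -1;
        s += n;
        if (cp >= 0x10000) { /* outside the BMP: emit a surrogate pair */
            cp -= 0x10000;
            w = snprintf(out + pos, out_len - pos, "%04X%04X",
                         (unsigned)(0xD800 + (cp >> 10)),
                         (unsigned)(0xDC00 + (cp & 0x3FF)));
        } else {
            w = snprintf(out + pos, out_len - pos, "%04X", (unsigned)cp);
        }
        if (w < 0 || (size_t)w >= out_len - pos)
            return -1;
        pos += (size_t)w;
    }
    if (out_len - pos < 2) /* room for '>' and the terminator */
        return -1;
    out[pos++] = '>';
    out[pos] = '\0';
    return 0;
}
```

For example, the UTF-8 bytes for "Añ" come out as `<FEFF004100F1>`.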

AndreRenaud avatar Jun 11 '22 02:06 AndreRenaud

Here's a minimal HTML2PDF Blaec Hello World using headless Chrome. I have cloned the ttf_font branch, but still have a couple of fixes to look at with Win11/Visual Studio (fileno, BMP size, location of file). More later. (Attached: Blaec_HelloWorld.pdf)

pierre1451 avatar Jun 11 '22 22:06 pierre1451

Hi Andre, I added stb_truetype.h and modified main.c to load and read the font tags. Works pretty much out of the box. I suggest you give me rights to the ttf_fonts branch, but that's your project, let me know how you prefer to work. I'll email you the modified files for now.

pierre1451 avatar Jun 14 '22 03:06 pierre1451

The easiest method is for you to fork the repo under your own account, make your changes there, then issue a pull request back to this repo for the final version. I've made some changes to the ttf_fonts branch locally to try and bring things in, but I haven't got the widths working yet, or unicode encoding (so really, very little works 😄). I'm probably out of time to look at it for another week or so though.

I've pushed what I've got - if you run the testprog now, it will draw the text with the correct font. But the widths calculations are bogus, and it still doesn't support unicode characters. Supporting TTF fonts & supporting Unicode output are two separate things, so it's possible it's easier to get TTF working first (without unicode, still restricted to the PDFDocEncoding characters), and then deal with Unicode separately. I'm not 100% sure.

AndreRenaud avatar Jun 14 '22 04:06 AndreRenaud

In the long run, I think I'll probably end up copy/pasting the cut-down contents of stb_truetype.h inside pdfgen.c. It's a bit horrible, but I don't want to change the installation requirements for users (at the moment it's just two files, pdfgen.c & pdfgen.h, and I'd like to keep it that way).

AndreRenaud avatar Jun 14 '22 04:06 AndreRenaud

Nothing wrong with copy/paste. FPDF is the PHP version of what we'd like to get to (http://www.fpdf.org/en/download.php). The fonts folder has functions to read and write a ttf, as a subset, you may find that useful.

pierre1451 avatar Jun 15 '22 22:06 pierre1451

One quick question: Does this issue mean that accented vowels, ñ, ç, and punctuation such as ¿ or ¡ don't work? What about Greek chars? Just like the OP, my use case would be generating reports, but I need to at least support English and Spanish, with perhaps some Greek chars for referring to math symbols. Although, on second thought, maybe I can just generate the outlines of the text from STB_truetype and make a graphical PDF instead...

cblc avatar Apr 29 '23 12:04 cblc

Hi. At this stage, since PDFGen doesn't (yet?) support embedded TTF or Type1 fonts, we're stuck with whatever characters Adobe decided to enable in their original encodings. If you have a look at appendix D of this document, you can see what is available: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf

You should be good for anything in Spanish & Greek. The simplest option would be to edit utf8_to_pdfencoding in pdfgen.c to add in the characters you want - basically you have to put in the utf8 character you're sending, and the corresponding number from the Win encoding column.

Alternatively, if you're not 100% sure what to do, send me a list of the specific characters you need, and I'll try and sort it out.
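The kind of mapping described above can be sketched as a lookup from a Unicode code point to the byte in the Win (WinAnsiEncoding) column. This is a hypothetical helper for illustration only: it is not the actual body or signature of utf8_to_pdfencoding in pdfgen.c, and it covers just a handful of characters:

```c
#include <stdint.h>

/* Hypothetical helper, not PDFGen's utf8_to_pdfencoding(): map one Unicode
 * code point to its WinAnsiEncoding byte, or return -1 if it has none here. */
static int codepoint_to_winansi(uint32_t cp)
{
    /* Printable ASCII keeps its value. */
    if (cp >= 0x20 && cp <= 0x7E)
        return (int)cp;
    /* WinAnsiEncoding follows Windows code page 1252, whose 0xA0..0xFF
     * range lines up with U+00A0..U+00FF (covers ñ, ç, ¿, ¡, á, é, ...). */
    if (cp >= 0xA0 && cp <= 0xFF)
        return (int)cp;
    /* A few characters outside Latin-1, with their octal Win codes. */
    switch (cp) {
    case 0x20AC: return 0x80; /* Euro sign, octal 200 */
    case 0x2026: return 0x85; /* ellipsis, octal 205 */
    case 0x2013: return 0x96; /* en dash, octal 226 */
    case 0x2014: return 0x97; /* em dash, octal 227 */
    case 0x2122: return 0x99; /* trademark, octal 231 */
    default:     return -1;   /* no slot in this sketch */
    }
}
```

Spanish text falls entirely inside the Latin-1 range handled here. Greek letters such as α (U+03B1) generally have no slot in the Win column (µ at 0xB5 is an exception), so this sketch returns -1 for them; the standard Symbol font carries Greek letters under its own encoding.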

AndreRenaud avatar Apr 30 '23 10:04 AndreRenaud

Here's where I am with PDF: I thought (probably wrongly) that the easiest route to get a working C/C++ API in my app was to port FPDF from PHP. Plenty of sweat and tears later (UTF8, TTF font subsetting and embedding etc.), I have a working 4000-line C++ API. It's raw. The 'only' Windows dependency I have is the use of fontsub.h, which does the font subsetting: not sure it's a big deal, but I didn't want to dive into this thing. I'm happy to share and contribute if you want to update pdfgen.

pierre1451 avatar Apr 30 '23 23:04 pierre1451

I want to use a different font installed on my machine which is not part of the default fonts PDFGen supports. How should I do that?

Should I use stb_ttf? if yes, how? Is there any sample for it?

Here's what I get when I try to use Cambria font:

PDF Error: -22 - Unable to determine width for font 'Cambria'

LinArcX avatar Apr 26 '24 20:04 LinArcX

At this stage there hasn't really been significant work done on TTF support. It's not a feature of PDFGen at the moment, so there is no support for fonts outside of the current list from the PDF spec.

 * @param font New font to use. This must be one of the standard PDF fonts:
 *  Courier, Courier-Bold, Courier-BoldOblique, Courier-Oblique,
 *  Helvetica, Helvetica-Bold, Helvetica-BoldOblique, Helvetica-Oblique,
 *  Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic,
 *  Symbol or ZapfDingbats
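Since only the fonts in that list are accepted, a caller can check a name up front instead of running into the width error later. A minimal sketch, assuming an exact, case-sensitive match against the names listed above; `is_standard_pdf_font` is a hypothetical helper and not part of PDFGen:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The 14 standard PDF font names, as listed in the header comment above. */
static const char *const standard_fonts[] = {
    "Courier", "Courier-Bold", "Courier-BoldOblique", "Courier-Oblique",
    "Helvetica", "Helvetica-Bold", "Helvetica-BoldOblique",
    "Helvetica-Oblique",
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Symbol", "ZapfDingbats",
};

/* Hypothetical helper: true if name is one of the standard PDF fonts. */
static bool is_standard_pdf_font(const char *name)
{
    size_t i;

    if (name == NULL)
        return false;
    for (i = 0; i < sizeof standard_fonts / sizeof standard_fonts[0]; i++) {
        if (strcmp(name, standard_fonts[i]) == 0)
            return true;
    }
    return false;
}
```

A report generator could call this before setting the font and fall back to, say, "Times-Roman" when a name such as "Cambria" is requested.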

AndreRenaud avatar Apr 26 '24 21:04 AndreRenaud