Support font subsetting to reduce size of pdf

Open Yang-Xijie opened this issue 2 years ago • 14 comments

Describe the bug

I want to add Chinese and Japanese text to a PDF. The characters (は哈) render correctly, but the output PDF is too large (14 MB).

I read the examples documentation and found chapter 8.6.2, Composite fonts. I want to handle each character separately: extract only the glyphs for the characters that are actually used, and package just those in the PDF file. How can I achieve this with borb? Is there a configuration option for it?

To Reproduce

Steps to reproduce the behaviour:

Download Microsoft Yahei.ttf at https://github.com/dolbydu/font/blob/master/unicode/Microsoft%20Yahei.ttf

from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.pdf import PDF
from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont
import time

from pathlib import Path

def print_current_time():
    print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))

if __name__ == "__main__":

    print_current_time()

    font_path = Path(__file__).parent / "font" / "Microsoft Yahei.ttf"
    custom_font = TrueTypeFont.true_type_font_from_file(font_path)

    print_current_time()

    doc = Document()
    page = Page()
    doc.append_page(page)
    layout = SingleColumnLayout(page)
    layout.add(Paragraph("はははは哈哈", font=custom_font))

    print_current_time()

    timestamp = time.strftime("%Y_%m_%d_%H_%M_%S", time.localtime())
    pdf_name = timestamp + ".pdf"
    pdf_path = Path(__file__).parent / "pdf" / pdf_name
    with open(pdf_path, "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)

    print_current_time()

Output:

2022-05-27 21:19:11
2022-05-27 21:19:26
2022-05-27 21:19:27
2022-05-27 21:20:02

Directory layout:

[ 288]  .
├── [  97]  README.md
├── [ 128]  font
│   ├── [ 21M]  Microsoft Yahei.ttf
│   └── [ 74M]  PingFang.ttc
├── [1.3K]  main.py
└── [  96]  pdf
    └── [ 14M]  2022_05_27_20_49_11.pdf

Expected behaviour

The size of PDF file should be less than 1MB.

Desktop (please complete the following information):

  • OS: macOS 12.3
  • borb version 2.0.26
  • Python 3.9.5

Yang-Xijie avatar May 27 '22 13:05 Yang-Xijie

In order to reduce the size of the pdf, borb would need to perform font subsetting.

This is when a pdf contains a special "made up" font that contains only those characters that are actually used in the document.

So for instance, if you created a pdf containing only the text "Hello World" you would find a font inside the pdf that only contains the characters H, e, l, o, W, r and d.
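
As a rough illustration of that idea (plain Python, not borb code; the helper name `characters_to_embed` is made up for this sketch), the set of characters a subset font would need can be derived from the document text like this:

```python
def characters_to_embed(text: str) -> list[str]:
    """Return the distinct non-whitespace characters of `text`, in first-seen order."""
    # dict.fromkeys keeps insertion order and drops repeated characters
    return list(dict.fromkeys(ch for ch in text if not ch.isspace()))


print(characters_to_embed("Hello World"))   # ['H', 'e', 'l', 'o', 'W', 'r', 'd']
print(characters_to_embed("はははは哈哈"))  # ['は', '哈']
```

A real subsetter also has to keep a few glyphs beyond this list (for example the `.notdef` glyph), so this only shows which characters drive the subset.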

Font subsetting is currently not supported in borb.

Kind regards, Joris Schellekens

jorisschellekens avatar May 27 '22 14:05 jorisschellekens

Thanks for your reply!

Font subsetting is such an important feature for languages with large character sets. Hope that borb will support it soon.

Yang-Xijie avatar May 27 '22 14:05 Yang-Xijie

@jorisschellekens Since you already use fonttools, subsetting TrueType fonts with it is simple. See this example:

https://github.com/orklann/caprice/blob/main/caprice/font/truetype/font.py#L89

For non-Latin TrueType fonts, subsetting is an important feature, since fonts in this category are usually large.
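
For readers who don't want to follow the link, here is a minimal sketch of subsetting with the `fontTools.subset` module. It is not the linked caprice code and not what borb does internally; the helper name `subset_font` and the file paths are assumptions for illustration.

```python
from pathlib import Path


def subset_font(src: Path, dst: Path, text: str) -> None:
    """Write to `dst` a copy of the font at `src` that keeps only the glyphs needed for `text`."""
    # Imported inside the function so the sketch can be loaded without fontTools installed.
    from fontTools import subset

    options = subset.Options()                 # default subsetting options
    font = subset.load_font(str(src), options)
    try:
        subsetter = subset.Subsetter(options=options)
        subsetter.populate(text=text)          # keep glyphs reachable from these characters
        subsetter.subset(font)                 # drop everything else, in place
        subset.save_font(font, str(dst), options)
    finally:
        font.close()
```

With the font from this issue, a call might look like `subset_font(Path("font/Microsoft Yahei.ttf"), Path("font/subset.ttf"), "はははは哈哈")`. Note that with default options the subsetter renumbers glyph IDs; fontTools offers `options.retain_gids = True` for cases where the original IDs have to stay stable.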

orklann avatar Jul 09 '22 11:07 orklann

I think I may have found a way to do this.

Both of these files were created with borb, one of them contains a subset Font, and the other does not. It's going to need more tests, and running all the existing tests. But I think this may just work :-)

output_without_subsetting.pdf output_with_subsetting.pdf

jorisschellekens avatar Jul 10 '22 03:07 jorisschellekens

  • ✔ According to the PDF validator I use (veraPDF), my output is a valid PDF.
  • ✔ The code has been documented.
  • ✔ A test has been added to verify both the subset and the non-subset document.

Next I want to try it with your particular font and code, and see whether the results still hold. If that turns out to be the case, this feature will be included in the next release.

Kind regards, Joris Schellekens

jorisschellekens avatar Jul 10 '22 07:07 jorisschellekens

Turns out I already had a test using Simhei.ttf. Same results.

  • The font file is roughly 10 MB.
  • Without font subsetting, the PDF (containing "你好世界") is 5.5 MB.
  • With font subsetting, the PDF is 3.2 KB.

I'm also going to attach the subset version of that PDF to this ticket, so you can verify for yourself. output_001.pdf

That means this feature will be included in the next release 📣

Kind regards, Joris Schellekens

jorisschellekens avatar Jul 10 '22 07:07 jorisschellekens

These two PDFs look different in Preview (the default PDF viewer) on macOS 12.4.

output_without_subsetting.pdf

image

output_with_subsetting.pdf

image

It might not be the expected behaviour.

Yang-Xijie avatar Jul 10 '22 09:07 Yang-Xijie

The attached PDF (output_001.pdf) is blank when opened in Preview (the default PDF viewer) on macOS 12.4.

image

However, you said that you added "你好世界" to this PDF, so this is probably not the expected behaviour.

Yang-Xijie avatar Jul 10 '22 09:07 Yang-Xijie

That is definitely not the expected behaviour.

It's using a substitute font, which means the viewer is claiming that it can't find the font file inside the PDF.

Can you open it in Adobe?

jorisschellekens avatar Jul 10 '22 09:07 jorisschellekens

Chrome 103.0.5060.114 (Official Build) (x86_64) on macOS 12.4

output_without_subsetting.pdf

image

output_with_subsetting.pdf

image

output_001.pdf

image

Yang-Xijie avatar Jul 10 '22 09:07 Yang-Xijie

It seems that the files do not satisfy certain parts of the PDF standard.

Yang-Xijie avatar Jul 10 '22 09:07 Yang-Xijie

Adobe Acrobat Reader DC Version 2022.001.20142 on macOS 12.4

Architecture: x86_64 Processor: Intel Build: 22.1.20142.0 AGM: 4.30.117 CoolType: 6.2.1 JP2K: 2.0.6.50420

output_without_subsetting.pdf

image

output_with_subsetting.pdf

image

output_001.pdf

blank

Yang-Xijie avatar Jul 10 '22 09:07 Yang-Xijie

It is weird that I received your comment by email but cannot find it on GitHub.

image

macOS 12.4 Preview.app & Chrome.app & Safari.app

image

Yang-Xijie avatar Jul 19 '22 16:07 Yang-Xijie

After having discussed this issue with another PDF expert, it seems like the actual subsetting of the font (rather than the dictionaries in the PDF) is going awry.

Sadly, that makes this problem a bit trickier. Currently I use fonttools to do the subsetting. And I'd prefer to keep most of that functionality delegated to an external library.
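
One way to narrow this down is to inspect the subset font file on its own, before it is embedded. The following is only a sketch under assumptions: the helper names are hypothetical, it presumes the subset font has been written to disk, and it checks just the character map, which is one of several things that can go wrong.

```python
def missing_characters(cmap: dict[int, str], text: str) -> list[str]:
    """Return the characters of `text` that have no entry in a code point -> glyph name map."""
    return [
        ch for ch in dict.fromkeys(text)
        if not ch.isspace() and ord(ch) not in cmap
    ]


def check_font_file(font_path: str, text: str) -> list[str]:
    """Load a font with fontTools and report which characters of `text` it cannot map."""
    # Imported inside the function so the pure helper above works without fontTools.
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    try:
        # getBestCmap() returns None when the font has no Unicode cmap subtable
        return missing_characters(font.getBestCmap() or {}, text)
    finally:
        font.close()


# Pure-Python demonstration with a hand-written map (glyph names are made up):
demo_cmap = {ord("你"): "glyph1", ord("好"): "glyph2"}
print(missing_characters(demo_cmap, "你好世界"))  # ['世', '界']
```

An empty result from `check_font_file` would suggest the character map survived subsetting, which points the search towards glyph data or glyph numbering instead.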

jorisschellekens avatar Jul 19 '22 18:07 jorisschellekens