borb
Support font subsetting to reduce size of pdf
Describe the bug
I want to add Chinese and Japanese text to a PDF. I rendered Chinese and Japanese characters (は哈) successfully, but the resulting output.pdf is too large (14MB).
I read the example doc and found chapter 8.6.2, Composite fonts. I only want to embed the characters that are actually used, that is, extract the font data for each used character and package just those characters in the PDF file. How can I achieve this with borb? Is there a specific configuration option for it in borb?
To Reproduce
Steps to reproduce the behaviour:
Download Microsoft Yahei.ttf at https://github.com/dolbydu/font/blob/master/unicode/Microsoft%20Yahei.ttf
from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.pdf import PDF
from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont
import time
from pathlib import Path
def print_current_time():
    print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))

if __name__ == "__main__":
    print_current_time()
    font_path = Path(__file__).parent / "font" / "Microsoft Yahei.ttf"
    custom_font = TrueTypeFont.true_type_font_from_file(font_path)
    print_current_time()
    doc = Document()
    page = Page()
    doc.append_page(page)
    layout = SingleColumnLayout(page)
    layout.add(Paragraph("はははは哈哈", font=custom_font))
    print_current_time()
    timestamp = time.strftime("%Y_%m_%d_%H_%M_%S", time.localtime())
    pdf_name = timestamp + ".pdf"
    pdf_path = Path(__file__).parent / "pdf" / pdf_name
    with open(pdf_path, "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)
    print_current_time()
2022-05-27 21:19:11
2022-05-27 21:19:26
2022-05-27 21:19:27
2022-05-27 21:20:02
[ 288] .
├── [ 97] README.md
├── [ 128] font
│ ├── [ 21M] Microsoft Yahei.ttf
│ └── [ 74M] PingFang.ttc
├── [1.3K] main.py
└── [ 96] pdf
└── [ 14M] 2022_05_27_20_49_11.pdf
Expected behaviour
The size of PDF file should be less than 1MB.
Desktop (please complete the following information):
- OS: macOS 12.3
- borb version 2.0.26
- Python 3.9.5
In order to reduce the size of the PDF, borb would need to perform font subsetting.
This is when a pdf contains a special "made up" font that contains only those characters that are actually used in the document.
So for instance, if you created a pdf containing only the text "Hello World" you would find a font inside the pdf that only contains the characters H, e, l, o, W, r and d.
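To make the "Hello World" example concrete, here is a minimal sketch of the first step of subsetting: working out which characters a document actually uses. The function name `used_characters` is hypothetical and not part of borb; whether a space glyph is also kept is an implementation detail that this sketch ignores.

```python
def used_characters(text: str) -> list[str]:
    """Return the distinct non-whitespace characters of text, in first-seen order."""
    seen: list[str] = []
    for ch in text:
        # Skip whitespace and characters that were already collected.
        if ch.isspace() or ch in seen:
            continue
        seen.append(ch)
    return seen


print(used_characters("Hello World"))  # ['H', 'e', 'l', 'o', 'W', 'r', 'd']
```

A subset font would then only need glyphs for these seven characters instead of every glyph in the original font file.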
Font subsetting is currently not supported in borb.
Kind regards, Joris Schellekens
Thanks for your reply!
Font subsetting is such an important feature for languages with large character sets. I hope that borb will support it soon.
@jorisschellekens Since you already use fonttools, subsetting TrueType fonts with it is simple; see this example:
https://github.com/orklann/caprice/blob/main/caprice/font/truetype/font.py#L89
For non-Latin TrueType fonts, subsetting is an important feature, since fonts in this category are usually large.
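As a rough illustration of the suggestion above, the following sketch subsets a TrueType file with the `fontTools.subset` module. It is not necessarily what the linked example or borb does; the function names `used_codepoints` and `subset_font_file` and the file paths are hypothetical, and default subsetter options are assumed.

```python
from pathlib import Path


def used_codepoints(text: str) -> list[int]:
    """Return the sorted, distinct Unicode code points that occur in text."""
    return sorted({ord(ch) for ch in text})


def subset_font_file(src: Path, dst: Path, text: str) -> None:
    """Write a copy of the font at src that keeps only the glyphs needed for text."""
    # Imported lazily so that used_codepoints works without fontTools installed.
    from fontTools import subset
    from fontTools.ttLib import TTFont

    font = TTFont(str(src))
    subsetter = subset.Subsetter(options=subset.Options())
    subsetter.populate(unicodes=used_codepoints(text))
    subsetter.subset(font)
    font.save(str(dst))
    font.close()
```

Usage would look like `subset_font_file(Path("font/Microsoft Yahei.ttf"), Path("font/subset.ttf"), "はははは哈哈")`, after which the much smaller `subset.ttf` could be embedded instead of the full font.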
I think I may have found a way to do this.
Both of these files were created with borb, one of them contains a subset Font, and the other does not.
It's going to need more tests, and running all the existing tests. But I think this may just work :-)
:heavy_check_mark: According to the PDF validator I use (veraPDF), my output is a valid PDF.
:heavy_check_mark: The code has been documented.
:heavy_check_mark: A test has been added to verify both the subset and the non-subset document.
Next I want to try it with your particular font and code, and see whether the results still hold. If that turns out to be the case, this feature will be included in the next release.
Kind regards, Joris Schellekens
Turns out I already had a test using Simhei.ttf. Same results.
- The font file is roughly 10 MB.
- Without font subsetting, the PDF (containing "你好世界") is 5.5 MB.
- With font subsetting, the PDF is 3.2 KB.
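For a sense of scale, the figures above can be turned into a reduction factor. This is only back-of-the-envelope arithmetic on the reported sizes, assuming decimal units (1 MB = 1,000,000 bytes, 1 KB = 1,000 bytes); the variable names are illustrative.

```python
# Reported sizes, converted to bytes using decimal units (an assumption).
unsubset_pdf_bytes = 5_500_000  # 5.5 MB without font subsetting
subset_pdf_bytes = 3_200        # 3.2 KB with font subsetting

reduction_factor = unsubset_pdf_bytes / subset_pdf_bytes
print(f"The subset PDF is about {reduction_factor:.0f} times smaller.")
```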
I'm also going to attach the subset version of that PDF to this ticket, so you can verify for yourself. output_001.pdf
That means this feature will be included in the next release :mega:
Kind regards, Joris Schellekens
> I think I may have found a way to do this.
> Both of these files were created with borb, one of them contains a subset Font, and the other does not. It's going to need more tests, and running all the existing tests. But I think this may just work :-)
These two PDFs look different in Preview (the default PDF viewer) on macOS 12.4.
output_without_subsetting.pdf
[image]
output_with_subsetting.pdf
[image]
It might not be the expected behaviour.
> Turns out I already had a test using Simhei.ttf. Same results.
> - The font file is roughly 10 MB.
> - Without font subsetting, the PDF (containing "你好世界") is 5.5 MB.
> - With font subsetting, the PDF is 3.2 KB.
>
> I'm also going to attach the subset version of that PDF to this ticket, so you can verify for yourself. output_001.pdf
> That means this feature will be included in the next release 📣
> Kind regards, Joris Schellekens
The attached PDF appears blank when opened in Preview (the default PDF viewer) on macOS 12.4.
[image]
However, you said that you added "你好世界" to this PDF. It might not be the expected behaviour.
That is definitely not the expected behaviour.
It's using a substitute font (so it's claiming that it can't find the font file inside the PDF).
Can you open it in Adobe?
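One crude way to check whether a PDF references an embedded font program at all is to scan its raw bytes for the font descriptor keys `/FontFile`, `/FontFile2` and `/FontFile3` defined by the PDF specification. The sketch below assumes the font descriptor is stored uncompressed (it will miss descriptors inside compressed object streams), and the function name `embedded_font_keys` is hypothetical. A hit only shows that the key is present, not that the embedded font data is valid.

```python
import re

# Font descriptor keys that point at an embedded font program.
_FONT_FILE_KEY = re.compile(rb"/FontFile[23]?\b")


def embedded_font_keys(pdf_bytes: bytes) -> set[str]:
    """Return the distinct /FontFile* keys found in the raw bytes of a PDF."""
    return {m.group(0).decode("ascii") for m in _FONT_FILE_KEY.finditer(pdf_bytes)}
```

For example, `embedded_font_keys(Path("output_001.pdf").read_bytes())` (with `Path` imported from `pathlib`) returning an empty set would be consistent with a viewer falling back to a substitute font.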
Chrome 103.0.5060.114 (Official Build) (x86_64) on macOS 12.4
output_without_subsetting.pdf
[image]
output_with_subsetting.pdf
[image]
output_001.pdf
[image]
It seems that the output does not satisfy certain PDF standards.
Adobe Acrobat Reader DC Version 2022.001.20142 on macOS 12.4
Architecture: x86_64 Processor: Intel Build: 22.1.20142.0 AGM: 4.30.117 CoolType: 6.2.1 JP2K: 2.0.6.50420
output_without_subsetting.pdf
[image]
output_with_subsetting.pdf
[image]
output_001.pdf
blank
It is weird that I received your comment by email but cannot find it on GitHub.
[image]
macOS 12.4 Preview.app & Chrome.app & Safari.app
[image]
After having discussed this issue with another PDF expert, it seems like the actual subsetting of the font (rather than the dictionaries in the PDF) is going awry.
Sadly, that makes this problem a bit trickier. Currently I use fonttools to do the subsetting, and I'd prefer to keep most of that functionality delegated to an external library.
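If the subset font itself is suspected, one quick sanity check is to reload it with fonttools and confirm that its character map still covers every character in the document. This is a minimal sketch under that assumption, not borb's actual test; `missing_characters` and `check_subset_font` are hypothetical names.

```python
def missing_characters(cmap: dict, text: str) -> list[str]:
    """Return the distinct characters of text whose code point is absent from cmap."""
    return [ch for ch in dict.fromkeys(text) if ord(ch) not in cmap]


def check_subset_font(font_path: str, text: str) -> list[str]:
    """Load a (subset) font and list the characters of text it cannot map to a glyph."""
    # Imported lazily so that missing_characters works without fontTools installed.
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    try:
        # getBestCmap() returns a {code point: glyph name} dict, or None.
        cmap = font.getBestCmap() or {}
    finally:
        font.close()
    return missing_characters(cmap, text)
```

An empty result from `check_subset_font("subset.ttf", "你好世界")` would suggest the character map survived subsetting, which narrows the problem down to other tables or to how the font is written into the PDF.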