PDF generation takes a long time after choosing a Unicode font
Hi everyone! First of all, I'm sorry if someone has already asked for help on the same subject, and for my terrible English xD.
I have to convert two txt files into one PDF. Each one has a different encoding: one is ISO-8859-1, the other UTF-8. When I use only the ISO-8859-1 file, the PDF is generated quickly and correctly. When I try with both, I run into trouble, because I need to use a Unicode font, so I'm using DejaVu, and now, when I try to generate a new PDF, it takes a very long time and never finishes. I tried loading the font from a specific directory; no error is displayed, but I have to force-stop the process because it never ends. At first I was using FPDF version 1, and a warning was displayed: "UserWarning: cmap value too big/small", and it took a long time too. I found out that version is deprecated, so now I'm here with version 2, with the same problem (but without the UserWarning).
So that's it, I'm waiting for an answer.
Thanks a lot to all contributors!
Hi @diego-insaurralde and welcome! 😊
I'd be happy to try to help you, but could you please provide some minimal Python code as a starting point, so that I can reproduce what you are trying to do?
If you don't know how to craft some minimal code reproducing your issue, you will find guidelines here: https://stackoverflow.com/help/minimal-reproducible-example
Hello @Lucas-C, I appreciate your attention.
Sorry for keeping you waiting, but let me explain now.
I rewrote all the code just to show you, and I will use an example txt that only contains the letter "Ó". Please disregard all the redundancies in the code; I tried to reproduce the same behavior that I need.
In PdfGeneratorClass, line 24, I add a Unicode font. I chose DejaVu, and I put all the fonts in the fonts folder, in the same directory (just for testing).
In the first test, I read a txt file with UTF-8 content, but I open it with ISO-8859-1 encoding. On lines 27 and 55, I set the built-in font "Times".
In the second file, I have some functions to read, prepare, and generate the PDF. On line 66, I'm reading the file with ISO-8859-1 encoding (for the first test).
In this test, a PDF file is generated in approximately 5 seconds, but the file is invalid, because not all characters are converted properly.
For the second test, I set a Unicode font, DejaVu, and on line 66 I change the encoding to UTF-8. It takes much more time, more than 2 minutes, but generates the expected file. However, I need to handle large files.
How can I handle this? I hope I have been clear. If you have any questions, let me know. I will attach a txt file; its contents don't matter.
First, here is my PdfGenerator class, which uses FPDF:
```python
from fpdf import FPDF
from typing import List


class PdfGenerator(FPDF):
    def __init__(
        self,
        header_btb: List[str],
        registers: List[str],
        totals: dict,
        initial_page: int,
        initial_seq_number: int,
    ):
        self.header_btb = header_btb
        self.registers = registers
        self.totals = totals
        self.initial_page = initial_page
        self.initial_seq_number = initial_seq_number
        self.line_separator = 1
        self.line_height = 3
        FPDF.__init__(self)
        self.add_font('DejaVuSans', '', 'fonts/DejaVuSans.ttf')

    def header(self) -> None:
        self.set_font("Times", "", 5)
        self.set_text_color(19, 16, 196)
        self.ln(2)
        self.header_btb[2] = (
            self.header_btb[2][:163] + f"{self.initial_page + self.page_no() - 1:>7d}"
        )
        for line in self.header_btb:
            if "------" in line:
                break
            self.cell(0, self.line_height, line, 0, self.line_separator)
        self.set_text_color(128, 128, 128)
        self.cell(
            0,
            self.line_height,
            f"{' ' * 5}{'_' * 141}{' ' * 5}",
            0,
            self.line_separator,
        )
        self.ln(1)
        self.set_text_color(19, 16, 196)

    def generate_report(self):
        sep_total = 128 * " " + "-" * 42
        self.set_margins(left=0.5, right=0.5, top=0.3)
        self.alias_nb_pages()
        self.add_page()
        self.set_font("Times", "", 5)
        self.set_text_color(19, 16, 196)
        for index, header in enumerate(self.registers):
            for item in header["register_contents"]:
                self.cell(0, self.line_height, item, 0, self.line_separator)
            self.set_text_color(128, 128, 128)
            self.cell(0, self.line_height, sep_total, 0, self.line_separator)
            # self.line(142, self.get_y(), 170, self.get_y())
            self.set_text_color(19, 16, 196)
            self.cell(0, self.line_height, header["total_line"], 0, self.line_separator)
        self.set_text_color(19, 16, 196)
        self.cell(0, self.line_height, self.totals["total"], 0, self.line_separator)
```
Here is the second file, which reads the txt file and generates the PDF:
```python
from pdf_gen import PdfGenerator
from werkzeug.datastructures import FileStorage
import io


def separate_header_content(file: io.TextIOWrapper):
    j = 0
    header = []
    contents = []
    isHeaderBuilding = False
    isNewContent = True
    numericNew = 0
    registers = []
    for line in file:
        if "Cabe" in line:
            j = 1
            if not header:
                isHeaderBuilding = True
                header.append(line)
            else:
                isHeaderBuilding = False
        elif j < 8:
            if isHeaderBuilding:
                header.append(line)
            j += 1
        elif "---" in line:
            continue
        elif "Total" in line:
            total_line = line
        else:
            numeric = line[1:8].strip()
            if numeric.isnumeric():
                if isNewContent:
                    isNewContent = False
                    numericNew = numeric
                    contents.append(line)
                else:
                    if numeric == numericNew:
                        contents.append(line)
                    else:
                        old_contents = contents.copy()
                        registers.append({
                            "register_contents": old_contents,
                            "total_line": total_line
                        })
                        contents.clear()
                        isNewContent = True
    return header, registers


def test_pdf(file_utf8, file_latin1=""):
    input("PRESS ANY BUTTON TO START")
    with open(file_utf8, "rb") as f:
        f1 = FileStorage(f)
        f1_buff = io.TextIOWrapper(f1, encoding="ISO-8859-1")  # encoding="utf-8-sig")
        header, registers = separate_header_content(f1_buff)
        for line in header:
            print(line)
        if file_latin1:
            f2 = open(file_latin1, "rb")
            f2 = FileStorage(f2)
            f2_buff = io.TextIOWrapper(f2, encoding="ISO-8859-1")
            header2, registers2 = separate_header_content(f2_buff)
            header += header2
            registers += registers2
        pdf = PdfGenerator(header, registers, {"total": "0,0", "accumulated": "0,0"}, 1, 1)
        pdf.generate_report()
        pdf.output(name="teste.pdf")
        if file_latin1:
            f2.close()


if __name__ == "__main__":
    file_utf8 = "utf8.txt"
    file_latin1 = ""
    test_pdf(file_utf8)
```
@diego-insaurralde, have you tried profiling your code?
```python
import cProfile

cProfile.run('test_pdf(file_utf8)')
```
This would give you (and us) a first idea of where your program is spending all that time.
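Going a step further, here is a self-contained sketch of sorting such a profile with the standard `pstats` module (`work()` is just a stand-in for your `test_pdf` call):

```python
import cProfile
import io
import pstats

def work():  # stand-in for test_pdf(file_utf8)
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by cumulative time and keep only the 10 most expensive entries:
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

Sorting by `"tottime"` instead shows where time is spent excluding sub-calls.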
@diego-insaurralde could you please execute the command pip freeze and send us the output?
I did it now, and I got this result. For the first table, I organized the data with pandas, sorted it by cumtime, and took the first 50 rows. For the second, I sorted by tottime and took the first 50.
```
243122522 function calls (243101129 primitive calls) in 177.024 seconds

ncalls | tottime | percall | cumtime | percall | filename:lineno(function)
---------------------------------------------------------------------------------------------------------------------
1 | 0.000 | 0.000 | 177.063 | 177.063 | <string>:1(<module>)
1 | 0.000 | 0.000 | 177.063 | 177.063 | main.py:62(test_pdf)
12/1 | 0.000 | 0.000 | 177.063 | 177.063 | {built-in method builtins.exec}
1 | 0.111 | 0.111 | 175.562 | 175.562 | pdf_gen.py:38(generate_report)
44638/39369 | 0.207 | 0.000 | 175.27 | 0.004 | fpdf.py:213(wrapper)
43678/39367 | 0.951 | 0.000 | 175.08 | 0.004 | fpdf.py:2660(cell)
43678/39367 | 2.204 | 0.000 | 170.485 | 0.004 | fpdf.py:2780(_render_styled_text_line)
43678 | 0.096 | 0.000 | 157.456 | 0.004 | line_break.py:204(render_pdf_text)
43678 | 12.979 | 0.000 | 157.282 | 0.004 | line_break.py:213(render_pdf_text_ttf)
7088728 | 6.944 | 0.000 | 139.757 | 0.000 | fonts.py:395(pick)
7088746 | 20.137 | 0.000 | 93.908 | 0.000 | fonts.py:414(get_glyph)
7054707 | 8.163 | 0.000 | 63.311 | 0.000 | ttFont.py:806(getBestCmap)
7060696/7060686 | 8.310 | 0.000 | 40.54 | 0.000 | ttFont.py:444(__getitem__)
7088728 | 10.464 | 0.000 | 38.905 | 0.000 | fonts.py:401(pick_glyph)
14150556 | 13.178 | 0.000 | 36.431 | 0.000 | {method 'get' of 'dict' objects}
7054707 | 5.984 | 0.000 | 14.736 | 0.000 | _c_m_a_p.py:78(getBestCmap)
7060941 | 5.555 | 0.000 | 12.305 | 0.000 | textTools.py:19(__new__)
14097404 | 11.587 | 0.000 | 11.587 | 0.000 | <string>:2(__eq__)
14097894 | 7.768 | 0.000 | 10.581 | 0.000 | <string>:2(__hash__)
43678/39367 | 0.051 | 0.000 | 9.5 | 0.000 | fpdf.py:3277(_perform_page_break_if_need_be)
7063336 | 5.825 | 0.000 | 9.476 | 0.000 | textTools.py:25(__eq__)
480 | 0.016 | 0.000 | 9.421 | 0.020 | fpdf.py:817(add_page)
479 | 0.002 | 0.000 | 9.4 | 0.020 | fpdf.py:3290(_perform_page_break)
480 | 0.016 | 0.000 | 9.362 | 0.020 | pdf_gen.py:21(header)
7054707 | 8.752 | 0.000 | 8.752 | 0.000 | _c_m_a_p.py:59(getcmap)
7048789 | 8.678 | 0.000 | 8.678 | 0.000 | <string>:2(__init__)
87356 | 0.649 | 0.000 | 8.35 | 0.000 | line_break.py:159(get_width)
14124277 | 4.803 | 0.000 | 7.655 | 0.000 | textTools.py:13(transcode)
87356 | 0.284 | 0.000 | 7.337 | 0.000 | fonts.py:197(get_text_width)
87394 | 1.653 | 0.000 | 7.026 | 0.000 | {built-in method builtins.sum}
14264812 | 4.322 | 0.000 | 5.373 | 0.000 | fonts.py:200(<genexpr>)
23100135 | 4.085 | 0.000 | 4.12 | 0.000 | {built-in method builtins.isinstance}
7060998 | 3.449 | 0.000 | 3.449 | 0.000 | textTools.py:28(__hash__)
14097894 | 2.813 | 0.000 | 2.813 | 0.000 | {built-in method builtins.hash}
7114366 | 2.781 | 0.000 | 2.781 | 0.000 | {built-in method __new__ of type object at 0x7fb22b487a00}
43678 | 0.623 | 0.000 | 2.707 | 0.000 | deprecation.py:44(get_stack_level)
21266217 | 2.013 | 0.000 | 2.013 | 0.000 | {built-in method builtins.ord}
7394474 | 1.912 | 0.000 | 1.912 | 0.000 | line_break.py:49(font)
174712 | 0.654 | 0.000 | 1.35 | 0.000 | inspect.py:896(getfile)
7048948 | 1.067 | 0.000 | 1.067 | 0.000 | {built-in method builtins.chr}
7088770 | 0.937 | 0.000 | 0.937 | 0.000 | {method 'keys' of 'dict' objects}
43678 | 0.140 | 0.000 | 0.773 | 0.000 | drawing.py:211(serialize)
141917/138991 | 0.217 | 0.000 | 0.73 | 0.000 | {method 'join' of 'str' objects}
1 | 0.560 | 0.560 | 0.56 | 0.560 | {built-in method builtins.input}
87356 | 0.333 | 0.000 | 0.542 | 0.000 | <frozen posixpath>:150(dirname)
43678 | 0.048 | 0.000 | 0.521 | 0.000 | fpdf.py:3090(_preload_font_styles)
1 | 0.000 | 0.000 | 0.513 | 0.513 | fpdf.py:4746(output)
174712 | 0.107 | 0.000 | 0.504 | 0.000 | drawing.py:212(<genexpr>)
1 | 0.008 | 0.008 | 0.474 | 0.474 | output.py:344(bufferize)
43678 | 0.182 | 0.000 | 0.473 | 0.000 | fpdf.py:3122(_parse_chars)
```
```
ncalls | tottime | percall | cumtime | percall | filename:lineno(function)
---------------------------------------------------------------------------------------------------------------------
7088746 | 20.137 | 0.000 | 93.908 | 0.000 | fonts.py:414(get_glyph)
14150556 | 13.178 | 0.000 | 36.431 | 0.000 | {method 'get' of 'dict' objects}
43678 | 12.979 | 0.000 | 157.282 | 0.004 | line_break.py:213(render_pdf_text_ttf)
14097404 | 11.587 | 0.000 | 11.587 | 0.000 | <string>:2(__eq__)
7088728 | 10.464 | 0.000 | 38.905 | 0.000 | fonts.py:401(pick_glyph)
7054707 | 8.752 | 0.000 | 8.752 | 0.000 | _c_m_a_p.py:59(getcmap)
7048789 | 8.678 | 0.000 | 8.678 | 0.000 | <string>:2(__init__)
7060696/7060686 | 8.31 | 0.000 | 40.54 | 0.000 | ttFont.py:444(__getitem__)
7054707 | 8.163 | 0.000 | 63.311 | 0.000 | ttFont.py:806(getBestCmap)
14097894 | 7.768 | 0.000 | 10.581 | 0.000 | <string>:2(__hash__)
7088728 | 6.944 | 0.000 | 139.757 | 0.000 | fonts.py:395(pick)
7054707 | 5.984 | 0.000 | 14.736 | 0.000 | _c_m_a_p.py:78(getBestCmap)
7063336 | 5.825 | 0.000 | 9.476 | 0.000 | textTools.py:25(__eq__)
7060941 | 5.555 | 0.000 | 12.305 | 0.000 | textTools.py:19(__new__)
14124277 | 4.803 | 0.000 | 7.655 | 0.000 | textTools.py:13(transcode)
14264812 | 4.322 | 0.000 | 5.373 | 0.000 | fonts.py:200(<genexpr>)
23100135 | 4.085 | 0.000 | 4.12 | 0.000 | {built-in method builtins.isinstance}
7060998 | 3.449 | 0.000 | 3.449 | 0.000 | textTools.py:28(__hash__)
14097894 | 2.813 | 0.000 | 2.813 | 0.000 | {built-in method builtins.hash}
7114366 | 2.781 | 0.000 | 2.781 | 0.000 | {built-in method __new__ of type object at 0x7fb22b487a00}
43678/39367 | 2.204 | 0.000 | 170.485 | 0.004 | fpdf.py:2780(_render_styled_text_line)
21266217 | 2.013 | 0.000 | 2.013 | 0.000 | {built-in method builtins.ord}
7394474 | 1.912 | 0.000 | 1.912 | 0.000 | line_break.py:49(font)
87394 | 1.653 | 0.000 | 7.026 | 0.000 | {built-in method builtins.sum}
7048948 | 1.067 | 0.000 | 1.067 | 0.000 | {built-in method builtins.chr}
43678/39367 | 0.951 | 0.000 | 175.08 | 0.004 | fpdf.py:2660(cell)
7088770 | 0.937 | 0.000 | 0.937 | 0.000 | {method 'keys' of 'dict' objects}
174712 | 0.654 | 0.000 | 1.35 | 0.000 | inspect.py:896(getfile)
87356 | 0.649 | 0.000 | 8.35 | 0.000 | line_break.py:159(get_width)
43678 | 0.623 | 0.000 | 2.707 | 0.000 | deprecation.py:44(get_stack_level)
1 | 0.56 | 0.560 | 0.56 | 0.560 | {built-in method builtins.input}
131034 | 0.353 | 0.000 | 0.397 | 0.000 | drawing.py:90(number_to_str)
90784 | 0.338 | 0.000 | 0.341 | 0.000 | {method 'encode' of 'str' objects}
87356 | 0.333 | 0.000 | 0.542 | 0.000 | <frozen posixpath>:150(dirname)
45118 | 0.293 | 0.000 | 0.347 | 0.000 | fpdf.py:4316(_out)
87356 | 0.284 | 0.000 | 7.337 | 0.000 | fonts.py:197(get_text_width)
43678 | 0.283 | 0.000 | 0.283 | 0.000 | {built-in method _warnings.warn}
87356 | 0.241 | 0.000 | 0.241 | 0.000 | line_break.py:69(font_size_pt)
141917/138991 | 0.217 | 0.000 | 0.73 | 0.000 | {method 'join' of 'str' objects}
44638/39369 | 0.207 | 0.000 | 175.27 | 0.004 | fpdf.py:213(wrapper)
482 | 0.194 | 0.000 | 0.194 | 0.000 | {built-in method zlib.compress}
43678 | 0.182 | 0.000 | 0.473 | 0.000 | fpdf.py:3122(_parse_chars)
1 | 0.173 | 0.173 | 0.219 | 0.219 | main.py:5(separate_header_content)
43678 | 0.14 | 0.000 | 0.773 | 0.000 | drawing.py:211(serialize)
43678 | 0.138 | 0.000 | 0.145 | 0.000 | line_break.py:27(__init__)
262548 | 0.123 | 0.000 | 0.123 | 0.000 | graphics_state.py:141(current_font)
43678 | 0.112 | 0.000 | 0.215 | 0.000 | util.py:18(escape_parens)
1 | 0.111 | 0.111 | 175.562 | 175.562 | pdf_gen.py:38(generate_report)
87356 | 0.11 | 0.000 | 0.172 | 0.000 | fpdf.py:414(is_ttf_font)
174712 | 0.107 | 0.000 | 0.504 | 0.000 | drawing.py:212(<genexpr>)
```
Sure! But the main libraries are Flask and fpdf2.
```
blinker==1.6.2
click==8.1.7
defusedxml==0.7.1
Flask==2.3.3
fonttools==4.42.1
fpdf2==2.7.5
itsdangerous==2.1.2
Jinja2==3.1.2
MarkupSafe==2.1.3
numpy==1.25.2
pandas==2.0.3
Pillow==10.0.0
python-dateutil==2.8.2
pytz==2023.3
six==1.16.0
tabulate==0.9.0
tzdata==2023.3
Werkzeug==2.3.7
```
Hi @diego-insaurralde
Thank you for taking the time to provide some code reproducing your problem.
First, I want to mention that the script you provided is not MINIMAL: a minimal reproducing program would be less than 20 lines long, yours is more than 160 lines long. Taking the time to narrow down the problem to the shortest code possible is very important in order to understand it 😊
Second, when strictly following your instructions, I got an IndexError raised.
You wrote:
> I rewrote all the code just to show you, and I will use a example txt where it only contains letter "Ó".
But when I executed those commands in my Linux shell:
```shell
echo 'Ó' > utf8.txt
time python issue_907.py
```
It seems that your PdfGenerator.header() method does not handle well single-letter files.
Nevertheless I tested your script by feeding it itself (__file__ instead of "utf8.txt"),
and I'm sorry but I do not reproduce the situation you described with an execution time of 2 minutes 😞
On my computer, the script execution takes less than 2 seconds.
There is the single-file Python script I used, based on the code snippets you shared: issue_907.py.txt
Hello Lucas, thanks for your answer. Sorry for my long code; I have reproduced the problem with fewer lines xD. You didn't hit the problem because you didn't switch to the Unicode font (DejaVu), and you read utf8.txt with ISO-8859-1 encoding. The code below is correct and ready to run. PS: I tried to get it under 20 lines xD.
```python
from fpdf import FPDF
from typing import List
from werkzeug.datastructures import FileStorage
import io


class PdfGenerator(FPDF):
    def __init__(self, registers: List[str]):
        self.registers = registers
        FPDF.__init__(self)
        self.add_font('DejaVuSans', '', 'fonts/DejaVuSans.ttf')

    def generate_report(self):
        self.set_margins(left=0.5, right=0.5, top=0.3)
        self.alias_nb_pages()
        self.add_page()
        self.set_font('DejaVuSans', '', 5)
        self.set_text_color(19, 16, 196)
        for register in self.registers:
            self.cell(0, 3, register, 0, 1)


def test_pdf(file_utf8):
    with open(file_utf8, "rb") as f:
        f1 = FileStorage(f)
        f1_buff = io.TextIOWrapper(f1, encoding="utf-8")
        lines = f1_buff.readlines()
        pdf = PdfGenerator(lines)
        pdf.generate_report()
        pdf.output(name="teste.pdf")


if __name__ == "__main__":
    file_utf8 = "utf8.txt"
    test_pdf(file_utf8)
```
A patch by @andersonhc has just been merged which should reduce the time required for this specific example by roughly 30%. It is possible that we may eventually find more potential optimizations, but you really shouldn't expect any wonders.
When using the built-in fonts, then the text pretty much gets written directly to the PDF file, with hardly any processing overhead. When using TTF fonts, there's quite some work to be done behind the scenes, from actually reading the font file, embedding the font data into the PDF file, and converting Unicode code points to index positions relative to the embedded data. The amount of data that needs to be output will also increase by a factor of up to 6 or 7. There's nothing we can do about this part, that's just how the PDF format works.
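As a toy illustration of that extra per-character work (the two-entry `cmap` here is a made-up mapping for illustration, not a real font table):

```python
# Built-in (latin-1) fonts: the text bytes go into the PDF almost as-is.
latin1_payload = "Ó".encode("latin-1")

# TTF/Unicode fonts: every code point must first be translated to the glyph
# index inside the embedded font, then hex-encoded for the PDF text operator.
cmap = {ord("Ó"): 0x00D4, ord("A"): 0x0024}  # made-up code point -> glyph ID table

def to_pdf_hex_string(text: str) -> str:
    return "<" + "".join(f"{cmap[ord(ch)]:04X}" for ch in text) + ">"

print(latin1_payload)           # → b'\xd3'
print(to_pdf_hex_string("ÓA"))  # → <00D40024>
```

One byte of input turning into four hex characters of output is roughly where the "factor of up to 6 or 7" in output size comes from.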
Since we're talking about an 11+ MB file with ~100k lines of text and ~7 million characters here, any pure Python solution will eventually reach its limits. Maybe you can get it to work somehow, but it may also be that you'll have to find a package written in a compiled language that does things a lot faster.
As for your original description of the production job running "forever": have you tried instrumenting your code so that it gives you some progress feedback? If you e.g. write some console output for each new page, this would give you a better idea of how close you are to a working solution, or how far away from it.
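A minimal sketch of such instrumentation (the `ProgressReporter` class and the idea of calling it from your `header()` override are just suggestions, not fpdf2 API):

```python
import sys
import time

class ProgressReporter:
    """Prints a status line every `every` units of work, so a long job is never silent."""

    def __init__(self, label: str, every: int = 1):
        self.label = label
        self.every = every
        self.count = 0
        self.start = time.perf_counter()

    def tick(self) -> None:
        self.count += 1
        if self.count % self.every == 0:
            elapsed = time.perf_counter() - self.start
            print(f"{self.label}: {self.count} done after {elapsed:.1f}s", file=sys.stderr)

# Calling reporter.tick() once per generated page (e.g. at the top of a
# header() override) would print one progress line every 10 pages:
reporter = ProgressReporter("pages", every=10)
for _ in range(25):
    reporter.tick()
print(reporter.count)  # → 25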
[gmischler: removed pointless fullquote]
I understand. Is it not possible to add a Unicode font as a native option, similar to the built-in ISO-8859-1 fonts? From my perspective, as someone external to the situation, it seems unusual that this library doesn't handle UTF-8 very well, given how widely used this encoding is.
> Is it not possible to add a Unicode font as a native option, similar to fonts in the ISO-8859-1 encoding?
Short answer: No. I've outlined some of the practical reasons above. A latin-1 font can contain at most 256 characters via 8-bit addressing. A Unicode font can potentially contain up to 4294967296 characters via 32 bit addressing. Dealing with the latter format simply takes a lot more processing. There's no way around that.
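As a rough illustration of this difference on the text-encoding side (this is about Python string encoding, not the PDF font machinery itself):

```python
text = "Ó" * 10

# With latin-1, every representable character maps to exactly one byte
# (8-bit addressing, at most 256 distinct characters):
print(len(text.encode("latin-1")))  # → 10

# With UTF-8, "Ó" already needs two bytes, and other characters
# can need up to four:
print(len(text.encode("utf-8")))    # → 20
```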
> From my perspective, as someone external to the situation, it seems unusual that this library doesn't handle UTF-8 encodings very well, once this encoding is widely used.
That statement makes no sense. Unicode file encodings like UTF-8 are completely irrelevant in this context and fpdf2 never even sees them in normal operation. At the same time, no other Python based PDF library can handle Unicode text with TTF fonts as completely and correctly as fpdf2, and none of them is significantly faster.
Maybe you should try to get at least some basic understanding of the relevant topics (Unicode, TTF fonts, and the PDF specification) before criticizing the work of other people.
In summary, I repeat: it is likely impossible for any Python library to deliver the performance you wish for.
I agree overall with @gmischler's answer.
However, I think that @diego-insaurralde's report is very valuable for knowing the current performance limitations of fpdf2, based on real-world usage by users.
For that, thank you @diego-insaurralde for reaching out on this subject and opening this issue 😊👍
Thank you also for providing a 2nd, shorter code snippet. With this code snippet I was able to reproduce on my computer the performance issue you described.
I took the time to investigate a little, to try to figure out where the bottleneck is. This is the performance report I produced:
```
$ time ./issue_907.py
Font MPDFAA+DejaVuSansBook is missing the following glyphs:

Stage                            | Total exec time (ms) | #execs  | Average single exec duration (ms)
---------------------------------|----------------------|---------|----------------------------------
_render_styled_text_line()       | 77839                | 103555  | 0.75
frag.render_pdf_text_ttf()       | 71336                | 103555  | 0.69
self.font.subset.pick(ord(char)) | 51185                | 7795892 | 0.01

real 1m21,678s
```
The last line in the table above is a specific code line that is evaluated almost 8 million times: https://github.com/py-pdf/fpdf2/blob/2.7.5/fpdf/line_break.py#L217
With an average of 0.01ms per execution, this code is already very fast.
But I wonder if things couldn't be improved a little bit...
For example I see repeated calls to ttfont.getBestCmap() in the code.
Maybe that could be optimized / cached?
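As a sketch of the kind of caching I mean (`load_best_cmap` is a hypothetical stand-in for the repeated `ttfont.getBestCmap()` calls, not real fpdf2 code):

```python
from functools import lru_cache

parse_count = 0

@lru_cache(maxsize=None)
def load_best_cmap(font_path: str) -> tuple:
    """Hypothetical stand-in: pretend this parses the font's cmap table (expensive)."""
    global parse_count
    parse_count += 1
    # Something hashable standing in for the {code point -> glyph name} mapping:
    return ((ord("Ó"), "Oacute"),)

# Even if looked up once per character of a 7-million-character document,
# the expensive parse runs only once per font file:
for _ in range(10_000):
    load_best_cmap("fonts/DejaVuSans.ttf")
print(parse_count)  # → 1
```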
And finally, I think it may be interesting to translate this usage scenario into a performance non-regression test, similar to the ones we already have that use @ensure_exec_time_below.
> A patch by @andersonhc has just been merged which should reduce the time required for this specific example by roughly 30%.
Oh, I missed that the repeated calls to ttfont.getBestCmap() have already been removed by @andersonhc in PR #908 / commit b74a5906e4f114557ae26316ab5298c51ad6e858 😅
Well done 👍
Anyway I opened #911 and I'd be happy to have your reviews on it, and maybe more suggestions of performance improvements?
I also compared the performance of the script provided by @diego-insaurralde (this one: issue_907.py) between fpdf2 v2.7.4 and fpdf2 v2.7.5.
On my machine, I get those results:
- with `fpdf2` v2.7.4:

```
$ time ./issue_907.py
real 0m10,501s
```

- with `fpdf2` v2.7.5 (more precisely the current state of this git repo's `master` branch, including the recent patch by @andersonhc):

```
$ time ./issue_907.py
real 0m30,115s
```
So I think we got an actual speed regression in fpdf2 v2.7.5 😢
I'm sorry @gmischler, but this shows that your previous statements were maybe too categorical:
> Dealing with the latter format simply takes a lot more processing. There's no way around that. [...] In summary, I repeat: it is likely impossible for any Python library to deliver the performance you wish for.
Given that, in this specific usage scenario, fpdf2 was faster in v2.7.4, there may be still room for improvement 😊
And having performance non-regression tests would definitely be helpful to avoid that in the future.
First of all, I would like to apologize: I didn't mean to criticize anything, just to provoke a discussion in order to get a better understanding. After your explanation, @gmischler, I'm now aware of how delicate the solution can be. I have a problem and I was looking for a solution; that's why I'm here.
@Lucas-C Thank you so much for your clarifications. Now I can hope for a solution xD.
I profiled issue_907.py using cProfile:
```shell
python -m cProfile -o issue_907.pstats issue_907.py
```
Here is the result as an SVG flame graph (produced using flameprof):
It makes very obvious the places where optimizations would be worthwhile 🙂
I also opened #913 to try to make test_cell_speed_with_long_text() faster
> Here is the result as an SVG flame graph (produced using flameprof):
It might be interesting to see the same thing for 2.7.4 for comparison (and after #913, of course).
> It might be interesting to see the same thing for 2.7.4 for comparison
Sure, good idea. Here is the flame graph for issue_907.py executed with fpdf2 version 2.7.4:
Comparing the two graphs, a few things jump out at me:
- In 2.7.4, get_width() appears to have been the largest single bottleneck (not entirely unexpected).
- In 2.7.5, this part has been squished into the smaller left tower, while actually picking glyphs to render takes a lot more time.
- It will be interesting to see in which ways and by how much #913 will change that.
I suspect though, that really significant gains can better be reached on an architectural level than by individual local optimizations.
> I suspect though, that really significant gains can better be reached on an architectural level than by individual local optimizations.
I totally agree, but right now I do not see clearly what architectural changes we should make to improve things...
Here is the performance flame graph for issue_907.py on develop, after merging PR #913:
Not much has changed in the shape of the stacks...
FYI, I added a section in our docs on how to investigate performance issues: https://py-pdf.github.io/fpdf2/Development.html#testing-performances
I've just been looking through #913 again and it seems that the good idea of caching glyph IDs could be expanded on further.
At the moment, every time a specific glyph needs to be rendered, a new Glyph() instance is created. As far as I can tell, all of the instances for a specific glyph will be identical, which means that they are essentially redundant, and only one of each type is needed. So instead of caching glyph IDs, we should probably rather be caching Glyph() instances. In our current example, that would eliminate the instantiation of around 7 million redundant objects.
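A sketch of that idea with a simplified, hypothetical `Glyph` (fpdf2's real glyph object has different fields, and the glyph-ID lookup here is a dummy):

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class Glyph:
    """Simplified stand-in for fpdf2's glyph object."""
    glyph_id: int
    code_points: Tuple[int, ...]

_glyph_cache: Dict[Tuple[int, ...], Glyph] = {}

def get_glyph(code_points: Tuple[int, ...]) -> Glyph:
    """Return one shared Glyph instance per code-point sequence instead of a fresh object."""
    glyph = _glyph_cache.get(code_points)
    if glyph is None:
        # Dummy lookup standing in for the real font-table resolution:
        glyph = Glyph(glyph_id=hash(code_points) & 0xFFFF, code_points=code_points)
        _glyph_cache[code_points] = glyph
    return glyph

# Rendering "Ó" millions of times would now reuse one object
# instead of instantiating millions of identical ones:
first = get_glyph((ord("Ó"),))
second = get_glyph((ord("Ó"),))
print(first is second)  # → True
```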
@andersonhc, am I missing anything obvious here? Is there any situation where a Glyph() instance for the same sequence of Unicode code points would have different contents than the others, or where one would be modified after creation?
If not, then anyone is free to run with the idea. Unfortunately I don't have time myself to follow up on this since I'm still focusing on getting #897 production ready.
> I've just been looking through #913 again and it seems that the good idea of caching glyph IDs could be expanded on further. [...] So instead of caching glyph IDs, we should probably rather be caching `Glyph()` instances. In our current example, that would eliminate the instantiation of around 7 million redundant objects.
Sounds great!! 👍
Given that PR #913 has been merged, I'm going to close this.
Please open a new issue on this subject if need be!