openhtmltopdf
Generation of pdf is too slow for large html
I am now using version 1.0.2, but the PDF build still hangs. The size of the HTML is 13241929. I have tried many times and increased the heap size to 4G. My machine is an i5 4460 with 16G RAM.
The test HTML is attached: test.txt
My code for PDF generation is as follows:
public byte[] generateFromHtml(String html) throws Exception {
    try (ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
        PdfRendererBuilder builder = new PdfRendererBuilder();
        builder.useFont(getFont(PMingLiU), "PMingLiU");
        builder.useFont(getFont(PMingLiUExtB), "PMingLiU-ExtB");
        builder.useFont(getFont(seguiemj), "Segoe UI Emoji");
        builder.withHtmlContent(html, null);
        builder.useFastMode();
        builder.toStream(byteArrayOutputStream);
        builder.run();
        return byteArrayOutputStream.toByteArray();
    }
}
Originally posted by @Infinity821 in https://github.com/danfickle/openhtmltopdf/issues/180#issuecomment-640995477
Hi @Infinity821,
Using the master branch and version 1.0.3, I've been able to generate the pdf using the attached test html.
Code for PDF generation (note: I was not able to find the correct font for PMingLiU-ExtB, but I don't think it has an effect):
try (OutputStream os = new FileOutputStream("out.pdf")) {
    PdfRendererBuilder builder = new PdfRendererBuilder();
    builder.useFont(new File("PMINGLIU.ttf"), "PMingLiU");
    builder.useFont(new File("PMINGLIU.ttf"), "PMingLiU-ExtB");
    builder.useFont(new File("seguiemj.ttf"), "Segoe UI Emoji");
    builder.useFastMode();
    builder.withFile(new File("test.html"));
    builder.toStream(os);
    builder.run();
}
resulting pdf: out.pdf
By the way, have you tried with the version 1.0.3?
(Using a Ryzen 1700 PC with 16 GB RAM, Java 11, default heap configuration; execution time 5716 ms.)
I've noticed that with heavily mixed-font text, up to 80% of CPU self-time is spent initialising the IllegalArgumentException that pdfbox uses to indicate that the current font does not support the passed-in characters. Therefore, changing to a canDisplayUpTo-style method may give a large performance gain, but it would require work on pdfbox as well as this project.
P.s. According to VisualVM.
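To make the canDisplayUpTo idea concrete, here is a minimal sketch modelled on the contract of java.awt.Font.canDisplayUpTo (index of the first unsupported character, or -1). The class GlyphCheckSketch, both of its methods, and the set of supported code points are hypothetical stand-ins and are not part of pdfbox or openhtmltopdf. The point is only that an index-returning check avoids constructing an exception (and filling in its stack trace) for every unsupported character.

```java
import java.util.Set;

public class GlyphCheckSketch {

    // Exception-based style: throws on the first unsupported code point.
    // Each throw pays for building the exception and its stack trace.
    public static void encodeOrThrow(String text, Set<Integer> supported) {
        text.codePoints().forEach(cp -> {
            if (!supported.contains(cp)) {
                throw new IllegalArgumentException(
                        String.format("U+%04X is not available in this font", cp));
            }
        });
    }

    // Check-based style: returns the index of the first char that cannot be
    // displayed, or -1 if the whole string is supported. No exception is created.
    public static int canDisplayUpTo(String text, Set<Integer> supported) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!supported.contains(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical font that only supports 'a', 'b' and 'c'.
        Set<Integer> supported = Set.of((int) 'a', (int) 'b', (int) 'c');
        String text = "ab\u200Bc";

        System.out.println(canDisplayUpTo(text, supported)); // prints 2

        try {
            encodeOrThrow(text, supported);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints U+200B is not available in this font
        }
    }
}
```

A caller could use the returned index to split the text into runs and pick a fallback font for the unsupported run, instead of catching an exception per character.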
@Infinity821, can you try CPU sampling with VisualVM and post a screenshot of the hotspots?
I've got the same issue with some very (very) large HTML files (up to 600 MB). Several of my files end in an OOM, so I had to test with some smaller files (~22 MB).
I can confirm that many IllegalArgumentExceptions are raised, as seen in the following screenshot (from a JFR recording):
Unfortunately I can't test a larger file due to the memory limitation (-Xmx13g -XX:+UseG1GC).
Here are some other useful metrics:
Is there any way to prevent the OOM (even if the generation takes longer)?
@danfickle I'm willing to provide some HTML samples by PM if you need them.
The biggest problem seems to be caused by the numerous zerowidthspace (U+200B) characters inserted for whitespace contained within the HTML. The character is not available in Helvetica, and its width should just be zero (as the name says). I checked the HTML for any zero-width spaces that I could remove, but they seem to be inserted internally. 🤷
java.lang.IllegalArgumentException: U+200B ('zerowidthspace') is not available in this font Helvetica encoding: WinAnsiEncoding
    at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:427)
    at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
    at org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:364)
    at com.openhtmltopdf.pdfboxout.PdfBoxTextRenderer.getWidth(PdfBoxTextRenderer.java:337)
    at com.openhtmltopdf.layout.Breaker.lambda$doBreakText$1(Breaker.java:526)
    at com.openhtmltopdf.layout.Breaker.doBreakTextWords(Breaker.java:560)
    at com.openhtmltopdf.layout.Breaker.doBreakText(Breaker.java:531)
    at com.openhtmltopdf.layout.Breaker.doBreakText(Breaker.java:317)
    at com.openhtmltopdf.layout.Breaker.breakText(Breaker.java:188)
    at com.openhtmltopdf.layout.InlineBoxing.layoutText(InlineBoxing.java:1126)
    at com.openhtmltopdf.layout.InlineBoxing.startInlineText(InlineBoxing.java:410)
    at com.openhtmltopdf.layout.InlineBoxing.layoutContent(InlineBoxing.java:192)
    at com.openhtmltopdf.render.BlockBox.layoutInlineChildren(BlockBox.java:1227)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1208)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableRowBox.layoutCell(TableRowBox.java:452)
    at com.openhtmltopdf.newtable.TableRowBox.layoutChildren(TableRowBox.java:206)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableRowBox.layout(TableRowBox.java:95)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:103)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.newtable.TableSectionBox.layoutChildren(TableSectionBox.java:137)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableSectionBox.layout(TableSectionBox.java:278)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:109)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.newtable.TableBox.layoutChildren(TableBox.java:316)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.newtable.TableBox.layoutTable(TableBox.java:281)
    at com.openhtmltopdf.newtable.TableBox.layout(TableBox.java:240)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:109)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableRowBox.layoutCell(TableRowBox.java:452)
    at com.openhtmltopdf.newtable.TableRowBox.layoutChildren(TableRowBox.java:206)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableRowBox.layout(TableRowBox.java:95)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:103)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.newtable.TableSectionBox.layoutChildren(TableSectionBox.java:137)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.newtable.TableSectionBox.layout(TableSectionBox.java:278)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:103)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.newtable.TableBox.layoutChildren(TableBox.java:316)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.newtable.TableBox.layoutTable(TableBox.java:281)
    at com.openhtmltopdf.newtable.TableBox.layout(TableBox.java:240)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:109)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
    at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.layout(PdfBoxRenderer.java:346)
    at com.openhtmltopdf.pdfboxout.PdfRendererBuilder.run(PdfRendererBuilder.java:45)
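Since U+200B should contribute zero width, one possible workaround is to drop it before asking the font for a width, so the font never sees a character it cannot encode. The sketch below illustrates that idea only. ZeroWidthFilterSketch, its methods, and the ToDoubleFunction standing in for the font's width call are hypothetical; this is not how openhtmltopdf or pdfbox currently handles the character.

```java
import java.util.function.ToDoubleFunction;

public class ZeroWidthFilterSketch {

    private static final char ZERO_WIDTH_SPACE = '\u200B';

    // Removes every U+200B from the text; returns the same instance when none is present.
    public static String stripZeroWidthSpaces(String text) {
        if (text.indexOf(ZERO_WIDTH_SPACE) < 0) {
            return text; // fast path: nothing to strip, no allocation
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ZERO_WIDTH_SPACE) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Measures text with the supplied width function (a stand-in for a font's
    // width call), treating U+200B as contributing zero width.
    public static double widthIgnoringZeroWidthSpaces(String text,
                                                      ToDoubleFunction<String> fontWidth) {
        return fontWidth.applyAsDouble(stripZeroWidthSpaces(text));
    }

    public static void main(String[] args) {
        // Hypothetical font: 5 units per char, rejects U+200B like the trace above.
        ToDoubleFunction<String> fakeFont = s -> {
            if (s.indexOf(ZERO_WIDTH_SPACE) >= 0) {
                throw new IllegalArgumentException("U+200B is not available in this font");
            }
            return s.length() * 5.0;
        };

        System.out.println(widthIgnoringZeroWidthSpaces("a\u200Bb", fakeFont)); // prints 10.0
    }
}
```

The fast path matters for large documents: most text runs contain no zero-width space, so they are measured without any copying.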
I'm trying to convert a 30 MB HTML file and it takes around 50 seconds. Is there any way to improve this?