
Generation of pdf is too slow for large html

Open Infinity821 opened this issue 4 years ago • 5 comments

I am now using version 1.0.2, but the PDF build still hangs. The size of the HTML is 13241929. I have tried many times and increased the heap size to 4 GB. My machine has an i5-4460 and 16 GB of RAM.

Attached is the test HTML: test.txt

My code for PDF generation is as follows:

    public byte[] generateFromHtml(String html) throws Exception {
        try (ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.useFont(getFont(PMingLiU), "PMingLiU");
            builder.useFont(getFont(PMingLiUExtB), "PMingLiU-ExtB");
            builder.useFont(getFont(seguiemj), "Segoe UI Emoji");
            builder.withHtmlContent(html, null);
            builder.useFastMode();
            builder.toStream(byteArrayOutputStream);
            builder.run();
            return byteArrayOutputStream.toByteArray();
        }
    }

Originally posted by @Infinity821 in https://github.com/danfickle/openhtmltopdf/issues/180#issuecomment-640995477

Infinity821 • Jun 22 '20 01:06

Hi @Infinity821,

Using the master branch and version 1.0.3, I've been able to generate the pdf using the attached test html.

Code for pdf generation (note, I was not able to find the correct font for PMingLiU-ExtB, but I don't think it has an effect):

    try (OutputStream os = new FileOutputStream("out.pdf")) {
        PdfRendererBuilder builder = new PdfRendererBuilder();
        builder.useFont(new File("PMINGLIU.ttf"), "PMingLiU");
        builder.useFont(new File("PMINGLIU.ttf"), "PMingLiU-ExtB");
        builder.useFont(new File("seguiemj.ttf"), "Segoe UI Emoji");
        builder.useFastMode();
        builder.withFile(new File("test.html"));
        builder.toStream(os);
        builder.run();
    }

Resulting PDF: out.pdf

By the way, have you tried version 1.0.3?

(Using a Ryzen 1700 PC with 16 GB RAM, Java 11, default heap configuration; execution time 5716 ms.)

syjer • Jun 22 '20 11:06

I've noticed that with heavily mixed-font text, up to 80% of CPU self-time is spent initialising the IllegalArgumentException that PDFBox uses to indicate that the current font does not support the passed-in characters. Switching to a canDisplayUpTo-style method could therefore be a large performance gain, but it would require work on PDFBox as well as on this project.

P.s. According to VisualVM.
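To illustrate the trade-off described above, here is a minimal sketch that contrasts signalling an unsupported character by exception with a query that returns the index of the first unsupported character. Everything in it is an assumption made for illustration: `GlyphCheckSketch`, its `SUPPORTED` set, and the fixed glyph width are hypothetical and are not PDFBox or openhtmltopdf code. The `canDisplayUpTo` name only borrows the return convention of `java.awt.Font.canDisplayUpTo` (-1 when every character is displayable).

```java
import java.util.Set;

/**
 * Hypothetical stand-in for a font with limited glyph coverage.
 * This is NOT the PDFBox API; it only mimics the two signalling styles.
 */
class GlyphCheckSketch {
    // Characters this pretend font can render (assumption for the demo).
    private static final Set<Character> SUPPORTED = Set.of('a', 'b', 'c', ' ');

    /** Exception style: throws as soon as an unsupported character is seen. */
    static int widthOrThrow(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!SUPPORTED.contains(c)) {
                throw new IllegalArgumentException(
                        String.format("U+%04X is not available in this font", (int) c));
            }
        }
        return text.length() * 500; // pretend every glyph is 500 units wide
    }

    /** Query style: index of the first unsupported char, or -1 if all are supported. */
    static int canDisplayUpTo(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (!SUPPORTED.contains(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "ab\u200Bc"; // contains a zero-width space
        int rounds = 200_000;

        long t0 = System.nanoTime();
        int failures = 0;
        for (int i = 0; i < rounds; i++) {
            try {
                widthOrThrow(text);
            } catch (IllegalArgumentException e) {
                failures++; // each throw also captures a stack trace
            }
        }
        long t1 = System.nanoTime();

        int misses = 0;
        for (int i = 0; i < rounds; i++) {
            if (canDisplayUpTo(text) != -1) {
                misses++;
            }
        }
        long t2 = System.nanoTime();

        System.out.printf("exceptions: %d in %d ms%n", failures, (t1 - t0) / 1_000_000);
        System.out.printf("queries:    %d in %d ms%n", misses, (t2 - t1) / 1_000_000);
    }
}
```

The timings depend on the JVM and on stack depth, so this is only a rough way to see the difference, not a benchmark.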

@Infinity821, can you try CPU sampling with VisualVM and post a screenshot of the hotspots?

danfickle • Jun 22 '20 14:06

I've got the same issue with some very (very) large HTML files (up to 600 MB). Several of my files end up in an OOM, so I had to test with some smaller files (~22 MB).

I can confirm that many IllegalArgumentExceptions are raised, as seen in the following screenshot (from a JFR recording): image

Unfortunately I can't test a larger file due to the memory limitation (-Xmx13g -XX:+UseG1GC).

Here are some other useful metrics:

image

Is there any way to prevent the OOM (even if the generation takes longer)?

@danfickle I'm willing to provide some HTML samples by PM if you need them.

olivergg • Mar 02 '21 07:03

The biggest problem seems to be the numerous zero-width space characters (U+200B, 'zerowidthspace') inserted for whitespace contained within the HTML. This character is not available in Helvetica, and its width should simply be zero, as the name says. I checked the HTML for any zero-width spaces that I could remove, but they seem to be inserted internally. 🤷‍♂️

    java.lang.IllegalArgumentException: U+200B ('zerowidthspace') is not available in this font Helvetica encoding: WinAnsiEncoding
        at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:427)
        at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
        at org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:364)
        at com.openhtmltopdf.pdfboxout.PdfBoxTextRenderer.getWidth(PdfBoxTextRenderer.java:337)
        at com.openhtmltopdf.layout.Breaker.lambda$doBreakText$1(Breaker.java:526)
        at com.openhtmltopdf.layout.Breaker.doBreakTextWords(Breaker.java:560)
        at com.openhtmltopdf.layout.Breaker.doBreakText(Breaker.java:531)
        at com.openhtmltopdf.layout.Breaker.doBreakText(Breaker.java:317)
        at com.openhtmltopdf.layout.Breaker.breakText(Breaker.java:188)
        at com.openhtmltopdf.layout.InlineBoxing.layoutText(InlineBoxing.java:1126)
        at com.openhtmltopdf.layout.InlineBoxing.startInlineText(InlineBoxing.java:410)
        at com.openhtmltopdf.layout.InlineBoxing.layoutContent(InlineBoxing.java:192)
        at com.openhtmltopdf.render.BlockBox.layoutInlineChildren(BlockBox.java:1227)
        at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1208)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
        at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
        at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.newtable.TableRowBox.layoutCell(TableRowBox.java:452)
        at com.openhtmltopdf.newtable.TableRowBox.layoutChildren(TableRowBox.java:206)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.newtable.TableRowBox.layout(TableRowBox.java:95)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
        at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:103)
        at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
        at com.openhtmltopdf.newtable.TableSectionBox.layoutChildren(TableSectionBox.java:137)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.newtable.TableSectionBox.layout(TableSectionBox.java:278)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
        at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:109)
        at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
        at com.openhtmltopdf.newtable.TableBox.layoutChildren(TableBox.java:316)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
        at com.openhtmltopdf.newtable.TableBox.layoutTable(TableBox.java:281)
        at com.openhtmltopdf.newtable.TableBox.layout(TableBox.java:240)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
        at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:109)
        at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.newtable.TableRowBox.layoutCell(TableRowBox.java:452)
        at com.openhtmltopdf.newtable.TableRowBox.layoutChildren(TableRowBox.java:206)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.newtable.TableRowBox.layout(TableRowBox.java:95)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
        at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:103)
        at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
        at com.openhtmltopdf.newtable.TableSectionBox.layoutChildren(TableSectionBox.java:137)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.newtable.TableSectionBox.layout(TableSectionBox.java:278)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
        at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:103)
        at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
        at com.openhtmltopdf.newtable.TableBox.layoutChildren(TableBox.java:316)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
        at com.openhtmltopdf.newtable.TableBox.layoutTable(TableBox.java:281)
        at com.openhtmltopdf.newtable.TableBox.layout(TableBox.java:240)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
        at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:109)
        at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
        at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
        at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
        at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
        at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
        at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:1211)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:1065)
        at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:984)
        at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.layout(PdfBoxRenderer.java:346)
        at com.openhtmltopdf.pdfboxout.PdfRendererBuilder.run(PdfRendererBuilder.java:45)
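Since the trace shows the exception being raised from a width measurement, one conceivable mitigation is to treat U+200B as zero width before the font is ever asked. The sketch below only illustrates that idea and is not how openhtmltopdf or PDFBox handle it: `ZeroWidthSketch`, `stripZeroWidthSpaces`, `measure`, and the `fontWidth` parameter are hypothetical names, and `fontWidth` stands in for whatever call actually measures text (such as the `PDFont.getStringWidth` frame in the trace).

```java
import java.util.function.ToDoubleFunction;

/** Hypothetical helper: counts U+200B as zero width instead of asking the font. */
class ZeroWidthSketch {
    static final char ZERO_WIDTH_SPACE = '\u200B';

    /** Removes every U+200B; returns the input unchanged when none is present. */
    static String stripZeroWidthSpaces(String text) {
        if (text.indexOf(ZERO_WIDTH_SPACE) < 0) {
            return text; // fast path: nothing to strip, no allocation
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ZERO_WIDTH_SPACE) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Measures text with U+200B contributing no width.
     * fontWidth is an assumed stand-in for the real font measurement call.
     */
    static double measure(String text, ToDoubleFunction<String> fontWidth) {
        String visible = stripZeroWidthSpaces(text);
        return visible.isEmpty() ? 0.0 : fontWidth.applyAsDouble(visible);
    }

    public static void main(String[] args) {
        // Stand-in measurer that rejects U+200B, similar in spirit to the trace above.
        ToDoubleFunction<String> strictFont = s -> {
            if (s.indexOf(ZERO_WIDTH_SPACE) >= 0) {
                throw new IllegalArgumentException("U+200B is not available in this font");
            }
            return s.length() * 0.5; // pretend each glyph is 0.5 units wide
        };
        // No exception is thrown because the zero-width space never reaches the font.
        System.out.println(measure("ab\u200Bcd", strictFont));
    }
}
```

Whether stripping is safe depends on where the characters are inserted; if they mark line-break opportunities, they would need to be removed only for measuring and not from the text being laid out.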

rudolphi • Dec 06 '21 20:12

I'm trying to convert a 30 MB HTML file and it takes around 50 seconds. Is there any way to improve this?

mhmmdgamal • Feb 04 '24 09:02