Strange text overflow

Open arnKo opened this issue 2 years ago • 5 comments

We discovered the problem trying to render the following list of file extensions: .png, .jpg, .jpeg, .jpe, .gif, .zip, .doc, .docx, .docm, .xls, .xlsx, .xlsm, .ppt, .pptx, .pptm, .pps, .ppsx, .odt, .ods, .odp, .odf, .rtf, .pdf, .psd, .csv, .msg, .mp4, .webm, .xlf, .xliff

WeasyPrint renders the line without breaking it, even though it overflows the page. However, if we remove the dots from the extensions, the text is rendered correctly within the page's boundaries.

I added a minimal example below that generates the following PDF. The first paragraph is not broken at all. In the second paragraph, I removed the dot from the second extension (.jpg), and the line breaks. Only the third paragraph, which contains no dots, renders correctly.

[image: screenshot of the generated PDF]

from weasyprint import HTML

text = """
<html>
    <head>
        <style>
            p {
                border: 1px solid red;
            }
        </style>
    </head>
    <body>
        <p>
            .png, .jpg, .jpeg, .jpe, .gif, .zip, .doc, .docx, .docm, .xls, .xlsx, .xlsm, .ppt, .pptx, .pptm, .pps, .ppsx, .odt, .ods, .odp, .odf, .rtf, .pdf, .psd, .csv, .msg, .mp4, .webm, .xlf, .xliff
        </p>
        <p>
            .png, jpg, .jpeg, .jpe, .gif, .zip, .doc, .docx, .docm, .xls, .xlsx, .xlsm, .ppt, .pptx, .pptm, .pps, .ppsx, .odt, .ods, .odp, .odf, .rtf, .pdf, .psd, .csv, .msg, .mp4, .webm, .xlf, .xliff
        </p>
        <p>
            png, jpg, jpeg, jpe, gif, zip, doc, docx, docm, xls, xlsx, xlsm, ppt, pptx, pptm, pps, ppsx, odt, ods, odp, odf, rtf, pdf, psd, csv, msg, mp4, webm, xlf, xliff
        </p>
    </body>
</html>
"""

HTML(string=text).write_pdf("/tmp/weasyprint-test.pdf")

I first discovered this bug using version 57.2. I updated to 58.0, but the bug persists. I also tried different values for the CSS properties white-space, hyphens and word-break, but nothing changes.

arnKo avatar Feb 23 '23 16:02 arnKo

What a strange bug… 😢

liZe avatar Feb 24 '23 20:02 liZe

Browsers break lines as we might expect, after the commas. But LibreOffice forces a line break anywhere, which probably means that we can’t break after the comma according to the Unicode line break rules. I suppose that there’s an extra rule in the CSS specification to allow line breaks in this case. I’ll try to find references in Unicode and CSS to define exactly what we have to do.

liZe avatar Feb 25 '23 09:02 liZe

According to Unicode, we can’t break lines before commas, spaces and dots, that’s why WeasyPrint doesn’t break the line.

I didn’t find anything about this case in the W3C Typography specification. The CSS specification says that "CSS does not fully define where soft wrap opportunities occur", so technically it may not be a bug in WeasyPrint, but I’d be interested to know why browsers decided to split such lines.

liZe avatar Feb 25 '23 20:02 liZe

But shouldn't it work using overflow-wrap: anywhere?

At least according to the specs:

The overflow-wrap property allows the UA to take a break anywhere in otherwise-unbreakable strings that would otherwise overflow.
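
A minimal sketch of how this could be checked against the dotted list from the original report. The helper names `build_html` and `render` and the output path are only illustrative. The WeasyPrint import is kept inside `render` so that the markup can be inspected without WeasyPrint installed.

```python
# Dotted extension list from the original report.
EXTENSIONS = (
    ".png, .jpg, .jpeg, .jpe, .gif, .zip, .doc, .docx, .docm, .xls, "
    ".xlsx, .xlsm, .ppt, .pptx, .pptm, .pps, .ppsx, .odt, .ods, .odp, "
    ".odf, .rtf, .pdf, .psd, .csv, .msg, .mp4, .webm, .xlf, .xliff"
)


def build_html(paragraph_css):
    """Return the test page with extra CSS declarations applied to <p>."""
    return f"""
<html>
    <head>
        <style>
            p {{
                border: 1px solid red;
                {paragraph_css}
            }}
        </style>
    </head>
    <body>
        <p>{EXTENSIONS}</p>
    </body>
</html>
"""


def render(paragraph_css, target):
    """Write the test page to a PDF file (requires WeasyPrint)."""
    # Imported lazily so build_html() works without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=build_html(paragraph_css)).write_pdf(target)


html = build_html("overflow-wrap: anywhere;")
# Uncomment to generate the PDF and check whether the paragraph wraps:
# render("overflow-wrap: anywhere;", "/tmp/weasyprint-overflow-wrap.pdf")
```

Calling `render` with an empty string instead produces the unmodified page, which makes it easy to compare the two PDFs side by side.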

arnKo avatar Feb 27 '23 09:02 arnKo

But shouldn't it work using overflow-wrap: anywhere?

It works … but not in this example. There’s something really strange…

liZe avatar Mar 07 '23 13:03 liZe