WeasyPrint icon indicating copy to clipboard operation
WeasyPrint copied to clipboard

Incorrect line breaking with boxes

Open aschmitz opened this issue 3 years ago • 4 comments

At a high level, there are two separate issues that are currently colliding:

  1. We [almost] treat inline boxes (like <span>s) as if they were separate words for wrapping purposes.
  2. We don't currently [always] have a way to wrap them properly even if they were.

A minimal test document may be helpful:

<style>
  html { overflow-wrap: break-word; }
  div { background-color: #abcdef; width: 20em; }
</style>

<div>
  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa<span>bbbbbbbbbb</span>
</div>

In Firefox (much like Chrome), this wraps in the middle of the b's:

The sample document in Firefox

In WeasyPrint (as of the current master, 2f8d4999d70c9fa2ea26d7accc30a75481a22542), we never wrap:

Screenshot from 2022-03-24 20-54-06

I'm currently hitting this with some syntax-highlighted content that should wrap but instead runs off the page, for a sample real-world use case.

This appears to be happening because split_first_line is using the is_line_start parameter to determine whether something is the line's first word. It turns out that we set this a bit naïvely in split_inline_level such that it's only true if the box is the first box on the line rather than if it is in the first "otherwise unbreakable sequence of characters".

I suspect the "right" way to fix this is to return a value indicating whether there has been a space in the line so far or not, and setting is_line_start accordingly. That's probably straightforward enough, but I had expected that doing so would mean we wouldn't correctly backtrack if there were a space in the previous text box. This turns out to be false: even now, changing one of the "a"s to a space allows the line to break at that position, giving a more appropriate render (which also matches the major browsers' behavior). Clearly I've missed some detail of how line breaking works in that circumstance, and I'm hesitant to just pile on an additional patch until I understand the underlying behavior a bit better.

Unfortunately, while it's a neat puzzle that would be fun to resolve, I don't know if I'll be able to take the time to grok the whole line-breaking system in the immediate future (and certainly not today). If you have pointers as to what might be making the space in the earlier TextBox(?) allow breaking due to the combined line being too long, that would be much appreciated, and I can try to make a PR to resolve this issue. (The other open question right now is whether there's an easy cue for "this was breakable" inside split_first_line, but hopefully so.)

aschmitz avatar Mar 25 '22 02:03 aschmitz

I suspect the "right" way to fix this is to return a value indicating whether there has been a space in the line so far or not, and setting is_line_start accordingly.

Looks like a good idea. As spaces are not the only characters that allow line breaks, you probably want to use can_break_text.

even now, changing one of the "a"s to a space allows the line to break at that position, giving a more appropriate render

That’s right, this case is really common and is supposed to be correctly handled. The logic is in split_inline_box, where we keep unbreakable inline boxes in a temporary list (and also check that we can or can’t break between inline boxes).

Unfortunately, while it's a neat puzzle that would be fun to resolve, I don't know if I'll be able to take the time to grok the whole line-breaking system in the immediate future

The line-breaking algorithm is complex (should I say "dirty"?). Fixing this problem probably requires a very small amount of code, but we’ll need to spend a lot of time to dive into this code. And each time we change something, we break common use cases even if we already have countless tests!

Nevertheless, of course, now that this issue is reported, I can’t wait to fix it. I’ll try to spend some time on it "soon" (in April?).

liZe avatar Mar 30 '22 08:03 liZe