delta icon indicating copy to clipboard operation
delta copied to clipboard

Align expanded tab columns

Open dhduvall opened this issue 4 years ago • 8 comments

Vary the number of spaces tabs expand into (with --tabs > 0) so that text after the tab starts at the next tabstop and tab columns align.

Fixes #729.

dhduvall avatar Oct 27 '21 17:10 dhduvall

Given that this is clearly the Correct Way™ to handle tabs (otherwise all sorts of subtle alignment issues show up), running each line through your improved expand_tabs() should become the default.

To that end, why not optimize it a bit: Move the line.graphemes(true) call which precedes each expand_tabs() call into the function itself, make it take a &str (line[1..] in prepare(), maybe also move the format!(" {}\n", ..) to avoid creating yet another short lived String). Then check if any tabs need to be expanded: if not (or tab_width is 0) then no grapheme clusters need to be generated and iterated over.

Also, a minor improvement would be to not create a new String with " ".repeat(num_spaces) for each tabs but early on generate a tab_width-long String in the config and use substrings from that one.

th1000s avatar Oct 28 '21 22:10 th1000s

Great code review, thanks! A couple of questions, though:

  • Why is .graphemes() needed at all? The code—before or after my change—doesn't do anything with them, and just looks for tabs in the stream, which I would expect to show up in the same places whether we searched through the characters or through the graphemes; this is what I'm doing in my other PR, in ansi::measure_text_width()
  • In thinking about #522, I realized that while expanding tabs to spaces at this point in the process is fine, because the distance algorithm ignores whitespace, we can't expand them to non-spaces, because the distance algorithm can't know that <---> or whatever is actually a tab. So we either need to annotate the fact that we did some expansion, or delay the expansion until after or during the distance measurements. Before I go try to figure out how to make that happen, does that make sense to you?

dhduvall avatar Oct 29 '21 16:10 dhduvall

  • Because one byte or even one char inside a string does not equal a display width of one, that's why a Unicode library has to be asked to group them (i.e. .graphemes(), strip_ansi_codes(..).width() does something similar internally) so the tab stops are correct. Oh and a tab is always in its own group, so there is no need to call find() on the iterator element.

  • Indeed. The "correct" place is after the syntax highlighter ran (e.g. get_syntax_style_sections_for_lines() in Painter::paint_buffered_minus_and_plus_lines()). Then the the raw lines could either be fixed a) before the diff highlighting runs, or b) the diff output is fixed as well. A pure tabs-to-spaces replacement can take place where it does now (unless there is some language syntax where the highlighter produces different output, maybe Makefiles?). This function can also be reused if strategy a) is chosen.

th1000s avatar Oct 31 '21 23:10 th1000s

Why do we need to iterate over the graphemes? If a tab is guaranteed to be the one character in a grapheme, then we don't need to process the graphemes individually: whatever comes before or after the tab is going to be the same set of bytes/characters/graphemes after the tab replacement as it was before, no? So then we can just ask for the width of that &str in its entirety, and let the unicode library do the grapheme processing internally.

dhduvall avatar Nov 01 '21 04:11 dhduvall

If a tab is guaranteed to be the one character in a grapheme, then we don't need to process the graphemes individually: whatever comes before or after the tab is going to be the same set of bytes/characters/graphemes after the tab replacement as it was before, no?

I believe that's right -- as I understand things, by UTF-8 spec an ASCII byte such as \t does not occur in any UTF-8 multibyte characters, but I probably forgot that when I wrote the code, and thought that the only safe way to do it was to iterate over graphemes. @th1000s does that sound right?

dandavison avatar Nov 01 '21 14:11 dandavison

Yes, when not taking tabstops into account the graphemes can be ignored.

But with this patch the exact display width up to a tab must be known. That could be done by finding the next tab position, and then letting some other function calculate the width from the previous tab position (or its replacement whitespace) up to this tab position. Internally, that function will do the same grapheme iteration, but we we are stuck with more bookkeeping and str generation - so why not start with graphemes and just add them up?

th1000s avatar Nov 01 '21 21:11 th1000s

Just stropping by to say that this branch will be great to have. Let me know when you feel that it's wrapped up.

dandavison avatar Nov 22 '21 23:11 dandavison

Sorry; my life has been upside-down the past few weeks. I'll see if I can give it some attention this long weekend.

dhduvall avatar Nov 22 '21 23:11 dhduvall