textwrap icon indicating copy to clipboard operation
textwrap copied to clipboard

No explicit control over CR LF line ending

Open koiuo opened this issue 3 years ago • 6 comments
trafficstars

Helix editor recently added a feature to reflow a piece of text; helix uses textwrap for this

There's an issue, where reflow does not work correctly if CRLF sequence (\r\n) is used as a line terminator https://github.com/helix-editor/helix/issues/2645

helix does not support CR ending (pre macosx), but I guess, the issue would manifest with it as well.

As far as I can see, textwrap does not account for different line ending, there's no configuration option with which we could tune textwrap behavior.

Did I miss anything? Is there a way to support CRLF in textwrap?

If not, I'd be willing to contribute a fix. My initial idea is to introduce another option into the Options struct, named line_ending (or line_terminator as a fancier alternative). To keep things tight, I'd express it as enum (we can always create something like Custom should we want to support an arbitrary byte sequence later, but I highly doubt that). How does that sound?

koiuo avatar Jun 02 '22 22:06 koiuo

Hey @edio, thanks for opening the issue. Super exciting that textwrap is used in the editor!

You're not missing anything, textwrap is hard-coded to split on \n in textwrap::wrap. I see you're using textwrap::refill and that is also hard-coded to use \n.

I've been thinking of \n as the logical way to separate lines and \r\n as more of a fileformat encoding of the logical \n characters. That is, my opinion is that programs should use \n throughout their processing and then encode the string to the target file format (e.g., do a .replace("\n", "\r\n") call at the point where the data leaves the program).

Searching a bit, it seems that this way of thinking comes from how languages like C++ handle text mode files. Rust doesn't have a text mode, so I think my way of thinking is outdated :-D

Now, I see a few different approaches for handling this:

  • We could either push the problem back to the caller: demand that you use \n delimited string with Textwrap. That matches my way of thinking above. Easiest solution to implement for Textwrap :smile:
  • Alternatively, we could split on a custom line ending as you suggest.
  • Finally, we could perhaps also use .lines() internally and ask people to use textwrap::wrap (which return a Vec<&str>) instead of textwrap::fill if they want to use custom line endings.

Out of the options, I think the first two are the best. Since Textwrap tries to be useful for many use-cases, I will be happy to see a custom enum like you suggest. I agree that two enum variants ought to be enough for now. Since we split the input just once, I think the performance impact should be negligible, but please compare

% cargo criterion fill_optimal_fit_ascii/4800

before and after to make sure.

mgeisler avatar Jun 03 '22 16:06 mgeisler

Another thing: I think Helix is the first place which uses textwrap::refill for real. So if you run into problems with it, then please let me know. Most of the logic is in textwrap::unfill and it could use some real-world feedback since it's quite fuzzy what it should do on different input.

mgeisler avatar Jun 03 '22 16:06 mgeisler

@mgeisler , thanks for the response. I do not represent the helix dev team. I just had a couple of free evenings and wanted to find an interesting task to work on.

I considered transforming the input before feeding it to textwrap, but that seemed inefficient (and most of all cumbersome).

.lines() + ~textwrap::fill~textwrap:wrap seems like a nice and clean workaround in terms of code. ~That still leaves us with the default \n in the output, however (so it is still cumbersome, even if reduced by half 😁)~

Perhaps I should summon @kirawi here, who's been handling the issue in the helix repo and, I assume, familiar with the helix codebase (unlike me). @kirawi, please read Martin's comment https://github.com/mgeisler/textwrap/issues/453#issuecomment-1146168714 and share your opinion about the proposed solutions. Thank you!

EDIT: I misread. The proposal was to use lines() + textwrap::wrap. Works better for the client, but introduces a slight behavoir inconsistency for this lib: it handles any line ending in the input, but only outputs \n. Not sure if it's a problem.

koiuo avatar Jun 03 '22 17:06 koiuo

Works better for the client, but introduces a slight behavoir inconsistency for this lib: it handles any line ending in the input, but only outputs \n. Not sure if it's a problem.

My thinking was that the client (the caller of wrap) would use \n or \r\n as desired to put the lines back together. Perhaps the caller would use neither: if the lines are meant to be displayed in an editor (:wink:), then perhaps the lines are just sent one by one to the display engine.

Thanks for the pull request, I'll go take a look now!

mgeisler avatar Jun 05 '22 13:06 mgeisler

@mgeisler , I benchmarked wrap with lines() vs split('\n') on 4800 input, and consistently across multiple runs the lines() version is slower. lines() additionally removes \r in the end of the line after the split, so it being slower is expected. My laptop isn't tuned for benchmarking (variable CPU frequency, etc.), but the best case of split('\n') is 5% faster than the best case of lines().

IMO, if there's no explicit control over the line ending, lines() should be the default behavior. I'm not sure 5% performance gain is worth the reduced portability (esp. considering Rust targets Windows too), and preprocessing of input will kill the 5% gain if input has CRLF.

Anyway switching to lines() conflicts with the line_ending option. So either we do lines() and call it a day (this will solve the issue for helix), or we introduce the LineEnding.

If we do lines() now, we can still introduce the LineEnding later (in this sense it is a safe option). We'll just want to have a special LineEnding::Auto value. This would be the default, and a specific value like CRLF would give a slight performance boost.

koiuo avatar Jun 05 '22 21:06 koiuo

@mgeisler , I benchmarked wrap with lines() vs split('\n') on 4800 input, and consistently across multiple runs the lines() version is slower. lines() additionally removes \r in the end of the line after the split, so it being slower is expected. My laptop isn't tuned for benchmarking (variable CPU frequency, etc.), but the best case of split('\n') is 5% faster than the best case of lines().

IMO, if there's no explicit control over the line ending, lines() should be the default behavior. I'm not sure 5% performance gain is worth the reduced portability (esp. considering Rust targets Windows too), and preprocessing of input will kill the 5% gain if input has CRLF.

Anyway switching to lines() conflicts with the line_ending option. So either we do lines() and call it a day (this will solve the issue for helix), or we introduce the LineEnding.

If we do lines() now, we can still introduce the LineEnding later (in this sense it is a safe option). We'll just want to have a special LineEnding::Auto value. This would be the default, and a specific value like CRLF would give a slight performance boost.

I like what you've done in https://github.com/mgeisler/textwrap/pull/454: let the caller of wrap and friends specify what line ending to use both for splitting the lines and, in the case of fill, to use for joining the generated lines.

Let us continue discussing there!

mgeisler avatar Jun 06 '22 16:06 mgeisler

Fixed by #454, thanks!

mgeisler avatar Sep 17 '22 19:09 mgeisler

This was a semver breaking change since textwrap::Options is an exhaustive struct with pub fields. Code like:

fn textwrap_options() -> textwrap::Options<'static> {
	textwrap::Options {
		width: textwrap::termwidth(),
		initial_indent: "    ",
		subsequent_indent: "    ",
		break_words: true,
		wrap_algorithm: textwrap::WrapAlgorithm::OptimalFit(Default::default()),
		word_separator: textwrap::WordSeparator::UnicodeBreakProperties,
		word_splitter: textwrap::WordSplitter::NoHyphenation,
	}
}

breaks when updating from v0.15.0 to v0.15.1

Arnavion avatar Oct 18 '22 02:10 Arnavion

Hi @Arnavion, oh no, I didn't think of that... Thanks for noticing! I'll yank the release and redo the release on the weekend.

I'm curious through, do you have code like that? Why would you write it like that when Options come with a built-in builder?

mgeisler avatar Oct 18 '22 18:10 mgeisler

I'm curious through, do you have code like that?

Yes. It failed to compile yesterday and that's why I started investigating and reached here.

Why would you write it like that when Options come with a built-in builder?

I wanted to initialize all the fields with values of my choice, so the builder would be more verbose than setting the fields directly. Also, having a struct with pub fields is an invitation to let users set those fields manually, ie it's an explicit signal that it's okay to set those fields.

Arnavion avatar Oct 18 '22 18:10 Arnavion

an invitation to let users set those fields manually, ie it's an explicit signal that it's okay to set those fields.

Yeah, I get that — I made them public to make it easy to reassign a few fields when needed. I just didn't expect anybody to actually use the struct literal syntax itself.

Thanks for letting me know, I'll fix it as soon as I can.

mgeisler avatar Oct 18 '22 18:10 mgeisler

Yeah, in that case you can make it #[non_exhaustive] to prevent this from happening in the future.

Arnavion avatar Oct 18 '22 18:10 Arnavion

Thanks @Arnavion, I've yanked 0.15.1 and released 0.16.0 instead.

mgeisler avatar Oct 23 '22 19:10 mgeisler

Please unyank 0.15.1, clap 3.2.22 explicitly depends on it.

nox avatar Oct 24 '22 07:10 nox

https://github.com/clap-rs/clap/issues/4418

nox avatar Oct 24 '22 07:10 nox

0.15.0 should have been rereleased as 0.15.2 to make sure downstream users explicitly depending on 0.15.1 don't get stuck.

nox avatar Oct 24 '22 07:10 nox

If any crate depends on ^0.15.1 explicitly, it would normally be doing so because it depends on API newly added in 0.15.1, so it wouldn't work with 0.15.0 released as 0.15.2. So doing what you suggest would be pointless in general.

That said, clap 3 seems to have updated to 0.15.1 without depending on the change in 0.15.1 ( https://github.com/clap-rs/clap/commit/2fd55076e65b14fa92b34161c44d5058a9966f69 ), so it would indeed work with 0.15.0 re-released as 0.15.2. Why it updated its dep for no reason is a mystery...

Arnavion avatar Oct 24 '22 09:10 Arnavion

If any crate depends on ^0.15.1 explicitly, it would normally be doing so because it depends on API newly added in 0.15.1, so it wouldn't work with 0.15.0 released as 0.15.2. So doing what you suggest would be pointless in general.

This is wrong. If I add a dependency to my system I always take the biggest version number, because then that enforces I don't rely on a bugfix without doing so explicitly, and it also circumvents issues with crates that don't believe in minor version bumps (e.g. serde). Anyway it's good practice to always rerelease a bigger version number whenever a version is yanked to restore whatever was before.

nox avatar Oct 24 '22 12:10 nox

That's fine, but I hope you defensively bump every dep you have in every commit you make, no matter how unrelated it is. After all you might have added foo dep at its latest version last week, but your clean build today might be relying on a new bugfix it released yesterday. Otherwise you're not achieving the safety you desire :)

(The only way this can be solved for sure is with -Z minimal-versions.)

Anyway, this is not the place to argue about it, and I agree that there's no harm in releasing 0.15.2.

Arnavion avatar Oct 24 '22 12:10 Arnavion

Hi @Arnavion and @nox Thanks for letting me know about this! I did not anticipate such a deadlock when I yanked the release... Let's discuss further fixes in #484.

mgeisler avatar Oct 24 '22 14:10 mgeisler