textwrap icon indicating copy to clipboard operation
textwrap copied to clipboard

Handle all Unicode whitespace for UnicodeBreakProperties.find_words

Open phy1729 opened this issue 11 months ago • 4 comments

Currently UnicodeBreakProperties.find_words("foo \nbar") results in two Words where the first has word: "foo \n", whitespace: "". I think it makes more sense when following the Unicode line break algorithm to include all whitespace not just spaces as UAX 14 breaks after a newline.

I left Word::from_unicode pub(crate) to not change the public API.

phy1729 avatar Dec 20 '24 20:12 phy1729

Hi @phy1729,

Thanks for the PR and happy new year!

Currently UnicodeBreakProperties.find_words("foo \nbar") results in two Words where the first has word: "foo \n", whitespace: "". I think it makes more sense when following the Unicode line break algorithm to include all whitespace not just spaces as UAX 14 breaks after a newline.

I see... I think the API I made is a bit clumsy or misleading. Textwrap is generally meant to wrap a single paragraph of text at a time. This is reflected in how wrap and fill will split on newlines and wrap each line independently.

However, there is of course nothing about this in the documentation for Word itself — the only hint is that UnicodeBreakProperties talk about line here and there.

In general, I was hoping that Word would be a trivial little struct that others could replace as needed. So you should not feel constrained by how it works. An example of this is in the Wasm demo:

https://github.com/mgeisler/textwrap/blob/c9bd8b0b807b1b62e388e5aeb9a3d7f3276cff84/examples/wasm/src/lib.rs#L44-L51

However, I realize that this idea isn't how people want to use the current API and I would like to improve on this.

mgeisler avatar Jan 12 '25 18:01 mgeisler

Looking at this some more, I think the output between the AsciiSpace and UnicodeBreakProperties enum variants ought to be aligned for ASCII input. This is not at all the case today, where this little program

use textwrap::core::Word;
use textwrap::WordSeparator::{AsciiSpace, UnicodeBreakProperties};

fn main() {
    let text = "foo   \nbar";
    dbg!(UnicodeBreakProperties.find_words(text).collect::<Vec<_>>());
    dbg!(AsciiSpace.find_words(text).collect::<Vec<_>>());

gives me

[examples/tmp.rs:7:5] UnicodeBreakProperties.find_words(text).collect::<Vec<_>>() = [
    Word {
        word: "foo   \n",
        whitespace: "",
        penalty: "",
        width: 6,
    },
    Word {
        word: "bar",
        whitespace: "",
        penalty: "",
        width: 3,
    },
]
[examples/tmp.rs:8:5] AsciiSpace.find_words(text).collect::<Vec<_>>() = [
    Word {
        word: "foo",
        whitespace: "   ",
        penalty: "",
        width: 3,
    },
    Word {
        word: "\nbar",
        whitespace: "",
        penalty: "",
        width: 3,
    },
]

mgeisler avatar Jan 12 '25 18:01 mgeisler

Can I ask what your use-case is for constructing words yourself? Also, have you tried the Custom variant?

I recently realized that the constructor for Word is very limiting since it doesn't allow you to set the word, penalty and whitespace yourself — and you cannot touch the cached length. This ought to be reworked, I'm afraid.

mgeisler avatar Jan 12 '25 18:01 mgeisler

I'm using the result of find_words to find the widest word in each column of a table to set a minimum width. https://github.com/phy1729/zxcv/blob/8764f7ad0ce3b19e16fa922cb2f4163b454ef6a1/src/html/table.rs#L62-L66 . So I'm not constructing words myself, I was just surprised that there was trailing whitespace in the word member which caused some incorrect minimums.

I don't think the words for AsciiSpace and UnicodeBreakProperties can be aligned even for just ASCII inputs without an API break as UAX 14 mandates the break happen after the newline and AsciiSpace as its name implies only considers spaces. If it were AsciiWhitespace and char::is_ascii_whitespace used, I think the words would be the same, but I haven't done a careful reading of the standard to be sure.

phy1729 avatar Jan 14 '25 05:01 phy1729