coreutils `ptx`: Do not use regex when -W option is not provided

Open allan-silva opened this issue 2 years ago • 4 comments

ptx relies on regex even when the -W option is not provided. https://github.com/uutils/coreutils/blob/65467ab317d885e14385aa0ff6bb55720160cc88/src/uu/ptx/src/ptx.rs#L118-L141

ptx should break words based on the 8-bit charset when GNU extensions are enabled, or on the [' ', '\t', '\n'] characters when GNU extensions are disabled, and it should not use regex when -W is not specified.
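A minimal sketch of regex-free word breaking along these lines, not the uutils or GNU implementation. The function name `split_words` and the `gnu_extensions` flag are hypothetical, and `char::is_alphabetic` is only an assumed stand-in for the 8-bit charset letter test:

```rust
/// Split `input` into words without using a regular expression.
///
/// `gnu_extensions` is a hypothetical flag for this sketch:
/// - true:  a word is a maximal run of letters (assumption:
///          `char::is_alphabetic` approximates the charset letter test);
/// - false: a word is a maximal run of anything except ' ', '\t', '\n'.
fn split_words(input: &str, gnu_extensions: bool) -> Vec<&str> {
    // Decide per character whether it may be part of a word.
    let is_word_char = |c: char| {
        if gnu_extensions {
            c.is_alphabetic()
        } else {
            !matches!(c, ' ' | '\t' | '\n')
        }
    };
    input
        // Every non-word character acts as a separator.
        .split(|c: char| !is_word_char(c))
        // Adjacent separators produce empty pieces; drop them.
        .filter(|w| !w.is_empty())
        .collect()
}
```

With the flag set, `foo-bar` would break into two words at the hyphen; without it, only spaces, tabs and newlines end a word.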

Ref: GNU ptx

Feb 15 '22 11:02 allan-silva

I realized this while working on #1763. I think it is better to implement this first.

Feb 15 '22 11:02 allan-silva

Just for completeness, before this gets closed: the word-regexp logic in the original code is correct. From the GNU ptx docs:

By default, if GNU extensions are enabled, a word is a sequence of letters; the regexp used is ‘\w+’. When GNU extensions are disabled, a word is by default anything which ends with a space, a tab or a newline; the regexp used is ‘[^ \t\n]+’.

What the code does is set the default regex used to identify a word.

Apr 29 '22 08:04 mike-kfed

@mike-kfed this issue is about how ptx finds and processes words. The current GNU ptx code doesn't use a regular expression to find a word; it uses a hashmap instead.
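A table-driven lookup of this kind could be sketched as below. This is an illustration only, not the GNU source: the names `build_word_table` and `word_ranges` are hypothetical, and the GNU-extensions branch assumes ASCII letters only, where a real 8-bit charset would also mark its accented letters.

```rust
/// Build a 256-entry table marking which byte values may appear in a word.
/// `gnu_extensions` is a hypothetical flag for this sketch.
fn build_word_table(gnu_extensions: bool) -> [bool; 256] {
    let mut table = [false; 256];
    if gnu_extensions {
        // Assumption: only ASCII letters count as word bytes here.
        for b in 0..=255u8 {
            table[usize::from(b)] = b.is_ascii_alphabetic();
        }
    } else {
        // Everything is a word byte except space, tab and newline.
        table = [true; 256];
        for &b in &[b' ', b'\t', b'\n'] {
            table[usize::from(b)] = false;
        }
    }
    table
}

/// Return the half-open (start, end) byte ranges of every word in `input`.
fn word_ranges(input: &[u8], table: &[bool; 256]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &b) in input.iter().enumerate() {
        match (table[usize::from(b)], start) {
            // A word byte while outside a word opens a new word.
            (true, None) => start = Some(i),
            // A non-word byte while inside a word closes it.
            (false, Some(s)) => {
                ranges.push((s, i));
                start = None;
            }
            _ => {}
        }
    }
    // Close a word that runs to the end of the input.
    if let Some(s) = start {
        ranges.push((s, input.len()));
    }
    ranges
}
```

The table is built once and each input byte then costs a single array index, which is the usual reason to prefer this over a regex for a fixed character class.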

See the Ref link in the first post.

Apr 29 '22 15:04 allan-silva

Yes, in the original C code that is correct, and it was probably done for performance reasons. However, this can be solved with a regular expression too, and even though I didn't write the original Rust translation, I assume it was done with regex because that is easier to write. The project only strives to be compatible, not to be an exact copy. (Also, IMO it does not really matter how performant ptx is; it's virtually unused in the real world today.)

However, I removed the reference to this issue from my PR; it is not for me to decide what goes into uutils. Side note: this discussion made me realise that I have a fat bug in my breakfile code that I will fix now, thanks :)

Apr 30 '22 07:04 mike-kfed
