coreutils
`ptx`: Do not use regex when the -W option is not provided
`ptx` relies on regex even when the -W option is not provided.
https://github.com/uutils/coreutils/blob/65467ab317d885e14385aa0ff6bb55720160cc88/src/uu/ptx/src/ptx.rs#L118-L141
`ptx` should break words based on the 8-bit charset when GNU extensions are enabled, or based on the [' ', '\t', '\n'] chars when GNU extensions are disabled, but it should not use regex when -W is not specified.
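A minimal sketch of what regex-free word breaking could look like, assuming a 256-entry byte lookup table. The names `build_word_table`, `find_words`, and the `gnu_ext` flag are hypothetical and are not taken from the uutils or GNU sources. Treating only ASCII letters as word bytes in GNU mode is also an assumption; a real 8-bit charset would mark more bytes.

```rust
/// Build a 256-entry lookup table marking which bytes belong to a word.
/// `gnu_ext` is a hypothetical flag standing in for the GNU-extensions switch.
fn build_word_table(gnu_ext: bool) -> [bool; 256] {
    let mut table = [false; 256];
    for b in 0..=255u8 {
        table[b as usize] = if gnu_ext {
            // Assumption: only ASCII letters count as word bytes here.
            b.is_ascii_alphabetic()
        } else {
            // GNU extensions disabled: anything but space, tab, or newline.
            !matches!(b, b' ' | b'\t' | b'\n')
        };
    }
    table
}

/// Return the (start, end) byte offsets of every word in `line`,
/// using only table lookups (no regex).
fn find_words(line: &[u8], table: &[bool; 256]) -> Vec<(usize, usize)> {
    let mut words = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &b) in line.iter().enumerate() {
        match (table[b as usize], start) {
            // A word byte while outside a word: open a new word.
            (true, None) => start = Some(i),
            // A non-word byte while inside a word: close the word.
            (false, Some(s)) => {
                words.push((s, i));
                start = None;
            }
            _ => {}
        }
    }
    // Close a word that runs to the end of the line.
    if let Some(s) = start {
        words.push((s, line.len()));
    }
    words
}
```

With the non-GNU table, `foo-bar` is a single word because `-` is not a space, tab, or newline. With the GNU table in this sketch, it splits into `foo` and `bar`.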
Ref: GNU ptx
I realized this while working on #1763; I think it is better to implement this first.
Just for completeness, before this gets closed: the word-regexp logic in the original code is correct. From the GNU ptx docs:
By default, if GNU extensions are enabled, a word is a sequence of letters; the regexp used is ‘\w+’. When GNU extensions are disabled, a word is by default anything which ends with a space, a tab or a newline; the regexp used is ‘[^ \t\n]+’.
What the code does is set the default regex used to identify a word.
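As a rough illustration of that default selection, here is a sketch that picks the pattern string from the two cases quoted above. The function name `word_regex_pattern` and its parameters are hypothetical and do not mirror the actual uutils code. The returned string would then be handed to whatever regex engine is in use.

```rust
/// Pick the pattern that defines a "word".
/// `user_regex` models the value passed via -W, if any;
/// `gnu_ext` models whether GNU extensions are enabled.
fn word_regex_pattern(user_regex: Option<&str>, gnu_ext: bool) -> String {
    match user_regex {
        // An explicit -W pattern always wins.
        Some(re) => re.to_string(),
        // Default with GNU extensions: a sequence of letters.
        None if gnu_ext => r"\w+".to_string(),
        // Default without GNU extensions: anything up to a space, tab, or newline.
        None => r"[^ \t\n]+".to_string(),
    }
}
```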
@mike-kfed this issue is about how `ptx` finds and processes words. The current ptx code doesn't use a regular expression to find a word; it uses a hashmap instead.
See the Ref link in the first post.
Yes, in the original C code that is correct, and it was probably done for performance reasons. However, this can be solved with a regular expression too and, even though I didn't write the original Rust translation, I assume it was done with regex because that is easier to write. The project only strives to be compatible, not to be an exact copy. (Also, IMO it does not really matter how performant ptx is; it's virtually unused in the real world today.)
However, I removed the reference to this issue from my PR; it is not for me to decide what goes into uutils. Side note: this discussion made me realise that I have a fat bug in my breakfile code that I will fix now, thanks :)