coreutils
feat: implement `tr`
This pull request implements the `tr` command, which is documented in section 9.1, "tr: Translate, squeeze, and/or delete characters", of the GNU coreutils manual.
One way this implementation of `tr` differs from GNU coreutils is in how it handles each character. GNU coreutils processes input byte for byte and does not account for multibyte codepoints. This implementation processes input codepoint for codepoint. Here are some examples.
`echo old | tr old new`: Translating a single-byte codepoint to another single-byte codepoint works the same in this implementation and in the GNU coreutils implementation.
| | Input | Output (V coreutils) | Output (GNU coreutils) |
|---|---|---|---|
| Codepoints | old | new | new |
| Bytes | | | |
`echo ! | tr ! ❗`: Translating a single-byte codepoint into a multibyte codepoint. The GNU implementation emits only the first byte of the replacement (226), so programs that expect UTF-8 encoded data might render the replacement character � (U+FFFD) in its place to represent the incomplete sequence.
| | Input | Output (V coreutils) | Output (GNU coreutils) |
|---|---|---|---|
| Codepoints | ! | ❗ | � |
| Bytes | | | |
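The difference between the two rows can be reproduced with a short Python sketch. This is only an illustration, not code from this pull request or from GNU coreutils: the helper names `tr_codepoints` and `tr_bytes` are hypothetical, and the byte-wise version is a simplified model that just pairs the bytes of both sets in order.

```python
def tr_codepoints(text: str, set1: str, set2: str) -> str:
    """Translate codepoint by codepoint (hypothetical helper)."""
    # Pair the n-th codepoint of set1 with the n-th codepoint of set2.
    table = {ord(a): b for a, b in zip(set1, set2)}
    return text.translate(table)


def tr_bytes(data: bytes, set1: bytes, set2: bytes) -> bytes:
    """Translate byte by byte (simplified model of a byte-oriented tr)."""
    # Pair the n-th byte of set1 with the n-th byte of set2; surplus bytes
    # in the longer set are ignored in this simplified model.
    mapping = dict(zip(set1, set2))
    return bytes(mapping.get(b, b) for b in data)


EXCLAMATION = "\u2757"  # the multibyte codepoint used in the example
line = "!\n"            # what `echo !` writes

codepoint_result = tr_codepoints(line, "!", EXCLAMATION)
byte_result = tr_bytes(line.encode("utf-8"), b"!", EXCLAMATION.encode("utf-8"))

print(list(codepoint_result.encode("utf-8")))  # [226, 157, 151, 10]
print(list(byte_result))                       # [226, 10]
# A UTF-8 decoder sees a lone lead byte and substitutes U+FFFD.
print(repr(byte_result.decode("utf-8", errors="replace")))  # '\ufffd\n'
```

In the byte-wise model the single byte of `!` is mapped to 226 alone, which is why the output is not valid UTF-8 on its own.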
If replacing entire codepoints is undesirable, it's very easy to revert to the byte-for-byte behavior.
As of now, this is what is implemented:
- [ ] `tr <arg1> <arg2>`
- [X] `tr -d <arg>`, `tr --delete <arg>`
- [ ] `tr -t <arg1> <arg2>`, `tr --truncate-set1 <arg1> <arg2>`
- [ ] `tr -s <arg1> <arg2>`, `tr --squeeze-repeats <arg1> <arg2>`
- [ ] `tr -c <arg1> <arg2>`, `tr --complement <arg1> <arg2>`
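For readers unfamiliar with these options, the following Python sketch models what delete, squeeze, and truncate-set1 do when applied per codepoint. It is an assumption-laden illustration rather than this pull request's code: the function names are hypothetical, and ranges, character classes, and escape sequences in the sets are not handled.

```python
def delete(text: str, set1: str, complement: bool = False) -> str:
    """Model of `tr -d SET1`; with complement, keep only SET1 members."""
    members = set(set1)
    # `!=` acts as XOR: drop members normally, drop non-members with -c.
    return "".join(ch for ch in text if (ch in members) == complement)


def squeeze(text: str, set1: str) -> str:
    """Model of `tr -s SET1`: collapse runs of a repeated SET1 codepoint."""
    members = set(set1)
    out = []
    for ch in text:
        if out and out[-1] == ch and ch in members:
            continue  # same codepoint as the previous one, skip the repeat
        out.append(ch)
    return "".join(out)


def translate(text: str, set1: str, set2: str, truncate: bool = False) -> str:
    """Model of `tr SET1 SET2`, optionally with `-t`."""
    if truncate:
        # -t: shorten set1 to the length of set2.
        set1 = set1[: len(set2)]
    elif set2 and len(set2) < len(set1):
        # GNU default: pad set2 with its last character.
        set2 = set2 + set2[-1] * (len(set1) - len(set2))
    return text.translate({ord(a): b for a, b in zip(set1, set2)})


print(delete("hello", "l"))                        # heo
print(squeeze("aaabbb", "a"))                      # abbb
print(translate("abc", "abc", "x"))                # xxx
print(translate("abc", "abc", "x", truncate=True)) # xbc
```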
The project goal is to re-implement the behavior of the GNU coreutils tools as closely as possible, including error messages, etc.
However, a new option could be added, to allow proper handling of multi-byte chars.
Nothing says we can't have options they don't, just that the default, and any options they do have, should act as close to theirs as possible.
A new option will make it harder to test against the target. The current infrastructure is set up to expect the same outputs, help screens, etc. I think that new options are nice, but perhaps they should be added at a later date, when we have more coverage and improved infrastructure that can allow it.