Adding support for operators
I am thinking of making a PR to add support for operators to shlex. This would follow the same model as Python's shlex, where the caller specifies which characters they want to treat as operators. The lexer then ensures that each unquoted word consists entirely of operator characters or entirely of non-operator characters (inside a quoted string they can still be mixed). The operators are not interpreted; they simply form a separate set of words.
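To make the proposed behaviour concrete, here is a rough sketch of the splitting rule. It is not shlex's code: `split_with_operators` is a made-up name, it iterates by `char` rather than by byte, and it only handles whitespace and double quotes (no escapes, single quotes or comments).

```rust
/// Hypothetical helper, not part of shlex: splits `input` into words,
/// keeping unquoted runs of operator characters separate from other text.
/// Returns `None` if a double quote is left unterminated.
fn split_with_operators(input: &str, operators: &[char]) -> Option<Vec<String>> {
    let mut words = Vec::new();
    let mut current = String::new();
    let mut has_word = false; // `current` holds a word (possibly empty, e.g. "")
    let mut in_operator = false; // `current` is an unquoted run of operators
    let mut in_quotes = false;

    for ch in input.chars() {
        if in_quotes {
            // Inside quotes everything is literal, operators included.
            if ch == '"' {
                in_quotes = false;
            } else {
                current.push(ch);
            }
            continue;
        }
        if ch.is_whitespace() {
            // Unquoted whitespace ends the current word, if there is one.
            if has_word {
                words.push(std::mem::take(&mut current));
                has_word = false;
            }
            in_operator = false;
            continue;
        }
        // A quote character always starts or continues a non-operator word.
        let is_op = ch != '"' && operators.contains(&ch);
        // Switching between operator and non-operator text ends the word.
        if has_word && is_op != in_operator {
            words.push(std::mem::take(&mut current));
        }
        has_word = true;
        in_operator = is_op;
        if ch == '"' {
            in_quotes = true;
        } else {
            current.push(ch);
        }
    }

    if in_quotes {
        return None;
    }
    if has_word {
        words.push(current);
    }
    Some(words)
}
```

With `;` as the only operator, this sketch splits unquoted `foo;bar` into `foo`, `;`, `bar`, and keeps quoted `"foo;bar"` as one word.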
However, the current optimisation in shlex, where the input string is iterated by byte rather than by codepoint, gets in the way of this. To support any Unicode codepoint as an operator, the lexer has to receive potentially multibyte characters in one go, not individual UTF-8 high bytes. Alternatively, the caller could be asked to specify operator characters as bytes, but this brings its own safety problems: what if the user specifies a high byte as an operator character?
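A small illustration of the high-byte problem, using only the standard library. Both helpers are hypothetical and exist just for this example. `→` (U+2192) and `€` (U+20AC) share the lead byte `0xE2`, so a byte-level lexer told to treat `0xE2` as an operator would also cut into the middle of `€`.

```rust
/// Hypothetical helper: returns the UTF-8 encoding of a single char.
fn utf8_bytes(ch: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    ch.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical byte-level splitter: cuts `input` at every byte equal to
/// `op`, the way a byte-oriented lexer with a byte operator might.
fn split_at_byte(input: &str, op: u8) -> Vec<Vec<u8>> {
    input
        .as_bytes()
        .split(|&b| b == op)
        .map(|piece| piece.to_vec())
        .collect()
}
```

For example, `utf8_bytes('→')` is `[0xE2, 0x86, 0x92]`, and `split_at_byte("a€b", 0xE2)` leaves a second piece `[0x82, 0xAC, b'b']` that is no longer valid UTF-8, so it could not be returned as a `String`.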
Maybe the best approach here would be to use a separate implementation without that optimization, but only in the case where multibyte characters are relevant.
So alongside the current Shlex type, a new ShlexOperators type? It's doable, but I fear that would duplicate a lot of code, and make maintenance harder.
Could you tell me what the use-case for this is? Maybe we can come up with a better API then.
In my case, I want the ; operator to be returned as its own token as part of the lexing process. Right now, it just gets included in adjacent words, so that, for example, both unquoted foo;bar and quoted "foo;bar" get returned by Shlex as the single word foo;bar. In the shell, only the latter would be lexed as one token, while the former would give three: foo, ;, and bar. Adding operator support to Shlex would allow more of the shell language to be used, and bring it closer in behaviour to its Python namesake.
Sorry, I forgot about this. I think it's a good idea, though obviously the question of how to implement it is a concern.
> So alongside the current Shlex type, a new ShlexOperators type? It's doable, but I fear that would duplicate a lot of code, and make maintenance harder.
I was thinking of keeping a single public API but internally switching to a different implementation if and only if at least one of the operators is multibyte.
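Roughly something like the following, as a sketch only. `Strategy` and `choose_strategy` are invented names, not existing shlex items. The idea is that ASCII bytes never appear inside a multibyte UTF-8 sequence, so the byte-wise fast path stays safe whenever every operator is ASCII.

```rust
/// Hypothetical internal choice between the two implementations.
#[derive(Debug, PartialEq)]
enum Strategy {
    /// Every operator is ASCII: keep iterating the input by byte.
    Bytes(Vec<u8>),
    /// At least one operator is multibyte: iterate the input by char.
    Chars(Vec<char>),
}

/// Hypothetical selector: `operators` is the caller-supplied set of
/// operator characters, passed as a string.
fn choose_strategy(operators: &str) -> Strategy {
    if operators.is_ascii() {
        Strategy::Bytes(operators.bytes().collect())
    } else {
        Strategy::Chars(operators.chars().collect())
    }
}
```

The public constructor would call something like this once, so callers who only use ASCII operators such as `;`, `&` and `|` would keep the current performance.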