brigadier [BUG] `StringReader.readUnquotedString()` does not support non-ASCII characters

I'm the maintainer of Chimera, a library that allows the Brigadier command framework to be used in Spigot plugins. Recently, a developer opened an interesting issue in which they reported that argument parsing failed when non-ASCII characters were specified.

Sample error

Looking into the issue, the cause of the argument parsing failing for non-ASCII characters can be traced to:

    // StringArgumentType's parse() method
    @Override
    public String parse(final StringReader reader) throws CommandSyntaxException {
        if (type == StringType.GREEDY_PHRASE) {
            final String text = reader.getRemaining();
            reader.setCursor(reader.getTotalLength());
            return text;
        } else if (type == StringType.SINGLE_WORD) {
            return reader.readUnquotedString(); // <-- calls this
        } else {
            return reader.readString();
        }
    }
    
    // StringReader's readUnquotedString() method
    public String readUnquotedString() {
        final int start = cursor;
        while (canRead() && isAllowedInUnquotedString(peek())) { // <-- calls this
            skip();
        }
        return string.substring(start, cursor);
    }
    
    // Cause of failure
    public static boolean isAllowedInUnquotedString(final char c) {
        return c >= '0' && c <= '9'
            || c >= 'A' && c <= 'Z'
            || c >= 'a' && c <= 'z'
            || c == '_' || c == '-'
            || c == '.' || c == '+';
    }

This affects StringArgumentType.word(), StringArgumentType.string(), StringReader.readString() and StringReader.readUnquotedString() and any other dependencies on these classes/methods.

Maybe I'm missing some context but it seems weird to only allow a small subset of ASCII characters. In my opinion, this implementation of StringReader.readUnquotedString() is extremely wonky and should be refactored to support non-ASCII characters.

In the interim, I decided to create a simple utility method that reads strings until a whitespace is encountered. It behaves similarly to StringReader.readUnquotedString() while supporting special characters.

Source

    public static String unquoted(StringReader reader) {
        var start = reader.getCursor();
        while (reader.canRead() && reader.peek() != ' ') {
            reader.skip();
        }
        
        return reader.getString().substring(start, reader.getCursor());
    }

Nov 07 '21 12:11 Pante

Unquoted strings are intentionally only a certain set of characters, anything beyond those are considered "special characters" and require quoted strings

This is to avoid unquoted strings greedily parsing things that are meant to be parsed as other syntax

Nov 07 '21 21:11 vdvman1

Unquoted strings are intentionally only a certain set of characters, anything beyond those are considered "special characters" and require quoted strings

This is to avoid unquoted strings greedily parsing things that are meant to be parsed as other syntax

Would you mind elaborating on the second point please? I can't think of a case where parsing special characters may unintentionally greedily parse arguments. Even if that was the case, it seems like a abuse of StringReader.readStringQuote() where StringReader.readStringUntil(char) should be used instead.

It also seems extremely unintuitive that "топчик" will be rejected, it doesn't really contain any "special characters" that will need to be avoided to prevent greedy parsing.

Nov 08 '21 03:11 Pante

I agree that it should be expanded to better support international text.

Unquoted strings are used in SNBT values in compounds, where the character after could be either a , for another key:value pair, or a } for ending the compound. That could indeed be fixed by a more general readStringUntil that can accept multiple characters, this is just the reason for the original definition of readUnquotedString

Nov 08 '21 11:11 vdvman1

I believe that it is necessary to add support for at least Cyrillic, since many servers in post-Soviet countries use Cyrillic commands and a lot of players complain that they have to write " characters so that the commands works

Nov 08 '21 14:11 ghost