[BUG] `StringReader.readUnquotedString()` does not support non-ASCII characters
I'm the maintainer of Chimera, a library that allows the Brigadier command framework to be used in Spigot plugins. Recently, a developer opened an interesting issue in which they reported that argument parsing failed when non-ASCII characters were specified.

Looking into the issue, the cause of the argument parsing failing for non-ASCII characters can be traced to:
// StringArgumentType's parse() method
@Override
public String parse(final StringReader reader) throws CommandSyntaxException {
if (type == StringType.GREEDY_PHRASE) {
final String text = reader.getRemaining();
reader.setCursor(reader.getTotalLength());
return text;
} else if (type == StringType.SINGLE_WORD) {
return reader.readUnquotedString(); // <-- calls this
} else {
return reader.readString();
}
}
// StringReader's readUnquotedString() method
public String readUnquotedString() {
final int start = cursor;
while (canRead() && isAllowedInUnquotedString(peek())) { // <-- calls this
skip();
}
return string.substring(start, cursor);
}
// Cause of failure
public static boolean isAllowedInUnquotedString(final char c) {
return c >= '0' && c <= '9'
|| c >= 'A' && c <= 'Z'
|| c >= 'a' && c <= 'z'
|| c == '_' || c == '-'
|| c == '.' || c == '+';
}
This affects StringArgumentType.word(), StringArgumentType.string(), StringReader.readString() and StringReader.readUnquotedString() and any other dependencies on these classes/methods.
Maybe I'm missing some context but it seems weird to only allow a small subset of ASCII characters. In my opinion, this implementation of StringReader.readUnquotedString() is extremely wonky and should be refactored to support non-ASCII characters.
In the interim, I decided to create a simple utility method that reads strings until a whitespace is encountered. It behaves similarly to StringReader.readUnquotedString() while supporting special characters.
public static String unquoted(StringReader reader) {
var start = reader.getCursor();
while (reader.canRead() && reader.peek() != ' ') {
reader.skip();
}
return reader.getString().substring(start, reader.getCursor());
}
Unquoted strings are intentionally only a certain set of characters, anything beyond those are considered "special characters" and require quoted strings
This is to avoid unquoted strings greedily parsing things that are meant to be parsed as other syntax
Unquoted strings are intentionally only a certain set of characters, anything beyond those are considered "special characters" and require quoted strings
This is to avoid unquoted strings greedily parsing things that are meant to be parsed as other syntax
Would you mind elaborating on the second point please? I can't think of a case where parsing special characters may unintentionally greedily parse arguments. Even if that was the case, it seems like a abuse of StringReader.readStringQuote() where StringReader.readStringUntil(char) should be used instead.
It also seems extremely unintuitive that "топчик" will be rejected, it doesn't really contain any "special characters" that will need to be avoided to prevent greedy parsing.
I agree that it should be expanded to better support international text.
Unquoted strings are used in SNBT values in compounds, where the character after could be either a , for another key:value pair, or a } for ending the compound.
That could indeed be fixed by a more general readStringUntil that can accept multiple characters, this is just the reason for the original definition of readUnquotedString
I believe that it is necessary to add support for at least Cyrillic, since many servers in post-Soviet countries use Cyrillic commands and a lot of players complain that they have to write " characters so that the commands works