RFC: extend identifiers to support non-ASCII characters.

Open jarlestabell opened this issue 8 years ago • 12 comments

I like the concept of GraphQL, but I'm really surprised by the ASCII limitation in the spec. Many years ago that limitation made sense, but today it sounds like a blast from the past, when lots of standards and tools only catered to English-speaking users. Nowadays many database engines etc. support non-ASCII characters in field and table names, making it a lesser experience for users of tools like graphiql if the names they see there have to be "escaped/uglified" versions of the real names.

jarlestabell avatar Jan 04 '17 15:01 jarlestabell

From the spec:

Names in GraphQL are limited to this ASCII subset of possible characters to support interoperation with as many other systems as possible.

Does anyone know of any tooling which requires that GraphQL names are ASCII?

Ruby-wise, non-ASCII would be fine.

rmosolgo avatar Jan 04 '17 16:01 rmosolgo

For modern languages like Scala and Swift it should be fine as well. But it may cause issues for Java, for instance. This can potentially be a problem for tools like apollo-ios, which generate client code based on a query and a GraphQL schema. I guess it should be fine for the Swift version of the codegen tool, but not for the Android/Java version.

You can also check this related issue: https://github.com/facebook/graphql/issues/102

OlegIlyenko avatar Jan 05 '17 22:01 OlegIlyenko

I think it is strange that a draft specification in 2017 doesn't cater to languages other than English for its identifiers. I'm afraid this could be a hindrance to adoption in many parts of the world, or lead to "international" forks of the tools (like graphql-go/graphql#153).

I'm not a Java developer, but I don't see why it should be problematic for Java. According to section 3.8 of the Java language spec: "Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages."

I've been playing with Sangria btw, thanks for great work @OlegIlyenko! :)

jarlestabell avatar Jan 09 '17 10:01 jarlestabell

The Go language spec is a good, clear, and concise example of defining non-ASCII identifiers. See the unicode_letter, unicode_digit, letter, and identifier productions starting at https://golang.org/ref/spec#Source_code_representation

If you're going to make this change, it's very important to make a conscious choice about whether or not to canonicalize characters that can be expressed as code points in multiple ways. For example, é can be expressed either as "LATIN SMALL LETTER E WITH ACUTE" (U+00E9) or as "LATIN SMALL LETTER E" (U+0065) followed by "COMBINING ACUTE ACCENT" (U+0301).

I would suggest that, for simplicity of implementation, these be treated as distinct rather than requiring all implementations to normalize all identifiers to composed or decomposed form. That's the choice Go makes:

The text is not canonicalized, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; those are treated as two code points.

Making the other choice is reasonable too, but leaving this undefined is a bad idea.

Here's some more background on normalization, from Unicode itself. They suggest that you should normalize, and recommend a relatively invasive form called NFKC or NFKD for identifiers. Personally, I don't find their arguments convincing, especially for the "K" variants, which substantially change the underlying characters. http://unicode.org/faq/normalization.html http://unicode.org/reports/tr15/
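To make the normalization choice concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper `same_name` is a hypothetical name introduced only for illustration; it is not part of any GraphQL implementation.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Without normalization (the Go approach) the two spellings are distinct.
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two names after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Canonical forms (NFC/NFD) treat the two spellings as one name.
nfc_equal = same_name(composed, decomposed, "NFC")
nfd_equal = same_name(composed, decomposed, "NFD")

# The "K" forms also fold compatibility characters: the ligature U+FB01
# becomes the two letters "fi" under NFKC but is left alone by NFC.
ligature_folded = unicodedata.normalize("NFKC", "\ufb01") == "fi"
ligature_kept = unicodedata.normalize("NFC", "\ufb01") == "\ufb01"
```

Whichever rule a spec picks, every implementation has to apply the same one, otherwise two servers can disagree on whether two names are equal.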

glasser avatar Jan 09 '17 19:01 glasser

BTW I don't know how I missed this but there seems to be a PR about this already: https://github.com/facebook/graphql/pull/231

stubailo avatar Jan 17 '17 04:01 stubailo

@stubailo #231 is not related to the names of identifiers:

However, with the exceptions of {StringValue} and {Comment}, most of GraphQL is expressed only in the original non-control ASCII range so as to be as widely compatible with as many existing tools, languages, and serialization formats as possible and avoid display issues in text editors and source control.

IvanGoncharov avatar Jan 17 '17 11:01 IvanGoncharov

@jarlestabell

lead to "international" forks of the tools. (like graphql-go/graphql#153)

This fork is about adding <>/.: characters and has nothing to do with internationalization:

lexer.go:107 this function accepts [_A-Za-z][_0-9A-Za-z]* charset. We want to expand it a bit with some special characters, such as <>/.:.
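For reference, the charset quoted above can be sketched as a Python check. `is_ascii_name` and `is_unicode_name` are hypothetical helpers; the second one borrows Python's own identifier rule via `str.isidentifier()` purely as a stand-in for a relaxed rule, which is an assumption and not something the spec or any proposal defines.

```python
import re

# The charset quoted from the lexer: a letter or underscore first,
# then letters, digits, or underscores.
NAME_RE = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")


def is_ascii_name(name: str) -> bool:
    """True if the whole string matches the ASCII-only name rule."""
    return NAME_RE.fullmatch(name) is not None


def is_unicode_name(name: str) -> bool:
    """Assumed relaxed rule: reuse Python's Unicode identifier definition."""
    return name.isidentifier()


ascii_ok = is_ascii_name("__schema")
accented_rejected = not is_ascii_name("pr\u00e9nom")   # "prénom"
accented_relaxed = is_unicode_name("pr\u00e9nom")
digit_first_rejected = not is_ascii_name("1abc")
```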

IMHO, even as a non-native English speaker, I don't see any benefits in supporting Unicode beyond comments and string values.

Moreover, it introduces a lot of problems, especially with code-generation. One example is Python:

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9. Python 3.0 introduces additional characters from outside the ASCII range (see PEP 3131).

So Python 2 doesn't support Unicode identifiers, and it's not a unique situation. Dart has the same problem: https://github.com/dart-lang/sdk/issues/2608

Regarding Python 3, it looks like it has no problems with Unicode. But there is one detail:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

That means that if GraphQL doesn't enforce NFKC normalization, names that are distinct in GraphQL can clash in generated Python 3 code. Moreover, there are multiple normalization forms ("NFC", "NFD", "NFKC", "NFKD"), and whichever one GraphQL chose, target languages that normalize differently would still produce errors.
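A rough sketch of how a code generator could detect such clashes ahead of time. The function `nfkc_clashes` and the sample field names are made up for illustration.

```python
import unicodedata
from collections import defaultdict


def nfkc_clashes(names):
    """Group names that become identical after NFKC normalization.

    Returns a dict mapping the normalized name to the original names,
    keeping only groups with more than one member.
    """
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFKC", name)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Hypothetical schema field names: "field" vs. a version spelled with the
# ligature U+FB01, and "café" in composed vs. decomposed form.
field_names = ["field", "\ufb01eld", "caf\u00e9", "cafe\u0301", "total"]
clashes = nfkc_clashes(field_names)
```

Each group in `clashes` holds names that are different strings in GraphQL but would compile to the same Python 3 identifier.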

Beyond code generation, it may cause a lot of issues with debugging. For example, c (Latin) and с (Cyrillic) look the same, and accidentally typing one instead of the other is a pretty common mistake for developers from Slavic countries (both characters are on the same key of the keyboard). Currently, this is easy to diagnose because the GraphQL server produces an error similar to the following:

Syntax Error GraphQL request (2:6) Unexpected character \"\\u0441\"

But it would be really confusing to receive this one:

Cannot query field \"__sсhema\" on type \"Root\".
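If Unicode names were allowed, tooling would probably need something like the following to make such errors diagnosable. This is only a sketch; `describe_non_ascii` is a hypothetical helper, not an existing GraphQL API.

```python
import unicodedata


def describe_non_ascii(name):
    """List (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(name)
        if ord(ch) > 0x7F
    ]


# The field name from the error above, with a Cyrillic "с" in place of "c".
suspicious = describe_non_ascii("__s\u0441hema")

# The genuine introspection field contains no non-ASCII characters.
clean = describe_non_ascii("__schema")
```

An error message could then append the reported code points so the look-alike character is visible.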

Nowadays many database engines etc. support non-ASCII characters in field and table names, making it a lesser experience for users of tools like graphiql if the names they see there have to be "escaped/uglified" versions of the real names.

@jarlestabell As a better alternative to "escaped/uglified" names, you can use a transliteration library, for example this one: https://github.com/andyhu/transliteration

I understand the general problem, though. I think that forcing a few API owners to do manual name mapping is better than forcing many API clients to deal with the issues introduced by Unicode support. As an alternative, I can suggest adding a displayName directive for field and type definitions, which could be used to supply readable names for non-developers.

IvanGoncharov avatar Jan 17 '17 21:01 IvanGoncharov

@IvanGoncharov Thanks for the detailed writeup. Do you think it would be more practical to introduce Unicode only for Enum names?

sorenbs avatar Jan 23 '17 17:01 sorenbs

Do you think it would be more practical to introduce Unicode only for Enum names?

@sorenbs At a glance, supporting Unicode for enum values makes more sense; however, it introduces exactly the same problems I described in the previous comment.

Anyway, you still need to map enum values to user-friendly strings (without capitalization, underscores, and camelCase) before displaying them. So this limitation doesn't affect end users, and working with translated/transliterated values isn't a problem for the majority of developers.

IvanGoncharov avatar Feb 02 '17 14:02 IvanGoncharov

Could support for Unicode identifiers be enabled via an option? I know it may limit which languages I can use, but I actually need it. Transliteration currently can't meet my needs and may cause bigger problems.

s97712 avatar Dec 01 '18 10:12 s97712

I had to find another solution since most of my enums are not in English.

korenzerah avatar Feb 05 '19 08:02 korenzerah

We are building our GraphQL schema from an XML schema with non-ASCII characters in identifiers (typically characters such as é, è). This limitation is very annoying for us, since we allow these characters in our Java and database names.

jeromecambon avatar Nov 24 '21 09:11 jeromecambon