Issues with whitespace definition
Neither Java's Character.isWhitespace(int) nor Character.isSpaceChar(int) nor the Unicode [:White_Space:] specification treats \u180E (MONGOLIAN VOWEL SEPARATOR) as whitespace.
Yet the openCypher grammar considers it a whitespace character. Why? https://github.com/opencypher/openCypher/blob/346aa0d6ec0d500c84bcc08edb633ba9d8815e73/grammar/basic-grammar.xml#L781
Furthermore, the definition of whitespace in the openCypher grammar does not include \u0085 (NEXT LINE), although it is part of the Unicode [:White_Space:] specification. Perhaps it should be added? (Neither Character.isWhitespace(int) nor Character.isSpaceChar(int) considers it whitespace, which explains why it is not in the grammar.)
I came across this difference when looking into why the whitespace production rules spell out all whitespace characters individually instead of just referencing the Unicode [:White_Space:] specification, so I investigated further.
The conclusion of this exercise is that, apart from \u0085 (NEXT LINE), the grammar includes all characters of the Unicode [:White_Space:] specification. Besides \u180E, it additionally includes \u001C (FILE SEPARATOR), \u001D (GROUP SEPARATOR), \u001E (RECORD SEPARATOR), and \u001F (UNIT SEPARATOR).
Tabulating the characters involved:
Code Point | Character.isWhitespace(...) | Character.isSpaceChar(...) | [:White_Space:] |
---|---|---|---|
\u0009 | True | False | True |
\u000a | True | False | True |
\u000b | True | False | True |
\u000c | True | False | True |
\u000d | True | False | True |
\u001c | True | False | False |
\u001d | True | False | False |
\u001e | True | False | False |
\u001f | True | False | False |
\u0020 | True | True | True |
\u0085 | False | False | True |
\u00a0 | False | True | True |
\u1680 | True | True | True |
~~\u180E~~ | ~~True (Java 8)~~ | ~~True (Java 8)~~ | ~~True (Unicode 4.0 - 6.2)~~ |
~~\u180E~~ | ~~False (Java 11)~~ | ~~False (Java 11)~~ | ~~False (Unicode 3.0 - 3.2; 6.3 -)~~ |
\u2000 | True | True | True |
\u2001 | True | True | True |
\u2002 | True | True | True |
\u2003 | True | True | True |
\u2004 | True | True | True |
\u2005 | True | True | True |
\u2006 | True | True | True |
\u2007 | False | True | True |
\u2008 | True | True | True |
\u2009 | True | True | True |
\u200a | True | True | True |
\u2028 | True | True | True |
\u2029 | True | True | True |
\u202f | False | True | True |
\u205f | True | True | True |
\u3000 | True | True | True |
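For anyone who wants to reproduce rows of this table locally, here is a minimal Java sketch. The class and method names (WhitespaceTable, isUnicodeWhiteSpace, row) are made up for illustration. It assumes that the `\p{IsWhite_Space}` binary property of java.util.regex is an acceptable stand-in for the Unicode [:White_Space:] specification; since that property follows the Unicode data bundled with the JDK, the result for \u180E depends on the Java version you run it on.

```java
import java.util.regex.Pattern;

class WhitespaceTable {
    // White_Space binary property as implemented by java.util.regex (assumption:
    // this is close enough to the Unicode [:White_Space:] set for comparison).
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    static boolean isUnicodeWhiteSpace(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return WHITE_SPACE.matcher(s).matches();
    }

    // Formats one row in the same column order as the table:
    // code point | isWhitespace | isSpaceChar | White_Space
    static String row(int codePoint) {
        return String.format("\\u%04x | %b | %b | %b |",
                codePoint,
                Character.isWhitespace(codePoint),
                Character.isSpaceChar(codePoint),
                isUnicodeWhiteSpace(codePoint));
    }

    public static void main(String[] args) {
        // A few interesting code points; U+180E is the version-dependent one.
        int[] samples = {0x0009, 0x001C, 0x0020, 0x0085, 0x00A0, 0x180E, 0x2007, 0x3000};
        for (int cp : samples) {
            System.out.println(row(cp));
        }
    }
}
```

Running it on both Java 8 and Java 11 is a quick way to check the two struck-through \u180E rows.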
If we agree to use the Unicode [:White_Space:] specification, we could define whitespace as:
<production name="whitespace">
  <alt>
    <character set="White_Space"/>
    <character set="FS"/>
    <character set="GS"/>
    <character set="RS"/>
    <character set="US"/>
  </alt>
</production>
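To get a feel for what that production would accept, here is a rough Java sketch of an equivalent predicate. The names ProposedWhitespace and isProposedWhitespace are hypothetical, and it again assumes that `\p{IsWhite_Space}` from java.util.regex is a fair approximation of the Unicode set. It is not how any existing Cypher parser implements whitespace.

```java
import java.util.regex.Pattern;

class ProposedWhitespace {
    // Assumption: java.util.regex's White_Space property stands in for [:White_Space:].
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    // Hypothetical predicate mirroring the proposed production:
    // Unicode White_Space plus FS, GS, RS and US (U+001C to U+001F).
    static boolean isProposedWhitespace(int codePoint) {
        boolean isSeparatorControl = codePoint >= 0x001C && codePoint <= 0x001F;
        return isSeparatorControl
                || WHITE_SPACE.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // Print every BMP code point where the proposal and
        // Character.isWhitespace disagree.
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (isProposedWhitespace(cp) != Character.isWhitespace(cp)) {
                System.out.printf("U+%04X%n", cp);
            }
        }
    }
}
```

The disagreements it prints should be the code points where the first and last columns of the table differ once FS, GS, RS and US are added back, namely \u0085 and the non-breaking spaces \u00a0, \u2007 and \u202f.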
Looking at the commit history, it appears that at some point Java's Character.isWhitespace(int) treated \u180E (MONGOLIAN VOWEL SEPARATOR) as whitespace. At least that is what the code comments say. And indeed, it is included in Java 8 but not in Java 11.
I think it makes good sense to stick with Unicode here. Do we even need the special additions of FS, GS, RS and US?
The FILE SEPARATOR, GROUP SEPARATOR, RECORD SEPARATOR, and UNIT SEPARATOR characters have been explicitly treated as whitespace by Java since forever, and thus by the Neo4j Cypher parser.
They are unlikely to occur in Cypher queries. I'd say it's harmless to either include or exclude them.
I agree. I would lean towards going with Unicode rather than Java (and abandoning Cypher's implementation history), but I don't feel strongly about it. I wonder if either of the two alternatives makes a difference for implementability? I doubt it.
See #530