openCypher Issues with whitespace definition


None of Java's Character.isWhitespace(int), Java's Character.isSpaceChar(int), or the Unicode [:White_Space:] specification treats \u180E (MONGOLIAN VOWEL SEPARATOR) as whitespace.

Yet the openCypher grammar treats it as a whitespace character. Why? https://github.com/opencypher/openCypher/blob/346aa0d6ec0d500c84bcc08edb633ba9d8815e73/grammar/basic-grammar.xml#L781

Furthermore, the definition of whitespace in the openCypher grammar does not consider \u0085 (NEXT LINE) to be whitespace, although it is part of the Unicode [:White_Space:] specification. Perhaps it should be added? (Neither Character.isWhitespace(int) nor Character.isSpaceChar(int) considers it whitespace, which explains why it is not in the grammar.)

Apr 04 '19 10:04 thobe

I came across this difference while looking into why the whitespace production rules spell out every whitespace character individually instead of just referencing the Unicode [:White_Space:] specification. So I investigated the difference.

The conclusion of this exercise is that, apart from \u0085 (NEXT LINE), the grammar includes all characters of the Unicode [:White_Space:] specification, and additionally includes \u001C (FILE SEPARATOR), \u001D (GROUP SEPARATOR), \u001E (RECORD SEPARATOR), and \u001F (UNIT SEPARATOR).

Tabulating the characters involved:

| Code Point | `Character.isWhitespace(...)` | `Character.isSpaceChar(...)` | `[:White_Space:]` |
|---|---|---|---|
| `\u0009` | True | False | True |
| `\u000a` | True | False | True |
| `\u000b` | True | False | True |
| `\u000c` | True | False | True |
| `\u000d` | True | False | True |
| `\u001c` | True | False | False |
| `\u001d` | True | False | False |
| `\u001e` | True | False | False |
| `\u001f` | True | False | False |
| `\u0020` | True | True | True |
| `\u0085` | False | False | True |
| `\u00a0` | False | True | True |
| `\u1680` | True | True | True |
| ~~`\u180E`~~ | ~~True (Java 8)~~ | ~~True (Java 8)~~ | ~~True (Unicode 4.0 - 6.2)~~ |
| ~~`\u180E`~~ | ~~False (Java 11)~~ | ~~False (Java 11)~~ | ~~False (Unicode 3.0 - 3.2; 6.3 -)~~ |
| `\u2000` | True | True | True |
| `\u2001` | True | True | True |
| `\u2002` | True | True | True |
| `\u2003` | True | True | True |
| `\u2004` | True | True | True |
| `\u2005` | True | True | True |
| `\u2006` | True | True | True |
| `\u2007` | False | True | True |
| `\u2008` | True | True | True |
| `\u2009` | True | True | True |
| `\u200a` | True | True | True |
| `\u2028` | True | True | True |
| `\u2029` | True | True | True |
| `\u202f` | False | True | True |
| `\u205f` | True | True | True |
| `\u3000` | True | True | True |
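
For anyone who wants to re-check the two Java columns on their own JVM, here is a minimal sketch. The class name `WhitespaceTable` and the choice of sample code points are mine, not anything from openCypher or Neo4j. The result for U+180E depends on the Java version, as noted in the struck-through rows.

```java
// Hypothetical helper; prints the Java columns of the table for a few code points.
final class WhitespaceTable {
    // Sample of code points where the columns of the table disagree, plus two plain cases.
    static final int[] SAMPLE = {
        0x0009, 0x001C, 0x0020, 0x0085, 0x00A0, 0x180E, 0x2007, 0x202F, 0x3000
    };

    // Formats one row: code point, isWhitespace result, isSpaceChar result.
    static String row(int cp) {
        return String.format("\\u%04x\t%b\t%b",
                cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println("Code Point\tisWhitespace\tisSpaceChar");
        for (int cp : SAMPLE) {
            System.out.println(row(cp));
        }
    }
}
```

The `[:White_Space:]` column is not reproduced here, since it comes from the Unicode property data rather than from these two `Character` methods.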

Apr 04 '19 11:04 thobe

If we agree to use the Unicode [:White_Space:] specification, we could define whitespace as:

<production name="whitespace">
  <alt>
    <character set="White_Space"/>
    <character set="FS"/>
    <character set="GS"/>
    <character set="RS"/>
    <character set="US"/>
  </alt>
</production>
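
As a rough illustration of what an implementation of that production could look like in Java, here is a sketch of a predicate. The class and method names are hypothetical. It assumes that `java.util.regex.Pattern`'s `\p{IsWhite_Space}` binary property is an acceptable stand-in for [:White_Space:]; which characters that covers follows the Unicode version of the running JVM.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed definition: White_Space plus FS, GS, RS, US.
final class ProposedWhitespace {
    // Unicode White_Space binary property as implemented by the JVM's regex engine.
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    static boolean isCypherWhitespace(int codePoint) {
        // U+001C..U+001F are FILE, GROUP, RECORD and UNIT SEPARATOR.
        boolean infoSeparator = codePoint >= 0x1C && codePoint <= 0x1F;
        String s = new String(Character.toChars(codePoint));
        return infoSeparator || WHITE_SPACE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("U+0085: " + isCypherWhitespace(0x0085));
        System.out.println("U+001C: " + isCypherWhitespace(0x001C));
        System.out.println("U+0061: " + isCypherWhitespace('a'));
    }
}
```

Dropping the `infoSeparator` branch would give the pure Unicode variant.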

Apr 04 '19 12:04 thobe

Looking at the commit history, it appears that at some point Java's Character.isWhitespace(int) treated \u180E (MONGOLIAN VOWEL SEPARATOR) as whitespace. At least that is what the code comments say. And indeed, Java 8 includes it, but Java 11 does not.
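
A small sketch for checking this on a given JVM (the class name is made up for illustration). As far as I understand, the difference comes from U+180E moving from general category Zs (space separator) to Cf (format) in Unicode 6.3, so the code reports the category the running JVM assigns rather than assuming one.

```java
// Hypothetical check; reports how the running JVM classifies U+180E.
final class MongolianVowelSeparatorCheck {
    static final int MVS = 0x180E; // MONGOLIAN VOWEL SEPARATOR

    static String describe() {
        int type = Character.getType(MVS);
        String category;
        if (type == Character.SPACE_SEPARATOR) {
            category = "Zs (space separator)";
        } else if (type == Character.FORMAT) {
            category = "Cf (format)";
        } else {
            category = "other (type " + type + ")";
        }
        return "Java " + System.getProperty("java.version")
                + ": U+180E is " + category
                + ", isWhitespace=" + Character.isWhitespace(MVS)
                + ", isSpaceChar=" + Character.isSpaceChar(MVS);
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```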

Apr 04 '19 12:04 thobe

I think it makes good sense to stick with Unicode here. Do we even need the special additions of FS, GS, RS and US?

Apr 04 '19 12:04 Mats-SX

The FILE SEPARATOR, GROUP SEPARATOR, RECORD SEPARATOR, and UNIT SEPARATOR have been explicitly treated as whitespace by Java since forever, and thus by the Neo4j Cypher parser.

They are unlikely to occur in Cypher queries. I'd say it's harmless to either include or exclude them.

Apr 04 '19 13:04 thobe

I agree. I would lean towards going with Unicode rather than Java (and abandoning Cypher's implementation history), but I don't feel strongly about it. I wonder whether either of the two alternatives makes a difference for implementability? I doubt it.

Apr 05 '19 08:04 Mats-SX

See #530

Mar 18 '22 14:03 hvub
