
Non-ASCII identifier support

Open toastal opened this issue 2 years ago • 7 comments

Is your feature request related to a problem? Please describe.

English is a weird language. It was the basis of ASCII, but many languages, even ones that also use the Latin script, don’t fit inside its limited character set. As a result, there is a bias towards unaccented Latin characters [A-Za-z]. Since there doesn’t appear to be a bicameral distinction requirement, all writing scripts can & probably should be considered valid for a modern language that doesn’t have the legacy bias of older languages. As it stands, I get unexpected token errors for situations that feel like they should be valid. Consider:

let Pokémon = {
	ID | std.number.PosNat,
	name | String,
	# …
} in

let SomeNorseGods = [| 'Odin, 'Freyr, 'Freyja, 'Þórr, 'Loki, 'Höðr, 'Sága |] in

let SomeGreekGods = [| 'Ἀφροδίτη, 'Ἀπόλλων, 'Ἄρης, 'Περσεφόνη |] in

let Buds = {
	คิว = { },
	แชมป์ = { },
	เมฆ = { },
} in

{ }

This produces unexpected token errors even though every name is written in a valid (according to humans) writing script.

Describe the solution you'd like

If it’s a ‘letter’ in a writing system block, it’s valid. I understand errors for names with spaces or ‘symbols’, but all writing systems should be valid.
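One way to read the ‘letter’ rule is as a check on Unicode general categories. The sketch below assumes that reading; the helper `is_letters_only` is hypothetical and is not anything Nickel implements. It uses Python's standard `unicodedata` module. Under that assumption, names whose spelling needs combining marks, such as two of the Thai names from the example above, would still be rejected, which suggests the rule would have to cover marks as well as letters.

```python
import unicodedata


def is_letters_only(name: str) -> bool:
    """Hypothetical rule: every code point has a general category starting with "L"."""
    return bool(name) and all(
        unicodedata.category(ch).startswith("L") for ch in name
    )


def combining_marks(name: str) -> list[str]:
    """List the code points in `name` whose general category is a mark (M*)."""
    return [
        f"U+{ord(ch):04X}"
        for ch in name
        if unicodedata.category(ch).startswith("M")
    ]


# Names taken from the issue's example (precomposed forms assumed).
for name in ["Pokémon", "Þórr", "คิว", "แชมป์", "เมฆ"]:
    print(name, is_letters_only(name), combining_marks(name))
```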

Describe alternatives you've considered

  • ‘Romanize’ & deburr everything (tho this can lead to errors, as many languages distinguish between ‘e’ & ‘é’).
  • Convert everything to English, which tends to drop accents: its writing system is already a mess &, since its spelling isn’t phonemic, its speakers are used to memorizing weird or misspelled borrowings from other languages (tho there are exceptions, as words like naïve, façade & jalapeño are often spelled with their accents & would still fail).

Additional context

toastal, Aug 17 '23 09:08

For reference, we could and maybe should follow Unicode Standard Annex #31, as Rust itself does for identifiers.

vkleen, Aug 17 '23 10:08

Just to note that this was discussed in the weekly meeting, and it seems easily doable (the hardest part seems to be the vim highlighting).

jneem, Aug 18 '23 15:08

But Tree-sitter for Neovim is still relatively easy?

toastal, Aug 18 '23 18:08

Tree-sitter supports Unicode character classes in its regexes, so we should be able to just use XID_START and XID_CONTINUE.
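As a rough illustration of what an XID_Start / XID_Continue rule would accept, the sketch below uses Python's `str.isidentifier()`, which is defined in terms of those same two properties (with `_` additionally allowed as a first character). It is only a stand-in: it is neither the Tree-sitter grammar nor Nickel's lexer, and the wrapper name `is_uax31_identifier` is made up for this example.

```python
def is_uax31_identifier(name: str) -> bool:
    """Stand-in check: first char XID_Start (or "_"), the rest XID_Continue.

    Assumption: Python's own identifier rule is close enough to UAX #31's
    default identifier syntax for illustration purposes.
    """
    return name.isidentifier()


# Names from the original issue (record fields and enum tags, without the "'").
names = ["Pokémon", "Þórr", "Höðr", "Sága", "Ἀφροδίτη", "Περσεφόνη", "คิว", "แชมป์", "เมฆ"]

results = {name: is_uax31_identifier(name) for name in names}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

Unlike a letters-only rule, XID_Continue includes combining marks, so the Thai names are covered too.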

jneem, Aug 18 '23 19:08

Maybe that's the occasion to get rid of the vim plugin and advise people to use the tree-sitter grammar instead? That would be one less grammar to maintain.

yannham, Aug 28 '23 10:08

If there is a vim-tree-sitter or similar plugin, I wouldn’t see the harm. Even though I have been using Neovim for like 8 years, I think cutting OG Vim support would be problematic.

toastal, Aug 28 '23 10:08

> For reference, we could and maybe should follow Unicode Standard Annex #31, as Rust itself does for identifiers.

That's too limiting: although it does allow weirdness like 𓀈𓀀, it wouldn't allow more helpful config names like move←, move→, move↑, move↓.
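The trade-off described here can be checked with a small sketch. It again leans on Python's `str.isidentifier()` as an approximation of the UAX #31 default rule (an assumption, not Nickel's behaviour), and the helper `non_identifier_chars` is hypothetical. The arrows are math symbols (general category Sm), which XID_Continue excludes, while the hieroglyphs are ordinary letters (Lo).

```python
import unicodedata


def non_identifier_chars(name: str) -> list[tuple[str, str]]:
    """Return (char, general category) for each char that cannot continue an identifier.

    Prefixing a known-good start character turns isidentifier() into a test
    of whether `ch` alone is acceptable in a non-initial position.
    """
    return [
        (ch, unicodedata.category(ch))
        for ch in name
        if not ("a" + ch).isidentifier()
    ]


candidates = ["move←", "move→", "move↑", "move↓", "𓀈𓀀"]

for name in candidates:
    print(name, name.isidentifier(), non_identifier_chars(name))
```

Supporting names like these would mean extending the identifier rule beyond the default XID classes, for example with an explicit allow-list of extra characters.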

eugenesvk, Dec 15 '23 20:12