Non-ASCII characters can't be part of knot and variable names?

Open fireton opened this issue 9 years ago • 13 comments

Seems like they can't, which is sad since I am writing in Russian. If the whole ink script uses Unicode, maybe it's a good idea to allow non-ASCII characters in knot and variable names?

fireton avatar Nov 10 '16 14:11 fireton

@joethephish Any comments? Sorry for pushing it but...

fireton avatar Nov 17 '16 07:11 fireton

Hmm, it's a good question - my instinct is that we shouldn't allow non-ASCII, though perhaps that's more grounded in old school programming tradition for identifiers rather than any genuine good reason!

Apologies, I'm really busy working toward a milestone on our new game here at inkle, but if you or anyone else wanted to experiment with changing the behaviour, the relevant parse function is here: https://github.com/inkle/ink/blob/master/inklecate/InkParser/InkParser_Logic.cs#L267

Note that the main aim with writing the identifier parsing function is that it needs to be non-ambiguous with other parts of the language. So (obviously) you need to make sure it doesn't accept the kind of punctuation that would be used in other parts of ink.

If you were to accept Russian characters, I'd suggest an "opt-out" approach where you say "allow all characters except space and these symbols" as opposed to the current "opt-in", which says to specifically allow a-z, 0-9, _.
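
To make the two approaches concrete, here is a minimal sketch in Python (not the C# of the actual parser). It is illustrative only: the function names are hypothetical, the deny-list is a guess rather than ink's real set of reserved punctuation, and accepting upper-case ASCII letters in the opt-in set is an assumption.

```python
import string

# Opt-in: accept only explicitly listed characters. The thread describes the
# current rule as "a-z, 0-9, _"; including upper-case letters is an assumption.
OPT_IN_ALLOWED = frozenset(string.ascii_letters + string.digits + "_")

# Opt-out: accept everything except whitespace and a deny-list of symbols.
# This deny-list is illustrative only, not ink's real reserved punctuation.
OPT_OUT_DENIED = frozenset("=-><*+~{}()[]|.,:;!?\"'#/\\&^%@")


def is_identifier_opt_in(name: str) -> bool:
    """True if every character is on the allow-list."""
    return bool(name) and all(ch in OPT_IN_ALLOWED for ch in name)


def is_identifier_opt_out(name: str) -> bool:
    """True if no character is whitespace or on the deny-list."""
    return bool(name) and not any(
        ch.isspace() or ch in OPT_OUT_DENIED for ch in name
    )


if __name__ == "__main__":
    for candidate in ["my_knot", "начало", "a->b", "knot«"]:
        print(
            candidate,
            is_identifier_opt_in(candidate),
            is_identifier_opt_out(candidate),
        )
```

In this sketch the opt-out check accepts `начало`, but it also accepts any character that happens to be missing from the deny-list, such as `«`, so the deny-list has to be complete for the grammar to stay unambiguous.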

joethephish avatar Nov 17 '16 08:11 joethephish

Hrm, markdown support isn't a great idea, I don't think. Mainly because two of the most important markdown features, *italic* and **bold**, already conflict with ink's choice syntax.

I'm not entirely sure what you mean by the escaping - would the "master" format be markdown (escaping ink) or would the "master" format be ink (escaping markdown)?

joethephish avatar Nov 17 '16 16:11 joethephish

Ink can indeed escape characters:

* \*\*Markdown-style bold\*\* text

But not sure why you'd want to do that... doesn't look like the most sensible way to write! We're planning to add support for non-markdown syntax like _italic_ at some point, though it would be purely on the runtime side.

joethephish avatar Nov 18 '16 09:11 joethephish

> If you were to accept Russian characters, I'd suggest an "opt-out" approach where you say "allow all characters except space and these symbols" as opposed to the current "opt-in", which says to specifically allow a-z, 0-9, _.

Unicode contains a lot of characters, and many of them are not letters. It would be a tough task to find all those non-letters and exclude them from identifiers.

In fact, it would be a good idea to make this an optional setting, for example by putting something like this at the beginning of the script:

IDENTIFIERS а-я, А-Я

That would completely solve my problem. :) Unfortunately I'm not much into C#, so I'm very unlikely to code it myself and offer a pull request...

fireton avatar Nov 18 '16 09:11 fireton

Hello,

I've started working on this. I don't consider the opt-out approach a good idea, as one needs to list all the non-identifier characters, which is error-prone since there is a vast number of non-identifier characters in the Unicode universe (or should I say the Unicode multiverse). It would be easy to miss something and then get into a mess.

Instead I plan on implementing support for curated character ranges. The idea is that the author can explicitly allow a certain character range matching their language, rather than all possible characters. The ranges will be based on the Unicode table for the different cultures; some suitable examples can be taken from here: http://jrgraphix.net/research/unicode_blocks.php.

In addition, @fireton's idea for manually including a given character range on-demand would seem reasonable, instead of having all the ranges precompiled or pre-allocated in memory. I am thinking of something like:

ENABLE CHRANGE "Cyrillic"

where the string would be unique name for the desired range. The supported ranges could be listed in documentation for reference. Also, this would allow for growing support of character ranges in the future, meaning we don't have to support all of them right away.
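
A rough sketch of how such named, curated ranges could be modelled, written in Python rather than the C# of the actual compiler. Every name here (`CharacterRange`, `KNOWN_RANGES`, `IdentifierRules`) is hypothetical and not taken from ink; the Cyrillic bounds are the Unicode block U+0400 to U+04FF, and excluding U+0482 to U+0489 (a symbol and combining marks) is one possible curation, not necessarily the one chosen in the PR.

```python
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterRange:
    """A named, inclusive code point range with curated exclusions."""
    name: str
    start: str
    end: str
    excludes: frozenset = frozenset()

    def contains(self, ch: str) -> bool:
        return self.start <= ch <= self.end and ch not in self.excludes


# Hypothetical registry keyed by the unique range name used in the script.
KNOWN_RANGES = {
    "Cyrillic": CharacterRange(
        name="Cyrillic",
        start="\u0400",
        end="\u04FF",
        # U+0482..U+0489 are a symbol and combining marks, not letters.
        excludes=frozenset(chr(cp) for cp in range(0x0482, 0x048A)),
    ),
}

ASCII_IDENTIFIER_CHARS = frozenset(string.ascii_letters + string.digits + "_")


class IdentifierRules:
    """ASCII identifiers plus whichever named ranges the author enabled."""

    def __init__(self) -> None:
        self.enabled = []

    def enable(self, range_name: str) -> None:
        # Raises KeyError for a name that is not in the registry.
        self.enabled.append(KNOWN_RANGES[range_name])

    def is_identifier(self, name: str) -> bool:
        return bool(name) and all(
            ch in ASCII_IDENTIFIER_CHARS
            or any(r.contains(ch) for r in self.enabled)
            for ch in name
        )


if __name__ == "__main__":
    rules = IdentifierRules()
    print(rules.is_identifier("начало"))  # nothing enabled yet
    rules.enable("Cyrillic")
    print(rules.is_identifier("начало"))
```

Because ranges are looked up by name only when enabled, new ranges can be added to the registry later without affecting scripts that never ask for them.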

So, let me know if you like the idea.

PS: Please, ignore the stuff for the markdown support, it is something I want to work on, but does not feel relevant at all to the OP's issue here. Let's not pollute the character ranges topic with that, I'll make another issue / PR for the markdown support if someone is interested

stackh34p avatar Nov 24 '16 17:11 stackh34p

Thanks, @ivaylo5ev! With this addition ink will be an even better tool for all non-programmer authors!

fireton avatar Nov 25 '16 09:11 fireton

Just to inform on the progress of this, I have been able to introduce several character ranges so far:

  • Extended Latin A
  • Extended Latin B
  • Cyrillic
  • Arabic

Soon to be defined:

  • Arabic
  • Hebrew
  • Armenian
  • eventually Greek

The latter take some work as the original unicode ranges need to be further curated in order to discard non-letter characters. For some of these I cannot be of much use directly as I do not know the respective languages and I am relying on the assistance of some friends of mine.
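
As an illustration of that curation step, the sketch below uses Python's standard `unicodedata` module to list the code points in a block that are not letters. Treating "letter" as any Unicode General Category starting with `L` is an assumption; the thread does not say which criterion was used, and the helper name is hypothetical.

```python
import unicodedata


def non_letters_in_block(start: int, end: int) -> list:
    """Return code points in [start, end] whose General Category is not L*.

    Unassigned code points have category "Cn" and are reported as well.
    """
    return [
        cp
        for cp in range(start, end + 1)
        if not unicodedata.category(chr(cp)).startswith("L")
    ]


if __name__ == "__main__":
    # Cyrillic block, U+0400..U+04FF.
    for cp in non_letters_in_block(0x0400, 0x04FF):
        ch = chr(cp)
        print(f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch, "?"))
```

The resulting list is a starting point for a range's exclusions; whether combining marks should really be rejected in identifiers is a judgement call for someone who knows the script.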

In addition, my changes are currently based on another PR of mine which so far seems to be a long lived one and I will eventually port them to the most up-to-date master in order to deliver them faster.

Also, I will need a few more NUnit tests to verify the feature. If all goes well, I will be done in a couple of weeks, hopefully before the Christmas and New Year vacation days.

Cheers

stackh34p avatar Dec 12 '16 13:12 stackh34p

@ivaylo5ev, any progress on this?

fireton avatar Mar 06 '17 12:03 fireton

@fireton I've been having some personal matters these few months. I will try to complete this in a couple of weeks, mostly it needs some unit tests and documentation.

stackh34p avatar Mar 14 '17 17:03 stackh34p

I am resuming work on this now. I am experiencing some issues with divert variables and divert names at the moment, which prevent some tests from passing. I need some time to check whether there is a deeper issue with ink itself or whether it is entirely caused by the new feature.

stackh34p avatar Apr 10 '17 11:04 stackh34p

I have now managed to prepare a PR for this. I apologize for the long delay on this anticipated feature. I did not expect it either, but I had personal issues that prevented me from properly focusing on this and completing it sooner. I hope the PR is received well and merged into the main codebase.

stackh34p avatar Jul 13 '17 12:07 stackh34p

I tried inklecate version 0.9.0 and latin chars are not working for knots, for example:

-> começo
=== começo ===
Era uma vez...
-> END

I've started to work with "ink" just today, sorry if I missed something.

-- edit -- I built the latest version from this repo and it worked. I'm trying to use inky and haven't found how to test it with this build of inklecate.

@joethephish

bkmeneguello avatar Feb 19 '21 21:02 bkmeneguello