factor More terse syntax for character literals

Currently, the Factor syntax for character literals is verbose compared to other languages: Factor needs CHAR: a where 'a' is sufficient in many other languages. Can we introduce a shorter syntax for character literals?

We had some discussion on the Discord, where we talked about some options and their impact. I'll summarize below.

Jul 26 '22 08:07 nomennescio

Implement character literals as a library, with 'a' a syntax word equivalent to CHAR: a. That would add many words as aliases for the literals. @mrjbq7 didn't like that.

Jul 26 '22 08:07 nomennescio

@mrjbq7 mentioned adapting the lexer, but what syntax to use? I mentioned that it will reduce the names you can use for words, so this has to be carefully considered, as I think we don't want to exclude any use of ' in a word name. To support this, any name that doesn't adhere to the character syntax must not be an error, but a valid word name.

Both 'a' and 'a are being considered. The question remains how to handle long character names.
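
The constraint above (a token that does not fit the character syntax must stay a valid word name, not become an error) can be sketched as a token classifier. This is a minimal Python sketch, not Factor's lexer: the function name `classify_token` and the exact rule (a quote, exactly one code point, a quote) are assumptions made for illustration.

```python
def classify_token(token):
    """Return ("char", code_point) for tokens shaped like 'x', else ("word", token).

    Assumed rule: a single quote, exactly one code point, a single quote.
    Anything else, including names that merely contain quotes, stays a word.
    """
    if len(token) == 3 and token[0] == "'" and token[2] == "'":
        return ("char", ord(token[1]))
    return ("word", token)


# Word names with quotes anywhere are still words under this rule.
for tok in ["'a'", "'☃'", "foo'", "x''", "'", "'ab'"]:
    print(tok, classify_token(tok))
```

Under this rule 'ab' falls through to a word name instead of raising an error. The cost is that three-code-point names of the form 'x' can no longer be used as word names.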

Jul 26 '22 08:07 nomennescio

Some variations that could be considered:

' snowman
'snowman'
` snowman
`snowman
char: snowman
ch: snowman
ch'snowman
char"snowman"

Note: some of our words have a single-quote at the end (e.g., foo, foo', and foo'') and we use that sometimes in the math-y sense of ( x -- x' ), so we might want that to still work.

Jul 26 '22 08:07 mrjbq7

I proposed a simple non-conflicting change to just use ALIAS: ' CHAR: , which gives ' a as syntax, just as short as 'a', and is similar to \ in its "escape" style. It doesn't need a change in the lexer, but doesn't prevent it either.
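
To illustrate why a prefix word needs no lexer change, here is a small Python model of a whitespace tokenizer in which ' behaves like CHAR: and consumes the next token. It is only a sketch under stated assumptions: `parse` is a hypothetical helper, it is not how Factor's parser is implemented, and long character names are deliberately left out.

```python
def parse(source):
    """Parse whitespace-separated tokens; CHAR: and ' consume the next token.

    Hypothetical model of a prefix parsing word: tokenization is untouched,
    so tokens such as foo' or x'' are still ordinary word names.
    """
    char_words = {"CHAR:", "'"}
    tokens = source.split()
    result = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in char_words:
            if i + 1 >= len(tokens):
                raise ValueError(f"{tok} expects a token after it")
            name = tokens[i + 1]
            if len(name) != 1:
                raise ValueError(f"long character names are not modelled: {name}")
            result.append(ord(name))  # the literal is just an integer code point
            i += 2
        else:
            result.append(tok)  # any other token is kept as a word name
            i += 1
    return result


print(parse("' a CHAR: a foo' x''"))
```

Because ' only has meaning as a standalone token, names that contain or end with a quote are unaffected in this model.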

Jul 26 '22 08:07 nomennescio

We also should likely move away from using "character" to refer to "code points"... and then do we want "grapheme syntax"?

Jul 26 '22 08:07 mrjbq7

Some variations that could be considered:

...

Note: some of our words have a single-quote at the end (e.g., foo, foo', and foo'') and we use that sometimes in the math-y sense of ( x -- x' ), so we might want that to still work.

For sure we still need to be able to use ' in word names, and not only at the end. Of course a change in the lexer will make some word names "illegal", something to consider when trying to minimize that impact.

Jul 26 '22 08:07 nomennescio

And do we want to support embedded utf-8 code points or graphemes:

' ☃
'☃'
u'☃'
ch: ☃
char: ☃
unicode: ☃

Jul 26 '22 08:07 mrjbq7

And do we want to support embedded utf-8 code points or graphemes:

I think that would just be the logical thing to do, if we support full unicode anyway. I guess having a word ☃ would still need to be legal, as it currently is.

As such, it wouldn't make sense either to have a different syntax for unicode code points, as all chars are integers anyway.

Jul 26 '22 08:07 nomennescio

We also should likely move away from using "character" to refer to "code points"... and then do we want "grapheme syntax"?

For me I already think "UTF-8 code point" when I see "character", it's just that the ASCII characters are embedded in Unicode and are the most used ones.

Can you explain what you mean by grapheme syntax?

Jul 26 '22 08:07 nomennescio

Some variations that could be considered:
' snowman
'snowman'
` snowman
`snowman
char: snowman
ch: snowman
ch'snowman
char"snowman"

In my opinion, the ones starting with ch are not terse enough, and don't improve upon current CHAR:. I also don't like using backquote, as it typically serves a different purpose in other languages; why introduce a conflicting convention?

That would leave ' a and 'a' from that list.

Jul 26 '22 08:07 nomennescio

When a more terse syntax is introduced, maybe it's an idea to check the Factor sources and see where it's more appropriate to replace integer literals with char literals? Or would that negatively impact compile times too much?

Aug 01 '22 07:08 nomennescio

You mean after we make a different character syntax? Or is there a cleanup we should do now?

Aug 01 '22 12:08 mrjbq7

After a different syntax. I've seen many places where ASCII integer literals are used, but that might be an artefact of seeing words, which does not show the code executed at compile time; that code might have 'proper' character literals.

Aug 08 '22 20:08 nomennescio

How about modelling it after string literal syntax? 'a', ' ', '\u{snowman}', '☃'

Dec 09 '22 15:12 gifti258

Further complicated by whether it refers to a single code point or a glyph.

Dec 09 '22 15:12 mrjbq7

Couldn't that easily be remedied by checking if there are 2 or more codepoints between the quotes and not parsing if there are?

Dec 09 '22 15:12 gifti258

Sure, I’m just saying we probably want literal glyph syntax, too? Or I suppose that could just be a string whose length in glyphs is 1. Strings should have iterators over glyphs, code points, and encodings.

Dec 09 '22 15:12 mrjbq7

Personally, I would like the idea of C" for character literals. The double quote emphasizes the relation to strings, and it's similar to the already-existing P" syntax. I feel like character literals aren't common enough to dedicate ' to; it would be nice to have ' available for other purposes, or left open for the user to define.

Jan 03 '23 05:01 defaultxr

That would look like:

C"a"
C"☃"
C"snowman"
C" "
C"space"

Jan 04 '23 17:01 mrjbq7

C"unicodename" looks a bit too much like a string, no? I think the unicodename needs to look more like an identifier (not in quotes) or have a backslash next to it (universally recognized as a big warning that it's not a plain string), like in Gifti's suggestion of '\u{snowman}'.

Also, is it clear to everyone what this syntax is supposed to return? Is it a number (a Unicode code point)? (I guess you mean Unicode code point when you say UTF-8 code point, as UTF-8 is just one possible encoding of Unicode.) Or do you want to create a new type or abstraction for graphemes/listofunicode: something that represents a single unit of display?

To keep things simple, I think that what this syntax calls "characters" should just be a single number: a Unicode code point. Anything else is a string?

ps: recap of definitions just to make communication easier

Character is an overloaded term that can mean many things.

a code point is a number which is given meaning by the Unicode standard.

A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit

from https://stackoverflow.com/a/27331885/
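
The distinction between code point, encoding, and grapheme can be made concrete with a short example. Python is used here only as a neutral illustration (its strings are sequences of code points); the variable names are made up for this sketch and nothing here reflects Factor's string implementation.

```python
import unicodedata

snowman = "☃"
snowman_code_point = ord(snowman)            # the single integer a literal would yield
snowman_utf8 = snowman.encode("utf-8")       # UTF-8 is one encoding of that code point
snowman_utf16 = snowman.encode("utf-16-le")  # UTF-16 is another encoding of it

# One grapheme built from two code points: 'e' plus COMBINING ACUTE ACCENT.
e_acute = "e\u0301"
e_acute_code_points = [ord(c) for c in e_acute]
e_acute_composed = unicodedata.normalize("NFC", e_acute)  # precomposed form

print(hex(snowman_code_point), unicodedata.name(snowman))
print(len(snowman_utf8), len(snowman_utf16))  # byte counts differ per encoding
print(len(e_acute), len(e_acute_composed))    # code point counts differ per form
```

A literal syntax that yields one code point can represent the snowman directly, while the two-code-point form of the accented e would have to be a string, even though it displays as a single unit.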

Jan 05 '23 08:01 jonenst

Personally, I would like the idea of C" for character literals. The double quote emphasizes the relation to strings, and it's similar to the already-existing P" syntax. I feel like character literals aren't common enough to dedicate ' to; it would be nice to have ' available for other purposes, or left open for the user to define.

I understand your concern about being careful not to waste "special" characters, and I agree with that concern. However, the whole point of discussing an alternative to CHAR: is to have something more terse, and C"-" is only two characters less... I would prefer for now ', hence ' a, and would think with a new parser 'a' would also be fine. The latter would not waste special characters, because it's fully parsed, leaving prefixes with ' free for other purposes.

Jan 05 '23 09:01 nomennescio

I’m not sure the future parser would have a mode to allow ' a and 'a' to be different forms, but I guess anything is possible. It seems like that would more frequently be a user error, and providing better error messages would be more useful than supporting both syntax forms.

Jan 05 '23 14:01 mrjbq7

However, the whole point of discussing an alternative to CHAR: is to have something more terse, and C"-" is only two characters less...

My thinking is also: how often are character literals really used in regular code? Is it enough to warrant dedicating a single-character shortcut to? In my experience, I don't use character literals very often, even when working in string-centric code.

Indeed, in the Factor source, the number of lines containing CHAR: is of the same order of magnitude as the number containing P":

/usr/lib/factor$ grep -ir --include '*.factor' 'CHAR:' * | wc -l
1920

/usr/lib/factor$ grep -ir --include '*.factor' 'P"' * | wc -l
1206

Of course, it's just my opinion, and I will continue to use Factor either way.
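
One caveat on the counts above: grep -i matches case-insensitively anywhere in a line, so the pattern P" also matches lines containing a string that ends in p, such as "help". A stricter count would only accept the prefix at the start of a whitespace-delimited token. The following Python sketch shows that idea; `count_lines` and the sample lines are made up for illustration and were not run against the Factor sources.

```python
import re


def count_lines(text, prefix):
    """Count lines with a whitespace-delimited token that starts with prefix."""
    # (?<!\S) requires start of line or whitespace right before the prefix.
    pattern = re.compile(r"(?<!\S)" + re.escape(prefix))
    return sum(1 for line in text.splitlines() if pattern.search(line))


# Illustrative sample lines, not taken from the Factor sources.
sample = '''\
: vowel? ( ch -- ? ) CHAR: a = ;
"help" print
P" /tmp/file" delete-file
'''

print(count_lines(sample, "CHAR:"), count_lines(sample, 'P"'))
```

On this sample the second line is not counted for P", whereas a case-insensitive substring match would count it.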

Jan 06 '23 00:01 defaultxr

My thinking is also: how often are character literals really used in regular code? Is it enough to warrant dedicating a single-character shortcut to? In my experience, I don't use character literals very often, even when working in string-centric code.

/usr/lib/factor$ grep -ir --include '*.factor' 'CHAR:' * | wc -l
1920

Almost 2000 times is still a lot, and in some places literal integers are possibly used instead of CHAR:. But in my opinion it's also about making the language more convenient for people coming from other languages, where character literals are so trivial that nobody even thinks about them until they're confronted with CHAR:.

With a new parser, parsing 'a' would not dedicate a single char shortcut. However, to be fair, what other use could the ' shortcut have in Factor? We already have \, and support strings with ", so what would be a non-confusing use of '?

Jan 06 '23 10:01 nomennescio

I’m saying it would probably be better to have a “non-terminated character” error than support both types of syntax:

' foo
'a'

I dunno, since this is all theoretical and I’m unsure about dedicating single quotes to a code point feature, I'm still thinking about syntax options.

Jan 06 '23 15:01 mrjbq7

I'm not proposing to have both types of syntax at the same time. Just that one is dependent on the new parser, which is not yet there. Even then both options can make sense, but I indeed only want to pick one.

Jan 06 '23 15:01 nomennescio

Oh, I get it. I think either example syntax could be supported in the current parser pretty easily.

Jan 06 '23 15:01 mrjbq7
