More terse syntax for character literals
Currently, the Factor syntax for character literals is very verbose compared to other languages: Factor requires CHAR: a, where 'a' is typically sufficient elsewhere. Can we introduce a shorter syntax for character literals?
We had some discussion on the Discord, where we talked about some options and their impacts. I'll summarize below.
Implement character literals as a library, with 'a' as a syntax word equivalent to CHAR: a. That would add many words as aliases for the literals. @mrjbq7 didn't like that.
@mrjbq7 mentioned adapting the lexer, but what syntax to use? I mentioned that it will reduce the names you can use for words, so this has to be carefully considered, as I think we don't want to exclude any use of ' in a word name. To support this, any name that doesn't adhere to the character syntax must not be an error, but a valid word name.
Both 'a' and 'a are considered. The question remains how to handle long character names.
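To make the "not an error, but a valid word name" rule concrete, here is a minimal sketch in Python rather than Factor. The function name classify_token and the exact rule (opening quote, exactly one code point, closing quote) are assumptions for illustration only, not a description of the Factor lexer.

```python
def classify_token(token):
    """Classify one whitespace-delimited token.

    Returns ("char", code_point) for the hypothetical 'x' form,
    otherwise ("word", token).
    """
    # Assumed rule: opening quote, exactly one code point, closing quote.
    if len(token) == 3 and token[0] == "'" and token[-1] == "'":
        return ("char", ord(token[1]))
    # Anything else, including names that merely contain ', stays a word name.
    return ("word", token)
```

With this rule, names such as foo', foo'' and x' are still words, while 'a' becomes the integer 97. A space literal such as ' ' would not arrive as a single token in a whitespace-delimited lexer, so it would need separate handling.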
Some variations that could be considered:
' snowman
'snowman'
` snowman
`snowman
char: snowman
ch: snowman
ch'snowman
char"snowman"
Note: some of our words have a single-quote at the end (e.g., foo, foo', and foo'') and we use that sometimes in the math-y sense of ( x -- x' ), so we might want that to still work.
I proposed a simple non-conflicting change to just use ALIAS: ' CHAR:, which gives ' a as syntax, just as short as 'a', and is similar to \ in its "escape" style. It doesn't need a change in the lexer, but doesn't prevent it either.
We also should likely move away from using "character" to refer to "code points"... and then do we want "grapheme syntax"?
Some variations that could be considered:
...
Note: some of our words have a single-quote at the end (e.g., foo, foo', and foo'') and we use that sometimes in the math-y sense of ( x -- x' ), so we might want that to still work.
For sure we still need to be able to use ' in word names, and not only at the end. Of course a change in the lexer will make some word names "illegal", something to consider when trying to minimize that impact.
And do we want to support embedded utf-8 code points or graphemes:
' ☃
'☃'
u'☃'
ch: ☃
char: ☃
unicode: ☃
And do we want to support embedded utf-8 code points or graphemes:
I think that would just be the logical thing to do, if we support full unicode anyway. I guess having a word ☃ would still need to be legal, as it currently is.
As such, it wouldn't make sense either to have a different syntax for unicode code points, as all chars are integers anyway.
We also should likely move away from using "character" to refer to "code points"... and then do we want "grapheme syntax"?
For me I already think "UTF-8 code point" when I see "character", it's just that the ASCII characters are embedded in Unicode and are the most used ones.
Can you explain what you mean by grapheme syntax?
Some variations that could be considered:
' snowman
'snowman'
` snowman
`snowman
char: snowman
ch: snowman
ch'snowman
char"snowman"
In my opinion, the ones starting with ch are not terse enough, and don't improve upon the current CHAR:. I also don't like using backquote, as it typically serves a different purpose in other languages; why introduce a conflicting convention?
That would leave ' a and 'a' from that list.
When a more terse syntax is introduced, maybe it's an idea to check the Factor sources and see where it's more appropriate to replace integer literals with char literals? Or would that negatively impact compile times too much?
You mean after we make a different character syntax? Or is there a cleanup we should do now?
On Aug 1, 2022, at 2:20 AM, nomennescio wrote:
When a more terse syntax is introduced, maybe it's an idea to check the Factor sources and see where it's more appropriate to replace integer literals with char literals? Or would that negatively impact compile times too much?
After a different syntax. I've seen many places where ASCII integer literals are used, but that might be an artefact of viewing words with see, which does not show the code executed at compile time; that code might have 'proper' character literals.
How about modelling it after string literal syntax? 'a', ' ', '\u{snowman}', '☃'
Further complicated by whether it refers to a single code point or a glyph.
On Fri, Dec 9, 2022 at 7:11 AM Giftpflanze wrote:
How about modelling it after string literal syntax? 'a', ' ', '\u{snowman}'
Couldn't that easily be remedied by checking if there are 2 or more codepoints between the quotes and not parsing if there are?
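As a rough illustration of that check, the sketch below (Python, not Factor) resolves a string-style \u{name} escape and then refuses to treat the token as a character literal unless exactly one code point remains between the quotes. The function name parse_char_literal is hypothetical, and only Unicode character names are handled inside the braces, which is an assumption made to keep the example short.

```python
import re
import unicodedata

# Assumed escape form: \u{NAME}, where NAME is a Unicode character name.
_ESCAPE = re.compile(r"\\u\{([^}]*)\}")

def _resolve(match):
    # unicodedata.lookup raises KeyError for unknown names.
    return unicodedata.lookup(match.group(1))

def parse_char_literal(token):
    """Return the code point for a quoted literal, or None if it is not one."""
    if len(token) < 3 or token[0] != "'" or token[-1] != "'":
        return None
    text = _ESCAPE.sub(_resolve, token[1:-1])
    # Zero, two or more code points between the quotes: not a character literal.
    if len(text) != 1:
        return None
    return ord(text)
```

Returning None instead of raising leaves the caller free to fall back to treating the token as an ordinary word name, or to report an error, depending on which policy is chosen.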
Sure, I’m just saying we probably want literal glyph syntax, too? Or I suppose that could just be a string of length glyphs=1. Strings should have iterators over glyphs, code points, and encodings.
On Dec 9, 2022, at 7:38 AM, Giftpflanze wrote:
Couldn't that easily be remedied by checking if there are 2 or more codepoints between the quotes and not parsing if there are?
Personally, I would like the idea of C" for character literals. The double quote emphasizes the relation to strings, and it's similar to the already-existing P" syntax. I feel like character literals aren't common enough to dedicate ' to; it would be nice to have ' available for other purposes, or left open for the user to define.
That would look like:
C"a"
C"☃"
C"snowman"
C" "
C"space"
C"unicodename" looks a bit too much like a string, no? I think the Unicode name needs to look more like an identifier (not in quotes) or have a backslash next to it (universally recognized as a big warning that it's not a plain string), like in Gifti's suggestion of '\u{snowman}'.
Also, is it clear to everyone what this syntax is supposed to return? Is it a number (a Unicode code point)? (I guess you mean Unicode code point when you say UTF-8 code point, as UTF-8 is just one possible encoding of Unicode.) Or do you want to create a new type or abstraction for graphemes or lists of code points: something that represents a single unit of display?
To keep things simple, I think that what this syntax calls "characters" should just be a single number: a Unicode code point. Anything else is a string?
PS: a recap of definitions, just to make communication easier:
- Character is an overloaded term that can mean many things.
- A code point is a number which is given meaning by the Unicode standard.
- A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit.
From https://stackoverflow.com/a/27331885/
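The distinction between these terms can be seen in a few lines of Python (used here only for illustration; the helper names code_points and utf8_length are made up for this sketch). One displayed unit may be one code point or several, and the UTF-8 byte count is a property of the encoding, not of the code point.

```python
import unicodedata

def code_points(text):
    """List the Unicode code points (integers) that make up a string."""
    return [ord(ch) for ch in text]

def utf8_length(text):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

snowman = "☃"                 # one code point, U+2603
decomposed = "e\u0301"        # "é" as letter e plus combining acute accent
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9
```

Here snowman is one code point that takes three bytes in UTF-8, while decomposed displays as a single grapheme but consists of two code points. A literal syntax that returns "a single number" can therefore represent snowman and composed, but not decomposed.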
Personally, I would like the idea of C" for character literals. The double quote emphasizes the relation to strings, and it's similar to the already-existing P" syntax. I feel like character literals aren't common enough to dedicate ' to; it would be nice to have ' available for other purposes, or left open for the user to define.
I understand your concern about being careful not to waste "special" characters, and I agree with that concern.
However, the whole point of discussing an alternative to CHAR: is to have something more terse, and C"-" is only two characters less...
I would prefer ' for now, hence ' a, and would think that with a new parser 'a' would also be fine. The latter would not waste special characters, because it's fully parsed, leaving prefixes with ' free for other purposes.
I’m not sure the future parser would have a mode to allow ' a and 'a' to be different forms, but I guess anything is possible. Seems like more frequently that would be a user error, and providing better error messages would be more useful than supporting both syntax forms.
On Jan 5, 2023, at 1:22 AM, nomennescio wrote:
Personally, I would like the idea of C" for character literals. The double quote emphasizes the relation to strings, and it's similar to the already-existing P" syntax. I feel like character literals aren't common enough to dedicate ' to; it would be nice to have ' available for other purposes, or left open for the user to define.
I understand your concern about being careful not to waste "special" characters, and I agree with that concern. However, the whole point of discussing an alternative to CHAR: is to have something more terse, and C"-" is only two characters less... I would prefer for now ', hence ' a, and would think with a new parser 'a' would also be fine. The latter would not waste special characters, because it's fully parsed, leaving prefixes with ' free for other purposes.
However, the whole point of discussing an alternative to CHAR: is to have something more terse, and C"-" is only two characters less...
My thinking is also: how often are character literals really used in regular code? Is it enough to warrant dedicating a single-character shortcut to? In my experience, I don't use character literals very often, even when working in string-centric code.
Indeed, in the Factor source, CHAR: occurs in approximately the same order of magnitude of lines as P":
/usr/lib/factor$ grep -ir --include '*.factor' 'CHAR:' * | wc -l
1920
/usr/lib/factor$ grep -ir --include '*.factor' 'P"' * | wc -l
1206
Of course, it's just my opinion, and I will continue to use Factor either way.
My thinking is also: how often are character literals really used in regular code? Is it enough to warrant dedicating a single-character shortcut to? In my experience, I don't use character literals very often, even when working in string-centric code.
/usr/lib/factor$ grep -ir --include '*.factor' 'CHAR:' * | wc -l
1920
Almost 2000 times is still a lot, and in some places literal integers are possibly used instead of CHAR:, but in my opinion it's also about making the language more convenient for people coming from other languages, where character literals are so trivial that nobody even thinks about them until they're confronted with CHAR:.
With a new parser, parsing 'a' would not dedicate a single char shortcut. However, to be fair, what other use could the ' shortcut have in Factor? We already have \, and support strings with ", so what would be a non-confusing use of '?
I’m saying it would probably be better to have a “non-terminated character” error than support both types of syntax:
' foo
'a'
I dunno, since this is all theoretical and I’m unsure about dedicating single quotes to a code point feature, thinking still about syntax options.
On Jan 6, 2023, at 2:18 AM, nomennescio wrote:
My thinking is also: how often are character literals really used in regular code? Is it enough to warrant dedicating a single-character shortcut to? In my experience, I don't use character literals very often, even when working in string-centric code. /usr/lib/factor$ grep -ir --include '*.factor' 'CHAR:' * | wc -l 1920
Almost 2000 times is still a lot, but in my opinion it's also about making the language more convenient for people coming from other languages, where character literals are so trivial, nobody even thinks about them, until they're confronted with CHAR:. With a new parser, parsing 'a' would not dedicate a single char shortcut. However, to be fair, what other use could the ' shortcut have in Factor? We already have \, and support strings with ", so what would be a non-confusing use of '?
I'm not proposing to have both types of syntax at the same time. Just that one is dependent on the new parser, which is not yet there. Even then both options can make sense, but I indeed only want to pick one.
Oh, I get it. I think either example syntax could be supported in the current parser pretty easily.
On Jan 6, 2023, at 7:27 AM, nomennescio wrote:
I'm not proposing to have both types of syntax at the same time. Just that one is dependent on the new parser, which is not yet there. Even then both options can make sense, but I indeed only want to pick one.