$_r_-a_-k_u-_'_s-_ identifiers
https://docs.raku.org/syntax/identifiers says:
An ordinary identifier is composed of a leading alphabetic character which may be followed by one or more alphanumeric characters. It may also contain isolated, embedded apostrophes ' and/or hyphens -, provided that the next character is each time alphabetic.
I suggest this as more nearly correct:
An ordinary identifier is composed following two rules, and from four kinds of characters. The character kinds are:
- alphabetic characters,
- numeric characters,
- underscore,
- and the separators which are the apostrophe and the hyphen.
The rules:
- Each separator separates two sections.
- Every section must start with an alphabetic character or underscore.
That empty sections are not allowed is implied, and might be made explicit. <alpha> is not the same as alphabetic. The topic var supersedes the above.
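To make the two rules concrete, here is a minimal sketch that checks a candidate by splitting it into sections at each separator. The sub name follows-section-rules is hypothetical, and this is a model of the proposed wording, not Rakudo's actual parser.

```raku
# Hypothetical helper (not part of Raku): checks a candidate against the
# two proposed rules by splitting it into sections at each separator.
sub follows-section-rules(Str $candidate --> Bool) {
    # Split at every apostrophe or hyphen. Empty strings are kept, so a
    # leading, trailing, or doubled separator produces an empty section.
    my @sections = $candidate.split(/ <[ ' \- ]> /);

    # A section is bad unless it starts with <alpha> (letter or underscore)
    # and continues with word characters only.
    my @bad = @sections.grep: { $_ !~~ / ^ <alpha> \w* $ / };
    return @bad.elems == 0;
}

say follows-section-rules($_) ?? "valid:   $_" !! "invalid: $_"
    for <foo-bar _x'y don't foo--bar foo-1 -foo foo->;
```

Under these assumptions the first three candidates pass and the last four fail, because each of the failing ones contains a section that is empty or starts with a digit.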
Presuming that Rakudo behavior is correct, the passage quoted is wrong.
I started this issue as just a language clean up. I subscribe to the "people don't like to read" proposition. I would prefer that the first sentence more strongly motivate the reader to read further.
During that effort I noticed that either the docs are wrong or Rakudo is wrong.
The major nit is that a learner is very apt to want to create identifiers long before they care to learn about regexes and grammars. So conflating <alpha> with alphabetic or <alnum> with alphanumeric is wrong.
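A short check of that mismatch, as a sketch: the underscore is the character on which the regex rules and the everyday words disagree. Only built-in character classes are used here.

```raku
# <alpha> and <alnum> are regex rules; "alphabetic" and "alphanumeric" are
# everyday words. They disagree on the underscore.
say so '_' ~~ / ^ <alpha> $ /;   # True:  <alpha> accepts the underscore
say so '_' ~~ / ^ <:L> $ /;      # False: "_" is not a Unicode letter
say so '_' ~~ / ^ <alnum> $ /;   # True:  <alnum> accepts it as well
say so '7' ~~ / ^ <alnum> $ /;   # True:  a digit is alphanumeric in both senses
```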
My phrase "topic var supersedes the above" was a waste of words.
Perhaps:
An ordinary identifier is a group of characters. The group is a leading underscore or alphabetic character which may be followed by one or more underscores or alphanumeric characters. It may also be multiple such groups separated by single apostrophes ' or hyphens -.
Maybe?
An ordinary_identifier is one or more characters. Beyond single-character ordinary_identifiers (which are alphabetic but not underscore, i.e. <alpha> minus _, or more simply <:L>), characters are composed into groups wherein a leading underscore or alphabetic character (i.e. <alpha>+) may be followed by zero or more underscores or alphanumeric characters (i.e. <alnum>*). Even more expressively, an ordinary_identifier may also be multiple such <alpha>+ <alnum>* groups separated by a single apostrophe ' or hyphen -, leading to quite natural and concise variable identification.
Note (from the docs): Somewhat confusingly, a predefined regex <ident> is known as a "Basic identifier" and is found at https://docs.raku.org/language/regexes#Predefined_Regexes . It has no support for ' or -:
<ident> Basic identifier (no support for ' or -). Same as <.alpha> \w*
Do we (should we?) distinguish a basic_identifier from an ordinary_identifier ?
https://docs.raku.org/language/regexes#Predefined_Regexes
<ident> Basic identifier (no support for ' or -). Same as <.alpha> \w*
A basic_identifier is one or more characters. Beyond single-character basic_identifiers (which are alphabetic but not underscore, i.e. <alpha> minus _, or more simply <:L>), characters are composed into one-or-more groups wherein a leading underscore or alphabetic character (i.e. <alpha>+) may be followed by zero or more underscores or alphanumeric characters (i.e. <alnum>*). Conceptually one may think of basic_identifiers as [<alpha>+ <alnum>*]+, which includes various "camelCase" and "snake_case" forms.
Even more expressively, an ordinary_identifier may also be multiple such <alpha>+ <alnum>* groups separated by a single apostrophe ' or hyphen -, leading to quite natural and concise variable identification.
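The difference between the two kinds can be tried out with a small sketch. The token name ordinary-identifier is hypothetical; it models "built-in <ident> sections joined by single separators" as described in this thread, and is not claimed to be Rakudo's own definition.

```raku
# Hypothetical token: one <ident> section, optionally followed by more
# sections, each introduced by a single apostrophe or hyphen.
my token ordinary-identifier { <.ident> [ <[ ' \- ]> <.ident> ]* }

for <snake_case camelCase foo-bar don't foo--bar> -> $name {
    # Anchor both patterns so the whole string has to match.
    my $basic    = ?($name ~~ / ^ <ident> $ /);
    my $ordinary = ?($name ~~ / ^ <ordinary-identifier> $ /);
    say "$name: ident=$basic ordinary=$ordinary";
}
```

In this model snake_case and camelCase satisfy both patterns, foo-bar and don't satisfy only the hypothetical token, and foo--bar satisfies neither.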
Beyond single-character ordinary_identifiers (which are alphabetic but not underscore, i.e. <alpha> minus _, or more simply <:L>)
I don't believe that the second part of that is correct: _ can be an identifier by itself, for example in my \_ = 42 or sub _ { }. ($_, @_, and %_ are already taken, but that doesn't mean that _ is an invalid identifier.) So I don't think we need to describe single-characters as a special case at all, which simplifies things a bit.
Wow, I never would have guessed.
I'm not sure I agree with that design decision (presuming it was a conscious decision), to re-make the entire language and still allow something like my \_ = 42 or sub _ { }. I would have guessed my \_ would have been disallowed, or possibly reserved. Same with sub _ { }.
But I better stop here, before I offend someone.
I'm not sure I agree with that design decision (presuming it was a conscious decision), to re-make the entire language and still allow something like
my \_ = 42
I agree that my \_ = 42 is a bad idea – but we also allow even worse/more unreadable identifiers such as my \ᱹ = 42 (that's \x[1C79] in case it doesn't display properly in your font – a glyph that represents a letter according to Unicode). The thing about ᱹ and _ is that they're obviously bad choices for single-character variable names in virtually all cases, so I'm not too bothered by the fact that Raku allows them. That's especially true because forbidding _ as a single character variable would add complexity to the rules for what characters are allowed (as we've just been discussing).
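Anyone who wants to check how Unicode classifies that glyph can ask Raku directly. This sketch only uses the built-in uniname and uniprop methods and the <:L> and <ident> matchers; the output is left for the reader to run.

```raku
my $char = "\x[1C79]";
say $char.uniname;                  # the Unicode name of the code point
say $char.uniprop;                  # its Unicode general category
say so $char ~~ / ^ <:L> $ /;       # whether Unicode classes it as a letter
say so $char ~~ / ^ <ident> $ /;    # whether it passes as a one-character <ident>
```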
Personally, I'm kind of glad Raku keeps the rules fairly simple and then trusts users not to abuse them other than as a prank.
FYI, I did a somewhat extensive analysis of <ident> and "identifier" in this SO answer. The first section may be of interest starting at:
The built in ident rule does precisely the same as if it were declared as ...
and the section "The rest of this answer" provides a ToC to orient readers on what might be of interest in the rest of the answer.
My primary complaint against this doc is that alphanumeric and alphabetic do not equate to <alnum> and <alpha>.
The bare fix:
An ordinary identifier is composed of a leading alphabetic or underscore character which may be followed by one or more alphanumeric and/or underscore characters. It may also contain isolated, embedded apostrophes ' and/or hyphens -, provided that the next character is each time alphabetic or the underscore.
Or, trying for a smoother ride and attempting to be less ASCII:
An ordinary identifier can be composed in two ways. One form is a letter or underscore which may be followed by more letters, underscores, and/or digits. The other form is multiple short forms separated by isolated apostrophes ' and/or hyphens -.