carbon-lang Character Literals (#1934)


Put character literals in single quotes, like 'a'. Character literals work like numeric literals:

Every different literal value has its own type.
The bit width is determined by the type of the variable the literal is assigned to, not by the literal itself.

Follows the plan from #1934.
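As a rough analogy only (this is C++, not Carbon syntax), the "every literal value has its own type" idea can be sketched with a template whose parameter is the code point. `CharLiteral` and `As` are hypothetical names invented for this illustration; the width is chosen by the destination type, and a value that does not fit is rejected at compile time.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical model: the code point is part of the type, so each distinct
// literal value gets a distinct type, similar in spirit to numeric literals.
template <char32_t Value>
struct CharLiteral {
  // The destination type T decides the bit width. The conversion fails to
  // compile when the code point does not fit in T.
  template <typename T>
  constexpr T As() const {
    static_assert(static_cast<std::uint64_t>(Value) <=
                      static_cast<std::uint64_t>(std::numeric_limits<T>::max()),
                  "code point does not fit in the destination type");
    return static_cast<T>(Value);
  }
};

// Different values, different types.
static_assert(!std::is_same<CharLiteral<U'a'>, CharLiteral<U'b'>>::value,
              "distinct literal values should have distinct types");
```

Under this sketch, `CharLiteral<U'a'>{}.As<std::uint8_t>()` yields 97, while `CharLiteral<U'\u20AC'>{}.As<std::uint8_t>()` would not compile because U+20AC does not fit in 8 bits.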

Aug 08 '22 22:08 cabmeurer

@opelolo let me know when you want to start collaborating on this

Aug 09 '22 21:08 cabmeurer

Thank you for tackling this! I know I'm making a lot of suggestions, but that is totally normal for someone's first proposal. Writing proposals is definitely not easy, and it isn't always clear what we are looking for in them.

@josh11b No problem! I'm excited to learn and contribute where I can, I will start working on the fixes you suggested. Thank you for all your help!

Sep 03 '22 19:09 cabmeurer

Could you avoid force pushing everything into a single commit? It makes it harder to review just what has changed.

Sep 06 '22 23:09 josh11b

Could you avoid force pushing everything into a single commit? It makes it harder to review just what has changed.

Looks like @jonmeow explained it better here: https://github.com/carbon-language/carbon-lang/pull/2022#issuecomment-1239070023

Just a tip regarding force-pushes though -- those can break GitHub's comment associations, and it makes it harder to determine what's changed since the last review (with regular pushes, GitHub can show a delta). We're going to squash-and-merge PRs anyways, so if you use regular pushes the end result will actually be the same and it can make it easier to review. :)

Sep 07 '22 20:09 josh11b

@zygoloid @josh11b Thank you for all the help and feedback! Sorry about the delay, I have a lot going on this week but will continue to work on your comments/suggestions. Thank you for being so patient!

Sep 14 '22 00:09 cabmeurer

@josh11b @zygoloid Thank you so much for all the help! I enjoyed working on this and have learned so much about language design and the proposal process. I am looking forward to contributing further to Carbon and becoming more involved in the community. I really appreciate everything you two have done, it's exciting and truly motivating to see the passion, care, and attention to detail you both bring to the project, thank you again!

Oct 22 '22 01:10 cabmeurer

I feel like there are two different design paths for character literals that we should decide between and articulate a bit more clearly than the current proposal does. There are several minor comments that I have on this proposal as-is, but for them to make sense we really need to pick what model we're using. Right now, I feel like the proposal is somewhat of a mix of two models.

Text-fragment or grapheme-cluster model

One option is to steer (sharply) in the direction of "character" not making much sense as a construct in a world with Unicode encodings of text. What humans tend to think of as a "character" is more a "grapheme cluster" that is an arbitrarily long and complex encoding.

This is a very Unicode- and text-friendly model. But IMO, it makes thinking of characters as integers at all very awkward and surprising.

I think that if we want to go down this path, we should probably use separate spellings to represent a small set of operations that are today expressed with integer-based math on C++'s character literals -- things like converting an integer between 0 and 9 into the corresponding digit character (in C++, '0' + n), or computing the difference between two digits (or two other characters).
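To make the "separate spellings" idea concrete, here is a minimal C++ sketch in which the arithmetic idioms get names of their own. `DigitToChar` and `DigitValue` are hypothetical names chosen for this illustration, not anything from the proposal.

```cpp
#include <cassert>
#include <optional>

// Named replacement for the C++ idiom '0' + n.
// Precondition: n is in the range 0..9.
inline char32_t DigitToChar(int n) {
  assert(n >= 0 && n <= 9);
  return static_cast<char32_t>(U'0' + n);
}

// Named replacement for the C++ idiom c - '0'.
// Returns no value when c is not an ASCII decimal digit.
inline std::optional<int> DigitValue(char32_t c) {
  if (c < U'0' || c > U'9') {
    return std::nullopt;
  }
  return static_cast<int>(c - U'0');
}
```

With these, the difference between two digits becomes `*DigitValue(a) - *DigitValue(b)` once both are known to be digits, and no character is ever treated as a bare integer at the call site.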

And we should own the fact that these are essentially string literals with two fairly minor and superficial differences:

A statement of intent by the programmer that it is used in a "character"-y way.
A type with some augmented APIs such as to handle the few common cases where C++ uses arithmetic.

And then we should provide a way to extract an actual integer from a codepoint in such a literal to handle esoteric needs.
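A small C++ sketch of what such a type might look like under the text-fragment model: storage that is essentially a string, plus an explicit escape hatch to get at an integer. `CharFragment`, `CodePointCount`, and `CodePointAt` are all hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>

// Hypothetical text-fragment "character": string-like storage that may hold
// several code points (for example a base letter plus a combining mark).
class CharFragment {
 public:
  explicit CharFragment(std::u32string code_points)
      : code_points_(std::move(code_points)) {}

  std::size_t CodePointCount() const { return code_points_.size(); }

  // The escape hatch: the integer value of the code point at `index`,
  // or no value when `index` is out of range.
  std::optional<std::uint32_t> CodePointAt(std::size_t index) const {
    if (index >= code_points_.size()) {
      return std::nullopt;
    }
    return static_cast<std::uint32_t>(code_points_[index]);
  }

 private:
  std::u32string code_points_;
};
```

For example, `CharFragment(U"e\u0301")` (the letter e followed by a combining acute accent) holds two code points, and only `CodePointAt` exposes them as integers.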

Integer-like model

Another option is to embrace that when written as a literal and not as a string literal, character literals are expected to be an attempt to conjure an integer value that corresponds to some part of Unicode's encoding.

Here, code points actually make sense -- the entire intent is to write something integer-y, and that's what they are. The fact that humans and natural language will want something less integer-y and more grapheme-cluster-y for their "character" isn't relevant because you would just use string literals for that or anything else that wasn't explicitly trying to be integer-like.

Here, I think the ordered comparisons and arithmetic all make sense. Much of the current proposal seems closer to this model.

But I think we can and should really embrace this model if it is the one we want, and have these be explicitly restricted to codepoints and documented to come burdened with all the limitations therein. Code should use string literals when these restrictions aren't OK or aren't helpful.
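For comparison, a minimal C++ sketch of the integer-like model: a value that is exactly one code point, with ordering and arithmetic defined directly on the integer. `CodePoint` and `IsInUnicodeRange` are hypothetical names for this illustration, and the range check is deliberately limited to the upper bound U+10FFFF.

```cpp
#include <cstdint>

// Hypothetical wrapper: exactly one code point, stored as an integer.
struct CodePoint {
  char32_t value;
};

// Code points run from U+0000 to U+10FFFF.
constexpr bool IsInUnicodeRange(CodePoint c) { return c.value <= 0x10FFFF; }

// Ordered comparison is plain integer comparison.
constexpr bool operator<(CodePoint a, CodePoint b) { return a.value < b.value; }
constexpr bool operator==(CodePoint a, CodePoint b) { return a.value == b.value; }

// Difference between two code points, as a signed integer.
constexpr std::int32_t operator-(CodePoint a, CodePoint b) {
  return static_cast<std::int32_t>(a.value) - static_cast<std::int32_t>(b.value);
}

// Offsetting a code point by an integer, as in the C++ idiom '0' + n.
constexpr CodePoint operator+(CodePoint c, std::int32_t offset) {
  return CodePoint{
      static_cast<char32_t>(static_cast<std::int32_t>(c.value) + offset)};
}
```

Anything that needs more than one code point (a grapheme cluster, say) simply does not fit this type, which is where string literals would take over.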

Personally, I see merit in both of these. I have historically drifted toward the text-fragment approach, but I can really see the appeal of just being very explicit about using code points and integers. I think I actually lean towards that specifically because it feels slightly closer to C++ and because we already have string literals. But I think sorting this out is the first step for me.

Nov 09 '22 22:11 chandlerc

@chandlerc and I chatted about this a bit. It's not completely clear which model is best. In particular, the integer-like model loses us the ability to write code unit literals that aren't code point literals, and it's unclear to what extent that would be a concern in practice, but on the other hand, the integer-like model is simpler and avoids needing to distinguish between single-code-point character literals and the more general case.

We'd like to start with the integer-like model and see how that goes, leaving the door open to switching to the more general model down the line if we find there is a significant need for the additional functionality and attendant complexity. Compared to this proposal, that means:

For now, we reject character literals that don't contain exactly one code point.
For now, we disallow \x escapes in character literals, because they don't seem especially useful and we'd like to encourage \u escapes instead.
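The two restrictions above could be checked roughly as follows. This is only a C++ sketch under stated assumptions: the input is the text between the quotes and is well-formed UTF-8, the `\u` escape is assumed to take the brace form `\u{...}`, and the set of simple escapes is a guess. `CharLiteralCheck`, `CountCodePoints`, `IsSingleUnicodeEscape`, and `CheckCharLiteralBody` are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

enum class CharLiteralCheck {
  kOk,
  kHexEscapeDisallowed,
  kNotExactlyOneCodePoint,
};

// Counts code points in well-formed UTF-8 by counting every byte that is
// not a continuation byte (continuation bytes look like 10xxxxxx).
inline std::size_t CountCodePoints(std::string_view utf8) {
  std::size_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) {
      ++count;
    }
  }
  return count;
}

// True when the whole body is one escape of the assumed form \u{HEX...}.
inline bool IsSingleUnicodeEscape(std::string_view body) {
  if (body.size() < 5 || body.substr(0, 3) != "\\u{" || body.back() != '}') {
    return false;
  }
  for (std::size_t i = 3; i + 1 < body.size(); ++i) {
    if (!std::isxdigit(static_cast<unsigned char>(body[i]))) {
      return false;
    }
  }
  return true;
}

// Malformed or multi-escape bodies are reported as kNotExactlyOneCodePoint.
inline CharLiteralCheck CheckCharLiteralBody(std::string_view body) {
  if (!body.empty() && body[0] == '\\') {
    if (body.size() >= 2 && body[1] == 'x') {
      return CharLiteralCheck::kHexEscapeDisallowed;
    }
    if (IsSingleUnicodeEscape(body)) {
      return CharLiteralCheck::kOk;
    }
    // Assumed set of simple escapes: \n \r \t \0 \' \" and a backslash.
    const bool is_simple_escape =
        body.size() == 2 &&
        std::string_view("nrt0'\"\\").find(body[1]) != std::string_view::npos;
    return is_simple_escape ? CharLiteralCheck::kOk
                            : CharLiteralCheck::kNotExactlyOneCodePoint;
  }
  return CountCodePoints(body) == 1 ? CharLiteralCheck::kOk
                                    : CharLiteralCheck::kNotExactlyOneCodePoint;
}
```

Note that a letter followed by a combining mark counts as two code points here, so it would be rejected as a character literal and would need a string literal instead.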

With those changes, I think we're both happy moving forward with this. I think it's fine for this proposal to still describe the general model, and to say somewhere early on that we're restricting to the single-code-point case so some of the later description covers cases that are not currently possible, rather than reworking the whole proposal on this basis.

Nov 09 '22 22:11 zygoloid

(Also I know there has been some design direction churn here as we've explored the full consequences of the different designs, and thanks for sticking with all of that! Language feature design is ... really challenging, in part because we have to explore pretty big spaces and work out which of many possible directions ends up working best in practice.)

Nov 16 '22 09:11 chandlerc

(Also I know there has been some design direction churn here as we've explored the full consequences of the different designs, and thanks for sticking with all of that! Language feature design is ... really challenging, in part because we have to explore pretty big spaces and work out which of many possible directions ends up working best in practice.)

No worries! This is awesome, language design has been a space I've been very interested in for a while and I'm excited to be a part of it. It's a great experience, the community is amazing, and I'm learning a lot from everyone. Thank you for all the help and support!

Nov 22 '22 02:11 cabmeurer

Talking with @josh11b, it seems like at least he, @zygoloid, and I are not actually aiming at the same high-level direction / strategy for how to approach this.

I think that is ultimately causing some of the oscillation as we bounce back and forth on this PR, and sorry for that.

I think we need to get aligned on the goal here before we can realistically converge on the specific design. Especially between the leads. =D

I'd suggest trying to sync up among at least the three folks I've mentioned here, and ideally with @cabmeurer as well. Would Jan 5th (tomorrow) at 1pm PT / 4pm ET work for the relevant folks? If so, we can use the open discussion session for this. Maybe let's follow up on the #text Discord channel to finalize scheduling a time for us to talk through the strategy.

Jan 05 '23 01:01 chandlerc

@chandlerc could you take another look at this when you have a chance?

May 11 '23 00:05 cabmeurer

@zygoloid could you take a look at this when you have a chance?

May 26 '23 21:05 cabmeurer

FYI both @zygoloid and @chandlerc have been on vacation (I think Chandler still is), so it may take them a little while to get to this.

May 30 '23 17:05 geoffromer

Thank you everyone!

Jun 15 '23 21:06 cabmeurer
