[Proposal]: Only Allow Lexical Keywords in the Language
- [x] Proposed
- [ ] Prototype: Not Started
- [ ] Implementation: Not Started
- [ ] Specification: Not Started
Summary
Today there are keywords in the language that cannot be understood with just lexical information, such as `var` or `nameof`. These keywords operate this way for backwards compatibility reasons. I propose that we change the language so that there exist no keywords that cannot be determined via lexical analysis.
Motivation
In general, C# strives to be a language that is explicit about program behavior and normally requires the developer to write out what their intention is without ambiguities. This makes it a language that is easy for someone to read and understand. Once you learn it there is little implicit behavior to consider. I believe having this "gotcha" where keywords are only keywords if nothing else in scope is so-named makes the language harder to read and reason about in general.
There is also the reality that the design goals around C# Language versions and .NET Framework versions have changed. In the past it was paramount that developers could take a new language version update without updating the framework version they were targeting. With language features being increasingly tied to the runtime this makes less sense. We now strongly encourage developers to update both the language version and target framework together.
Detailed design
There is an existing concept in the language called "contextual keywords". I am not proposing doing away with this concept altogether, just changing it so that a keyword's "contextual-ness" can always be determined lexically. Take the new (at the time of this writing) keyword `record`. We can still know if we are referring to the `record` keyword or some identifier named `record` based on the lexical context; there is no ambiguity. However `var`, according to the spec, requires us to check if there are types named `var` in scope:
> In the context of a local variable declaration, the identifier var acts as a contextual keyword. When the local_variable_type is specified as var and no type named var is in scope,
Similarly, `nameof` requires checking whether there are any identifiers called `nameof` in scope:
> Because nameof is not a reserved keyword, a nameof expression is always syntactically ambiguous with an invocation of the simple name nameof. For compatibility reasons, if a name lookup of the name nameof succeeds, the expression is treated as an invocation_expression -- regardless of whether the invocation is legal. Otherwise it is a nameof_expression.
The implementation of this proposal would remove wording from the spec around name lookup collisions, and have a compliant compiler be able to fully determine keywords given only parsing information.
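To make the two spec passages quoted above concrete, here is a minimal sketch of the current name-lookup behaviour as I understand it. The names `VarLookupDemo`, `UserNameofDemo`, `BuiltInNameofDemo` and `Run` are made up for illustration, and the nested class named `var` and the method named `nameof` are exactly the kind of declarations this proposal would turn into errors.

```csharp
using System;

public static class VarLookupDemo
{
    // Hypothetical type literally named "var". Recent compilers may warn
    // about an all-lowercase type name, but it is accepted today.
    private sealed class var
    {
        public int Value { get; set; }

        public static implicit operator var(int value) => new var { Value = value };
    }

    public static string Run()
    {
        // A type named "var" is in scope, so this is NOT implicit typing:
        // x has the nested type above and 5 goes through the conversion.
        var x = 5;
        return x.GetType().Name;
    }
}

public static class UserNameofDemo
{
    // Hypothetical user method that happens to be called "nameof".
    private static string nameof(object value) => "user method";

    public static string Run()
    {
        int count = 0;
        // Name lookup finds the method above, so this is an ordinary invocation.
        return nameof(count);
    }
}

public static class BuiltInNameofDemo
{
    public static string Run()
    {
        int count = 0;
        // Nothing named "nameof" is in scope here, so this is a nameof expression.
        return nameof(count);
    }
}
```

The same token sequence (`var x = 5;` or `nameof(count)`) means different things depending on what else is declared nearby, which a parser alone cannot see.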
The following keywords would now error if developers attempted to use them as anything other than a keyword:
- `var`
- `nameof`
- `dynamic`
- `_`
Drawbacks
This is a breaking change: if anyone were relying on this behavior in their code, it would no longer compile. For cases where a type is named `var`, `dynamic`, or `_`, or a method is called `nameof`, the developer would need to change the usages to `@var`, `@nameof`, `@dynamic`, or `@_`.
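As a rough sketch of what that migration could look like, the code below uses verbatim identifiers (the `@` prefix), which already work in the language today. The type `@var` and the `EscapeDemo` class are hypothetical names chosen only for this example.

```csharp
using System;

// Hypothetical type whose name is "var"; the @ prefix marks a verbatim identifier.
public class @var
{
    public int Value { get; set; }
}

public static class EscapeDemo
{
    public static int Run()
    {
        // @var always refers to the type above, never to implicit typing,
        // so the reader can tell the two apart without any name lookup.
        @var boxed = new @var { Value = 42 };
        return boxed.Value;
    }
}
```

The `@` at each use site is the explicit signal that an identifier, not a keyword, is intended.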
Alternatives
We could opt to keep `_` as a contextual keyword that depends on name lookup rules, as this is the change most likely to break real-world programs (see discussion on https://github.com/dotnet/csharplang/issues/1064).
Unresolved questions
Design meetings
https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-09-28.md#ungrouped
https://github.com/dotnet/csharplang/blob/main/meetings/2024/LDM-2024-09-06.md#only-allow-lexical-keywords-in-the-language
Previous discussion: https://github.com/dotnet/csharplang/discussions/4458.
I feel that the breaking changes that would be introduced by adopting this proposal on its own fall into three categories:
- "Nobody is broken." Almost nobody uses `var`, `dynamic` or `nameof` as identifiers in C#; it's fine to break the tiny number of people who do.
- "Some people are broken." The pattern where `_` is the name of a used lambda parameter (e.g. `_ => _.Name`) is fairly rare, but not unheard of in C#. I think it's probably acceptable to break this kind of code, but ways of softening the blow should be seriously considered (e.g. a code fix to rename such parameters; or warning in C# 10 and only making it an error in C# 11).
- "Lots of people are broken." The pattern where `_` is the name of an unused lambda parameter (e.g. `_ => {}`) is very common in C# and it is completely unacceptable to break such code. The obvious solution would be to change the meaning of that code to make the `_` a discard. But I think it's important to note that this additional change would be required, assuming this proposal is not meant to be a massive breaking change.
- Will this break analyzers/codefixes?
@svick looking at github we have:
While I am only proposing removing name lookup, and `var _ = GetResults();` is not lexically ambiguous with `_ = GetResults();`, there could potentially be odd errors with `_` in this case. I am willing to say we keep the name lookup rules for `_` if there are concerns.
> Will this break analyzers/codefixes?

That is entirely dependent on the compiler implementation, but new language versions are always allowed to break analyzers.
Consider when expression-bodied-members were added. If you previously assumed (not unreasonably) in your analyzer that all methods with a body contained a block syntax you were broken since now the body could just be an arrow expression.
See: https://github.com/dotnet/csharplang/discussions/4466
Conflating the two was always going to create a lot of confusion, and the parser that gets broken/confused the most is the human parser.
I've switched my position on https://github.com/dotnet/csharplang/issues/1064 (disallowing `_` as an identifier) from downvote to upvote. Sure, it would break the past years of my code in which I used to use this for lambda parameters, but ship a solution-wide light bulb fix for it and I'd want to use that light bulb fix anyway to replace my usages of `_`, even if I wasn't forced to. Opting into a new major version of C# feels like an expected time for something like this to happen.
Maybe soft-deprecating by adding a new compiler warning in C# 10 that tells you that `_` as an identifier will be disallowed starting in C# 11 would make this seem less abrupt.
I agree with @HaloFour. I think the human parser should be the most important factor. While unlikely, the possibility of code like this should make us uneasy:
    public class Foo
    {
        private int _;
        // Doing something important in some other file that is affected by reading Foo.WrappedValue?
        public int WrappedValue => _;
        public bool IsNumber(string input)
        {
            return double.TryParse(input, out _); // oops!
        }
    }
(stolen from http://gafter.blogspot.com/2017/06/making-new-language-features-stand-out.html?showComment=1509474504510#c7458806139970524286)
I hope this proposal goes nowhere. I like the underscore usage because it makes the code less boring. As a writer, I like to use dashes, colons, semicolons, etc. to make my writing more interesting, just like an `_` makes the code more interesting, though it's rarely used. I want C# to be an intermediate-level language; developers having trouble with `var` and the like should look at Lua or similar.
Java, which is used by a larger (and some would argue more resistant to change) community, seemingly has had little/no issue with the language deprecating and then disallowing the use of `_` as an identifier or `var` as a type name. They've done this in the past as well with names like `assert`. Usually it doesn't matter as the contextual keyword wouldn't be expected to be used as a type name.
I've been pretty outspoken against the use of `_` as a discard as well as an identifier. Since the ship has sailed on discards I think its use as an identifier should be reconsidered. It sounds good on paper to avoid reinterpreting/breaking any existing code, but now the language has this wart where developers need to remember which combination of features will cause the compiler to prefer `_` as an identifier vs. where the compiler will always consider it a discard, and where `_` is preferred to be an identifier the developer has to wade past the type checks on these "variables" that the developer never intended to actually use. The case of accidentally overwriting some field name might be pathological, but the mental burden on the developer will still always be there. I would've much preferred if the compiler phased out `_` as an identifier over a few releases, with fixers to replace it with another identifier, and then switched it to a discard wholesale. Names are cheap. Contextual keywords that change their meaning based on nuanced use of other language features and where the contexts are very likely to collide are not.
My views are basically the same as @HaloFour 's. I think semantically contextual keywords make a lot of sense in theory but add developer complexity and overhead for little benefit.
> for little benefit.

I think this is very debatable. Consider the work we're doing to support `field` inside properties in C# 10. If we make these keywords and not contextual keywords, we simply break people (including ourselves). And we break them despite them having done nothing wrong. For example, it would break code that is totally normal and reasonable and not at all deviating from the norms of the ecosystem. I don't like the idea that someone could follow every best practice we gave, and then end up broken just for expediency on our part. In most (all?) cases, supporting semantic contextual keywords is not difficult. Indeed, it's one of the simpler things to support. You simply bind as normal and accept the prior meaning if it is valid. If it isn't, then you allow the new meaning. This means we can gently add new things to the language and not have to worry at all about breaking people.
> Consider the work we're doing to support `field` inside properties in C# 10.

For clarification: this proposal explicitly states that cases like this should keep working as they do; contextual keywords will always exist. Just because properties use the `value` contextual keyword does not mean that we should now force that to be a keyword at all times. This proposal is about ensuring that contextual keywords can always be determined based solely on lexical information as opposed to semantic information.
> In most (all?) cases, supporting semantic contextual keywords is not difficult. Indeed, it's one of the simpler things to support. You simply bind as normal and accept the prior meaning if it is valid. If it isn't, then you allow the new meaning. This means we can gently add new things to the language and not have to worry at all about breaking people.

I totally agree that there is no engineering reason to change this. It just works for the compiler folks (as far as I am aware). But I think it adds an unnecessary burden on programmers using the language. Things like `var` and `nameof` feel very unfortunate to me. Anyone following best practices in C# does not expect `var` to be unavailable to them or for `nameof` to have different semantics based on esoteric name lookup rules. It feels like a real "gotcha" moment where I can go on Twitter and "well actually" anyone that uses code with these semantic contextual keywords and say "Oh, you are actually not discarding that but assigning a value to a variable named `_`".
Every other language I've encountered (C++, Java, TypeScript, Python, Go) does not use name lookup rules to determine whether something is a keyword (including F# and Visual Basic) and there have been no complaints. I personally feel that all this concern over keyword breakage has no real evidence; it's all theoretical. Java can just add a new keyword if they need to and no one complains.
> This proposal is about ensuring that contextual keywords can always be determined based solely on lexical information as opposed to semantic information.

Right, but the problem with that is that it directly goes against design goals we have for these features. For example, we want you to just be able to say `field`. There's nothing lexical/syntactic to distinguish that this is special. It's just going to reference the auto-prop field if nothing else binds.
> But I think it adds an unnecessary burden on programmers using the language

I don't really see this as a burden. For people just using the language, using `var` is going to work. So what needs to be fixed? Same with `nameof`, etc. People using our APIs could certainly be better served here with better APIs, but that would be a Roslyn concern.
> and there have been no complaints.

This is not true. Take Go, for example. There are lots of complaints about the verbosity of the language. And part of that verbosity arises because the language doesn't want to get into this space. So it ensures all its constructs are extremely verbose and often unwieldy, just so it doesn't have to do any semantic checks on this sort of thing. It's a tradeoff they made, but one we're quite loath to make, as it really just bulks up the language.
> Right, but the problem with that is that it directly goes against design goals we have for these features. For example, we want you to just be able to say `field`. There's nothing lexical/syntactic to distinguish that this is special. It's just going to reference the auto-prop field if nothing else binds.

I would need to review the proposal, but isn't this going to work exactly like `value`? You can just say that `field` is reserved now and you use `@field` if you need to "escape" the fact that this is a keyword now. I think this is an important distinction to the reader. You now have to explicitly state what your intent is. You are essentially saying "a casual reading of this might lead you to believe this is the `field` keyword, which has specific semantics, but that is not what is happening here; this is a custom instance and `@field` clues you into what is happening." If we were to do it all over again, would we have everything be a contextual keyword? I dunno, I suppose I could see the argument: why put roadblocks in folks' way? My position is that it's a weird language corner case that most C# developers are not aware of and is surprising to them when they learn about it.
If there is a design goal that can only be achieved with name lookup rules, or everyone else in the LDM just disagrees and thinks that semantic contextual keywords are awesome and we wish we did them more often, great! That's not my position but I am willing to be convinced.
> I would need to review the proposal, but isn't this going to work exactly like `value`?

No. `value` always binds to the property parameter prior to anything else in a higher scope. `field` will not (as that would break existing, perfectly fine code).
> You can just say that `field` is reserved now and you use `@field` if you need to "escape" the fact that this is a keyword now

That would break lots of code that is totally fine today and which wasn't doing anything strange or inappropriate. I do not see how customers are helped by just changing the meaning of their code on them.
> You are essentially saying "a casual reading of this

We are not, and should not be, beholden to 'a casual reading of this'.
If you see this:

    local = 0;

What does a casual reading tell you? Almost nothing. This could be a local, or a field, or a property, or a parameter. It could be assigned. It could be assigned by-ref. It could have conversions. It could throw. etc. etc. etc.
And that's just assignment. Once you get the `.` operator, all bets are 100% off :)
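As a small illustration of the point about `local = 0;`, the sketch below uses a hypothetical `Counter` class in which `local` happens to be a property, so the innocent-looking assignment runs arbitrary code. All names here are invented for the example.

```csharp
using System;

public sealed class Counter
{
    private int _writes;

    // Number of times the "local" property has been assigned.
    public int Writes => _writes;

    // Reads like a plain variable at the use site, but every write runs the setter.
    private int local
    {
        get => 0;
        set => _writes++;
    }

    public void Touch()
    {
        // Nothing about this statement tells the reader it is a property write.
        local = 0;
    }
}
```

Calling `Touch()` twice leaves `Writes` at 2, even though the statement in `Touch` is lexically just an assignment.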
> My position is that it's a weird language corner case that most C# developers are not aware of and is surprising to them when they learn about it.
Weird corner cases are always like that. But we have tons of those everywhere. The question is: is getting rid of weird corners better or worse than breaking code? The position we've landed on generally comes down to:
- is the code that is breaking reasonable? or is it unreasonable?
- is it widespread, or likely not used at all?
If it's unreasonable (which often comes down to debate) we are more likely to take the stance: trying to prop up this code is not worth it, so we would prefer to change it and accept that pathological cases break.
Similarly, if something is widespread, then we've already opened the barn door. People clearly are using the language in this fashion to a significant degree, and I think we have to accept that.
Where we have room to play around is when you get into the 'unreasonable, and not used (or very very rarely used)' territory. This is like someone coming along now and saying: yeah, i'm going to name my type `var` even though .NET naming conventions (both formal and informal) from day 1 have been that types are PascalCased. This is both unreasonable IMO for someone to do, and likely extraordinarily niche. (Indeed, my expectation is that this only exists in projects that seek to subvert the language/compiler, in which case i don't think of that as a reasonable thing to cater to).
--
So, in the case of some keywords (`var`, `record`, etc.) i'm actually ok with us taking over and saying: yeah, at this point this is ours. Reasonable codebases won't have any pain at all moving to this.
However, for some keywords, i'm not ok with us doing this. If the pattern is either reasonable, or widespread, we need to accept that and not harm users when we have a perfectly suitable way to both introduce the feature and keep things working just fine.

> So, in the case of some keywords (`var`, `record`, etc.) i'm actually ok with us taking over and saying: yeah, at this point this is ours. Reasonable codebases won't have any pain at all moving to this.
> However, for some keywords, i'm not ok with us doing this. If the pattern is either reasonable, or widespread, we need to accept that and not harm users when we have a perfectly suitable way to both introduce the feature and keep things working just fine.
I think this is a totally reasonable stance to take. `var` feels pretty uncontroversial (to me) but other keywords feel much further along in the spectrum of causing unreasonable harm. If the LDM says "`var` should just be a keyword but these others I think should stay as they are" I would be totally fine with that. I just want us to take the time to re-evaluate this and make sure we still feel the same way.
In the past there were more situations where a newer version of C# could be "pushed" on you. Today it's an explicit decision to update your SDK version to get an updated version of C#. Major SDK versions also have major breaking changes (API name changes etc.) to the point that developers expect some friction. I think it's not unreasonable to have folks change `field` to `@field` in these upgrade situations, but I will admit I am taking a stance that is way over to one side on how ok I am with breaks. Others do not need to join me over here.
If we scope this to "the language reserves the space of lowercase ascii identifiers for **type** contexts", then i'm totally ok with that :)
That would address `var`, `unmanaged`, `notnull`, `dynamic` and possibly some others that i'm not remembering.
I like the scoping here but I would also like to consider the case of discards. As the feature is written today it's hard to use discards broadly in a method and instead is most useful in a limited set of circumstances. In too many cases it subtly turns into an identifier, not a discard, and suddenly that invalidates other uses within the method body and suddenly you have to drop back to ignored names.
I'd def like to break out an issue on discards. I'm curious about the cases that are hard here and where it's difficult to mesh the idea of:
- use existing semantics if the code is legal
- reinterpret as discard if not
I think discards are also a space where we could potentially experiment with a ".NET upgrade" style approach, where we unilaterally reinterpreted this stuff but had tools fix the issue if you used these as non-discards in your project.

> I'm curious about the cases that are hard here
Converting between lambdas and local functions. Parameters in lambdas can be discards but not in local functions. That means when swapping between the two it introduces unnecessary friction because you have to rationalize discard behavior. It's no longer what essentially amounts to a syntax transform.
Whether a `_` is a discard or identifier in a lambda comes down to the count of parameters that you have. A single parameter means it's an identifier but multiple mean it's a discard.

    // _ is an identifier
    Action<int> action = (_) => {
        _ = ""; // Error cause _ is an int identifier
    };
    // _ is a discard
    Action<int, int> action2 = (_, _) => {
        _ = ""; // Okay cause this is a discard
    };
This is generally frustrating to have to remember, but it gets really frustrating when you consider it in the context of refactoring or code changes. Consider that lambda parameters are often listed as discards because they're a callback value that you may not need. Circumstances change and it's rational to begin using a parameter, which begins by assigning it a name. If assigning that parameter a name means there is only one `_` remaining, though, then it becomes an identifier and suddenly all the other `_` inside the method body are now interpreted as identifiers, which can cause compilation errors.
    string token = ...;
    Action<string, string> action = (_, value) => {
        // Error: This worked before I changed the second parameter to have a name
        if (int.TryParse(token, out _)) {
            ...
        }
    };
This, though, means there is a huge incentive to prefer `out var _` over `out _` even when `_` currently points to a discard. The `out var _` form is one of the few places where `_` unambiguously refers to a discard. Even though `out _` is more succinct, developers should consider always using the `out var _` form, even though it's longer and doesn't actually declare a variable, because it's more future proof against cases where `_` gets bound as an identifier.
These together all make it frustrating to use discards. It's too easy to get trapped in a case where `_` suddenly binds to an identifier, and that will invalidate many other cases in the method where you depended on having discards available, and there is little recourse for the developer when that happens.
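A minimal sketch of the `out var _` workaround described above, assuming a hypothetical helper named `DiscardDemo.ParseSecond`. The lambda has exactly one parameter named `_`, which is the situation where a bare `out _` stops being a discard.

```csharp
using System;

public static class DiscardDemo
{
    public static bool ParseSecond(string first, string second)
    {
        // Only one parameter is named _, so inside the lambda _ is a string identifier.
        Func<string, string, bool> parse = (_, value) =>
            // A bare "out _" here would bind to the string parameter and fail to compile;
            // "out var _" is a discard regardless of what _ means in the enclosing scope.
            int.TryParse(value, out var _);

        return parse(first, second);
    }
}
```

With this form, renaming or adding lambda parameters later does not change what the `out` argument means.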
> Whether a `_` is a discard or identifier in a lambda comes down to the count of parameters that you have. A single parameter means it's an identifier but multiple mean it's a discard.

Could we change that and instead make it so that if it's an error with the prior semantics, then it can now be reinterpreted as a discard?

> These together all make it frustrating to use discards. It's too easy to get trapped in a case where `_` suddenly binds to an identifier, and that will invalidate many other cases in the method where you depended on having discards available, and there is little recourse for the developer when that happens.

I have a supposition we can fix that, without having to go whole-hog into: all `_` are always discards.
The open question for me is if there are cases where code would be legal under either interpretation (identifier or discard), and you want the latter, and interpreting as the former would lead to undesirable behavior. If that exists, then this approach would likely not be viable. However, my hunch is that this would allow for:
- existing code to continue to compile with its existing meaning.
- Code that is currently in error will now compile, with a meaning that is sensible.
- Code that could potentially have both meanings (and this will retain the 'identifier' interpretation) will behave in a desirable way.
@CyrusNajmabadi
> Could we change that and instead make it so that if it's an error with the prior semantics, then it can now be reinterpreted as a discard?

I like where this is going but it sounds like there could be a lot of potentially tricky edge cases, especially if the code seems to intentionally mix discards and `_` as an identifier:

    if (int.TryParse(s, out _)) {
        // ...
    }
    // later ...
    var bar = foo.Select(_ => _.Bar)
> Could we change that and instead make it so that if it's an error with the prior semantics, then it can now be reinterpreted as a discard?

Can't do that because `_` is a legal identifier. As @HaloFour pointed out, it's just fine to use it via `_.ToString()`, etc. You can't even take shortcuts like saying "okay, if `_` is only used for assignment or `out` then make it a discard" because assignments to a `_` can have side effects (implicit conversion tricks).
This is the core problem we're facing. The decisions of C# 1.0 are essentially limiting our ability to make `_` a friction free feature. Unless we take some sort of conditional break here, we're essentially stuck with those decisions.
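To illustrate the "implicit conversion tricks" mentioned above, here is a small sketch in which assigning to `_` has an observable side effect. The `Tracked` type, its counter, and `ConversionDemo` are hypothetical and exist only for this example.

```csharp
using System;

public sealed class Tracked
{
    // Counts how many times the implicit conversion below has run.
    public static int Conversions;

    public static implicit operator Tracked(string text)
    {
        Conversions++;
        return new Tracked();
    }
}

public static class ConversionDemo
{
    public static int Run()
    {
        int before = Tracked.Conversions;

        // Single lambda parameter: _ is an identifier of type Tracked, so the
        // assignment converts the string and runs the user-defined operator.
        Action<Tracked> assign = _ => { _ = "text"; };
        assign(new Tracked());

        // Number of conversions triggered by the assignment inside the lambda.
        return Tracked.Conversions - before;
    }
}
```

If the compiler silently treated that `_` as a discard instead, the conversion would no longer be required by the assignment, which is the kind of behavioural change the comment is worried about.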
> Can't do that because `_` is a legal identifier.

I'm not talking about the cases where it has legal, error free, semantics.
I'm talking about the cases where it has illegal semantics. For example, where it would cause a scope collision.
This code would be illegal today, and so we can come up with rules to make it legal by saying: ah, read these all as discards now.
I'm unsure what you're asking for at this point.
In cases where the code compiles today using current rules, preserve the meaning of that code.
In cases where the code does not compile (for example, because of scope collision), allow reinterpretation as a discard.
This is effectively similar to how other semantic identifiers work. If 'nameof' binds, then use that, otherwise it is the semantic keyword.
Except instead of asking if it binds, ask if there is a scoping collision, or it is not found, or things like that. In that case, treat as discard.
That just sounds incredibly dangerous to me. Did you mean to use it as identifier and messed up, or did you actually want to discard? The intent of _ is to make the programmer's intent clear, but this would do the opposite.
Like I said, I want to do the mental exercise here.
With the other semantic identifiers the above holds (with the same arguments), but it really didn't turn out to be an issue.
My supposition is that it will be very obvious very quickly.
Binding isn't enough though, because it doesn't fix any of the problems I outlined. Today function parameters and single parameter lambdas are always bound as identifiers. Hence we can't take the approach we take with other contextual keywords (it's what we do today).
If we want to take the approach of "discard if it doesn't impact behavior" then that means we have to effectively implement two different binding passes. In order to determine if it's legal as a discard, by which I mean it doesn't change the side effects of the program, you have to do semantic analysis. You have to understand, for example, if the current approach has silent implicit conversions. Consider the following as an example:
    M(x, _ => { _ = x; });

In order to understand if it is legal to treat `_` as a discard or whether it must be preserved as an identifier, you must go through a full binding pass. It's completely possible that it binds to the following, for which treating `_` as a discard in the future would be a breaking change.

    void M(Action<string, dynamic> a);
I don't think this is worth doing a double binding pass on methods which is why I'm pushing for other approaches.