Improve type vs variable detection
The current parser cannot fully disambiguate some expressions that may both be a type and an expression:
-
In
sizeof(A[10]),Acan be a type and the expression evaluates to the size of an array of 10A, orAis a global constant array and the expression is the size of an element of this array. -
when parsing statements, an imperfect heuristic is used to distinguish declarations from expressions:
a * b;could be a multiplication whose result is not used or a definition ofbas a pointer toa. Case could be used to disambiguate, but I would prefer not to have to rely on case alone.
There is a simple solution to this problem: we could perform 2 full passes on every source file:
- the first pass would serve only to detect type names: it would parse only
typedefinitions for type names and would use an accelerated tokenizer to parse other statements and bracketed contents. This should be very fast. - the second pass would perform the current parsing, but identifiers would no longer be ambiguous, type names would be known and this would simplify the parsing as well as the analysis phases.
Doing 2 full passes on a file does not solve the issues, because the Analyser is only called after all parsing is done. In the early days, I tried using many mini-parsers, that would stop on reaching something unknown. Then other parsers would continue etc. This became so complex that i abandoned this idea.
For sizeof, we could specify that A[10] just means 10xA. Also we could forbid the use of [] is sizeof altogether, since it's a bit weird anyways. You either want the size of a type or the size of a variable's type. Sizeof is the only weird expr, in that you can have a Type or Variable as argument.
I really, really want to avoid doing 2 passes at the parser level..
Doing 2 full passes on a file does not solve the issues, because the Analyser is only called after all parsing is done. In the early days, I tried using many mini-parsers, that would stop on reaching something unknown. Then other parsers would continue etc. This became so complex that i abandoned this idea.
The initial pass would need to be performed on all source files before the current compilation process can start as a second phase. By then all type names will be known so the parser can use a much simpler approach to determine if an identifier or a member expression refers to a type or not.
For sizeof, we could specify that A[10] just means 10xA. Also we could forbid the use of [] is sizeof altogether, since it's a bit weird anyways. You either want the size of a type or the size of a variable's type. Sizeof is the only weird expr, in that you can have a Type or Variable as argument.
The current implementation assumes that sizeof(A[10]) is the size of an array of 10 objects of type A. A is only ambiguous if there is a global variable by the same name. No need to further restrict it. C allows sizeof expr without parentheses as a unary operator, we could support that easily.
I really, really want to avoid doing 2 passes at the parser level.
Given the speed of the lexer, this would be hardly noticeable. Note that knowing the type names would help remove many calls to Tokenizer.lookahead.
For ambiguous cases there are basically two ways out of it:
- Infinite lookahead and disallow
a * b;. This is fine because in practice the lookahead needed is very short. You can either reuse the tokens or retokenize. This is used by D - Make the lexer disambiguate names by requiring distinct naming for types. Downside is the need to redefine C names that are not conforming, upside is that tooling and syntax highlighting is trivial. This is what I used in C3 based on the C2 naming rules.
For the latter sizeof(Foo[3]) would always know that this is a type declaration and sizeof(A[3]) and sizeof(foo[3]) would all be indexing into a constant / variable.
during the parsing: sizeof(Foo[3]) and sizeof(A[3]) are equivalent in terms of tokens. So the parser has no way of knowing if it's a type/constant. So how could it differentiate?
You're interested in (1)?
Start parsing it as an expression or declaration, then switch to the other when it fails. For the ambiguous case of A[2], it will stay in the first attempted parsing form. It is then semantically disambiguated:
- Lookup the innermost value (in this case
A) - If this is a type then treat it as a type (rewrite if necessary)
- If this is a variable or constant then treat it as an expression (rewrite if necessary)
Deferred parsing is also possible and I tried it but didn't like it much. It was used in Jiyu successfully though.
During parsing, no lookup can be done yet, since the file containing a symbol's definition might not even be opened yet. In some case 'Foo*' we see that Foo must be a Type, but Foo[3] is already ambiguous. I think sizeof is the only nasty expression that has this 'feature'.
Did you change the way you parse in C3? So that during parsing a symbol is already known (like C)?
I know it's not known, but you can disambiguate it during the semantic step. I already actually push some errors from parsing to semantic step in order to create better errors. It's not a big deal to do so.
C3 is LL(1). It can be that because _*[A-Z][A-Z_0-9]*[a-z][A-Za-z0-9_]* is always a user defined type, and anything not matching that is either a constant (_*[A-Z][A-Z_0-9]*) or an identifier (_*[a-z][A-Za-z0-9_]*). Builtin types (int, double, float etc) are the exception. This is built into the lexer.
So the lexer would identify Foo as a type token, A as a constant and foo as a regular identifier. Strictly speaking, the difference between a constant and regular identifier is not necessary.
@lerno you can do this because types in C3 must have an uppercase initial and at least one lower case letter, whereas constants must be all capital letters?
Delayed parsing is an interesting option, but the current parsing must be able to continue and scanning for the ) skipping balanced pairs of delimiters can allow that.
A more radical solution is an initial pass on all source files to collect all type name definitions and skip the code efficiently. This would remove the need for syntactic disambiguation using the character case.
Doing a radical rewrite of the parsing system just to fix sizeof seems a bit overkill. We already postpone the type/variable check to the semantic phase. Otherwise we could specify that the sizeof argument can only be a Type or a variable. Doing 2 complete passes will cause a lot of slowdown
I think multiple passes is overkill as well. The need for sizeof(a[0]) is mostly due to the inability to otherwise get the size of an element. This is especially important in C with macros.
Incidentally, in C3 I made sizeof(...) (or rather $sizeof(...), but it's the same thing) only take expressions, and the size of a type is retrieved using Foo[123].sizeof.
Similarly, one might consider sizeof(...) for types only and exprsizeof(...) for expressions.
But again, speculatively parsing sizeof(...) as containing a type, then turning it into an expression if it fails seems like the easiest route following C. The exprsizeof solution is the overall easiest but introduces a new keyword and a diff compared to C.
In my latest patch, the only remaining ambiguity is sizeof(Foo[20]) where Foo is assumed to be a type but could in fact be a global variable defined as an array. This is unlikely to ever be a problem, and if the users really want the size of an element of Foo they can write sizeof(*Foo). We currently do not support sizeof expr without parentheses, but parsing that would be quite easy and sizeof Foo[20] would be unambiguous too. Adding this syntax is advisable for the sake of C compatibility. Does C3 support it? Or does $sizeof require parentheses?
I started some discussions on Github, not sure if you get a ping for that (so abusing this thread ;) )
$sizeof has parenthesis in C3, because it's a compile time function, there is also $alignof $offsetof etc
This was fixed for most cases by #357