[BUG] the grammar comments are not always correct
Out of curiosity I have implemented an alternative parser for cppfront / cpp2, which uses a PEG grammar as input for a parser generator. During that experiment I noticed that the grammar rules embedded as //G comments are not always correct. I will list the errors that I noticed below.
One preliminary note: the cppfront compiler has a rather relaxed concept of keywords. In most cases it will accept a keyword where an identifier is expected; for example, it will happily compile if: () -> void = { }. I don't think that is a good idea; my grammar explicitly distinguishes between keywords and identifiers (modulo the few context-specific soft keywords like in/out etc.). For some grammar rules that requires changes where the parser previously worked by accident (i.e., by not recognizing a certain keyword).
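For illustration, a minimal Cpp2 sketch of the kind of declaration the relaxed handling accepts (the point is about the parser only; the generated Cpp1 code could not actually define a function named "if"):

    // Accepted by the current cppfront parser: the keyword "if" is used
    // where an identifier (the declaration name) is expected.
    if: () -> void = { }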
a) id_expression
//G id-expression:
//G     unqualified-id
//G     qualified-id
//G
Here the order is wrong: with ordered alternatives, unqualified-id would match the leading identifier of a qualified name and qualified-id would never be tried. It should be
//G id-expression:
//G     qualified-id
//G     unqualified-id
//G
b) primary_expression
//G primary-expression:
//G     literal
//G     ( expression-list )
//G     id-expression
//G     unnamed-declaration
//G     inspect-expression
//G
This does not correspond to the source code order. Furthermore, the expression-list is optional. And if we distinguish keywords from literals, we potentially need some extra rules to handle keywords that are currently silently eaten as identifiers. I would suggest:
//G primary-expression:
//G     inspect-expression
//G     id-expression
//G     literal
//G     '(' expression-list? ')'
//G     unnamed-declaration
//G     'nullptr'
//G     'true'
//G     'false'
//G     'typeid' '(' expression ')'
//G     'new' '<' id-expression '>' '(' expression-list? ')'
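For illustration, a small Cpp2 sketch (not taken from the cppfront test suite) exercising two of the keyword alternatives listed above:

    main: () = {
        p := new<int>(42);   // 'new' '<' id-expression '>' '(' expression-list? ')'
        b := true;           // 'true' / 'false' handled as keywords, not identifiers
    }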
c) nested-name-specifier
//G nested-name-specifier:
//G     ::
//G     unqualified-id ::
This has to support nested scopes. I would suggest:
//G nested-name-specifier:
//G     :: (unqualified-id ::)*
//G     (unqualified-id ::)+
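Some of the names the revised rule needs to cover (illustrative Cpp2 only, assuming the usual standard headers are reachable through cpp2util.h):

    main: () = {
        v: std::vector<int> = (1, 2, 3);           // (unqualified-id ::)+ before the final id
        w: ::std::vector<int> = (4, 5, 6);         // leading :: followed by nested scopes
        t := std::chrono::steady_clock::now();     // more than one nesting level
    }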
d) template-argument
//G template-argument:
//G     expression
//G     id-expression
There should be a comment here that we disable '<'/'>'/'<<'/'>>' in the expressions until a new parenthesis is opened. In fact that causes some of the expression rules to be cloned until we reach the level below these operators. (In my implementation these are the rules with the suffix _no_cmp.)
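A short illustration of the ambiguity behind that restriction (mirroring the corresponding C++ rule): inside a template-argument list a bare '>' has to close the list, while inside a fresh pair of parentheses a comparison is unambiguous again.

    main: () = {
        v: std::vector<int> = (1, 2, 3);
        // Without the restriction, the '>' below could be taken as closing a
        // template-argument list rather than as a comparison operator:
        bigger := v.size() > 2;
        // Inside parentheses a comparison is fine even within a template-argument
        // list, e.g. std::array<int, (2 > 1)>.
    }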
e) id-expression from fundamental types
We want to accept builtin types like int as type ids. Currently this works by accident because the parser does not even recognize these as keywords. When enforcing that keywords are not identifiers, we need rules for these, too. I have added a fundamental-type alternative at the end of id-expression, and have defined it as follows:
fundamental-type:
    'void'
    fundamental-type-modifier-list? 'char'
    'char8_t'
    'char16_t'
    'char32_t'
    'wchar_t'
    fundamental-type-modifier-list? 'int'
    'bool'
    'float'
    'double'
    'long' 'double'

fundamental-type-modifier-list:
    fundamental-type-modifier+

fundamental-type-modifier:
    'unsigned'
    'signed'
    'long'
    'short'
Excellent experiment. It brings even more insights.
> We want to accept builtin types like int as type ids. Currently this works by accident because the parser does not even recognize these as keywords.
Isn't it desirable to leave the builtin types outside of the reserved words of the grammar? They are identifiers after all, just ones that the user didn't define somewhere else.
Officially they are keywords; we should not treat them as regular identifiers. Otherwise you can define a function called "int". And they behave a bit strangely anyway: for example, "unsigned long int" and "long unsigned int" are both valid (and in fact identical). I think that has to be handled explicitly; we cannot pretend that these behave like regular identifiers.
for example "unsigned long int" and "long unsigned int" are both valid (and in fact identical)
Ok yeah this sold it to me. Even if only one of those two combinations were allowed, unsigned long UserType is senseless. Unless one were to remove all those keywords anyway and offer only one-word builtins: uint8_t, int32_t, etc.
Giving a function the same name as a type is equally undesirable for builtin or user-defined types. It's a problem of name collision / shadowing, which in my mind is a whole different category from if: () -> void = { }.
> Even if only one of those two combinations were allowed, unsigned long UserType is senseless. Unless one were to remove all those keywords anyway and offer only one-word builtins, uint8_t, int32_t, etc.
I would vote for that. The same rule for all types. No special cases. So either unsigned long int is a type and therefore unsigned long UserType is too. Or we have, like you said, just uint32_t and no unsigned long int.
Or we treat unsigned and long as some form of concept or template parameter with which you can constrain or widen the type. As an example I can currently only think of strings, which often need to be constrained: having a size from 10 to 12, having no whitespace, being trimmed, or allowing just the letters A-F. These constraints could then be molded into a stricter type, e.g. using hex_string = no_whitespace trimmed size<4> allowed_chars<HEX> string; (Just a spontaneous thought.)
> I would vote for that. The same rule for all types. No special cases. So either unsigned long int is a type and therefore unsigned long UserType is too. Or we have, like you said, just uint32_t and no unsigned long int.
But that breaks compatibility with existing C++ code. Note that, e.g., long and long long are two distinct types, with distinct overloads, even if both map to 64-bit integers. Silly, I know, but if we only support int64_t we cannot seamlessly interact with existing code bases.
It wouldn't be hard to convince me that leaving this craziness behind is a worthwhile goal, even without the issue of multi-word types we were discussing. I've been bitten by char being signed or not depending on the platform.
We could still provide them for compatibility without making them the path of least resistance, e.g. by offering compat::unsigned_long_long etc. that the transpiler would special-case to produce the equivalent C++ type. This would increase the complexity of one part of the transpiler, with the hope of simplifying another part (the grammar). But I don't know if the tradeoff is worthwhile. It does seem to me to be simpler for the user.
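A minimal sketch of that suggestion (the compat names here are hypothetical, not part of cppfront; the idea is only that the transpiler would special-case them):

    main: () = {
        x: compat::unsigned_long_long = 0;   // would lower to the Cpp1 type unsigned long long
        c: compat::signed_char = 0;          // would lower to signed char
    }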
In what context is a multi-identifier type a problem? I think they might be OK in declarations, thanks to the : giving a heads-up to the parser. Perhaps those other contexts could require a :, when a single identifier is otherwise expected, to specify a multi-identifier type.
From a parsing perspective it is not a problem; it is just that the rules for fundamental types are a bit funny. For example signed short int, short int, short signed int, and short are all the same type. Thus the parser has to recognize that this is a fundamental type and has to normalize it into a certain form. You cannot pretend that a fundamental type is just like a regular user-defined type.
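For illustration, the permutations mentioned above written as Cpp2 declarations (whether cppfront accepts all of these spellings depends on the multi-word handling discussed later in this thread); all four denote the same Cpp1 type, short int, so a parser that models fundamental types has to normalize them:

    main: () = {
        a: short            = 0;
        b: short int        = 0;
        c: signed short int = 0;
        d: short signed int = 0;
    }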
> Thus the parser has to recognize that this is a fundamental type and has to normalize it into a certain form.
Normalizing this is not a problem for cppfront since Herb wants the Cpp1 source to resemble what the user wrote for Cpp2.
>> I would vote for that. The same rule for all types. No special cases. So either unsigned long int is a type and therefore unsigned long UserType is too. Or we have, like you said, just uint32_t and no unsigned long int.
>
> But that breaks compatibility with existing C++ code. Note that, e.g., long and long long are two distinct types, with distinct overloads, even if both map to 64-bit integers. Silly, I know, but if we only support int64_t we cannot seamlessly interact with existing code bases.
But this could be a nice feature when only Cpp2 code is allowed (i.e., behind a flag). Without the flag, backwards compatibility is not broken. With the flag, welcome to real modern C++!
> Normalizing this is not a problem for cppfront since Herb wants the Cpp1 source to resemble what the user wrote for Cpp2.
But I think in the long run that is not sufficient. cppfront is currently more like a preprocessor: it takes one syntax and transforms it into another. That has the advantage that the system does not have to understand the semantics of the program, but it has the disadvantage that the compiler cannot do error checking, and it also cannot do type resolution or name lookup. Therefore I have started implementing full semantic analysis in my little experiment, e.g., the compiler infers all data types and checks whether the program is valid. And for that you need proper handling of fundamental types; simple text replacement is not good enough.
I understand that. And you mentioned that parsing a multi-identifier type is not a problem. So I take it that this was a digression from the main issue.
This was all about one specific point of the many @neumannt discovered with his reimplementation of the parser: "We want to accept builtin types like int as type ids. Currently this works by accident because the parser does not even recognize these as keywords." I suggested that what can be considered an accident of the current implementation (builtin types aren't reserved keywords) feels to me like it should be a feature.
Thanks! I've adopted some of the changes, particularly production ordering.
I haven't changed the grammar to disallow relational operators inside < > (until a new ( ) is opened), because I view it as a semantic rule. I'm not convinced this needs to be in the grammar, even though I've implemented it in the parser as a parameter (IMO that's left-shifting a sema rule so we catch it early at parse time, not making it an actual parsing/grammar rule). Feel free to push back if you think I'm taking too many conceptual liberties here though.
Re fundamental types, I don't intend to support multi-token names like unsigned long int and such. IMO uint64 for the win... for now it's std::uint64_t but on my to-do list is to have nicer aliases for these.
Again, thanks!
Multi-token fundamental types are now supported: 966856f99ec859861dbee4383d20a3b9d14019cb.
Also I just checked in support for the C-style multi-token fundamental types... things like unsigned long long are now merged into a single Cpp2 token that contains internal whitespace, which I think is an elegant solution for those and that didn't disturb the rest of the code. With those fundamental types now all supported, and the multi-* pointers supported, I think we now can utter all the C/C++ types for full compatibility... please open an issue if I missed any. -- https://github.com/hsutter/cppfront/pull/106#issuecomment-1367672928
@JohelEGP Thanks for mentioning that on this thread too.
I wasn't going to support the multi-token names, but once I thought of a simple solution, I did support them because it's important to avoid any compatibility friction with today's syntax where it doesn't compromise Cpp2's design (and this doesn't, we don't have to encourage using them in Cpp2 even though they happen to work).
Speaking of which, if we don't encourage those, what should we encourage? I wrote above...
> IMO uint64 for the win... for now it's std::uint64_t but on my to-do list is to have nicer aliases for these.
... and I just pushed that support in commit 4d9d52d7b4c58c8800ef474d208637b6f2e585de. Quoting the commit:
> Add support for fixed-width integer type aliases (i32, u64, etc.)
>
> Note these are not reserved words, because that would conflict with existing code. Unqualified i32 means cpp2::i32, but it's fine for a program to have its own i32 identifier and use it via namespace-qualification disambiguation.
>
> The naming pattern is:
> - u or i (unsigned or signed)
> - 8, 16, 32, or 64 (bit width... I'm open to adding more as they can be portably supported)
> - (optional) _fast or _small (following <cstdint>'s _fast and _least)
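A short usage sketch, assuming the alias names follow the pattern quoted above (e.g. i32, u64, u32_fast, u8_small from cpp2util.h):

    main: () = {
        count: u64 = 0;
        delta: i32 = -1;
        index: u32_fast = 0;   // "fast" variant, maps to std::uint_fast32_t
        tiny:  u8_small = 0;   // "small" variant, maps to std::uint_least8_t
    }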
On second thought, the important thing is to support a way of uttering the types that are spelled multi-word in C/C++ today, but I can do that just by providing type aliases. Then Cpp2 code visibly uses normal single-token names throughout, while still having full interop compatibility with all the non-fixed-width C/C++ types.
So I've tweaked the approach: I'm leaving in the code that handles the multi-word names in case a programmer does try to write them (which many will because of familiarity), but instead of merging them and making them work as I did last week, I'll reject them with a hopefully-nice diagnostic directing the programmer to use the nicer names instead (and giving extra discouragement for the really pernicious ones like signed char).
Checked in yesterday in b972313d09224f2747f1f4c03c53167b03ce4e40, improved diagnostic today with 98ae2b9d1734e7e5bf667ca10962d95abf7e5fb8. Here's an example diagnostic:
test.cpp2(7,8): error: 'signed char' - did you mean 'i8' (usually best) or '__schar'?
test.cpp2(7,8): error: 'signed char' is an old-style C/C++ multi-word keyword type
- most such types should be used only for interoperability with older code
- using those when you need them is fine, but name them with these short names instead:
short, ushort, int, uint, long, ulong, longlong, ulonglong, longdouble, __schar, __uchar
- see also cpp2util.h > "Convenience names for integer types"
I decided to make the names for explicitly-signed/unsigned char extra ugly, because a character is not an integer (even though C conflates them). In Cpp2 I'm aiming toward making "character" separate from "numeric", including eventually allowing mathematical operations only on numeric types, and character operations only on character types. It's just a category error to do math with letters (insert joke here about how the Romans tried).