[BUG] the grammar comments are not always correct
Out of curiosity I have implemented an alternative parser for cppfront / cpp2, which uses a PEG grammar as input for a parser generator. During that experiment I noticed that the grammar rules embedded as //G comments are not always correct. I will list the errors that I noticed below.
One preliminary note: the cppfront compiler has a rather relaxed concept of keywords. In most cases it will accept a keyword where an identifier is expected; for example, it will happily compile if: () -> void = { }. I don't think that is a good idea; my grammar explicitly distinguishes between keywords and identifiers (modulo the few context-specific soft keywords like in/out etc.). For some grammar rules that requires changes where the parser previously worked by accident (i.e., by not recognizing a certain keyword).
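For illustration, a minimal Cpp2 sketch of the kind of declaration the relaxed handling accepts (the point is about the parser only; the generated Cpp1 code could not actually define a function named "if"):

    // Accepted by the current cppfront parser: the keyword "if" is used
    // where an identifier (the declaration name) is expected.
    if: () -> void = { }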
a) id_expression
//G id-expression:
//G     unqualified-id
//G     qualified-id
//G
Here the order is wrong: with ordered alternatives, unqualified-id would match the leading identifier of a qualified name and qualified-id would never be tried. It should be
//G id-expression:
//G     qualified-id
//G     unqualified-id
//G
b) primary_expression
//G primary-expression:
//G     literal
//G     ( expression-list )
//G     id-expression
//G     unnamed-declaration
//G     inspect-expression
//G
This does not correspond to the source code order. Furthermore, the expression-list is optional. And if we distinguish keywords from literals, we potentially need some extra rules to handle keywords that are currently silently eaten as identifiers. I would suggest:
//G primary-expression:
//G     inspect-expression
//G     id-expression
//G     literal
//G     '(' expression-list? ')'
//G     unnamed-declaration
//G     'nullptr'
//G     'true'
//G     'false'
//G     'typeid' '(' expression ')'
//G     'new' '<' id-expression '>' '(' expression-list? ')'
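For illustration, a small Cpp2 sketch (not taken from the cppfront test suite) exercising two of the keyword alternatives listed above:

    main: () = {
        p := new<int>(42);   // 'new' '<' id-expression '>' '(' expression-list? ')'
        b := true;           // 'true' / 'false' handled as keywords, not identifiers
    }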
c) nested-name-specifier
//G nested-name-specifier:
//G     ::
//G     unqualified-id ::
This has to support nested scopes. I would suggest:
//G nested-name-specifier:
//G     :: (unqualified-id ::)*
//G     (unqualified-id ::)+
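Some of the names the revised rule needs to cover (illustrative Cpp2 only, assuming the usual standard headers are reachable through cpp2util.h):

    main: () = {
        v: std::vector<int> = (1, 2, 3);           // (unqualified-id ::)+ before the final id
        w: ::std::vector<int> = (4, 5, 6);         // leading :: followed by nested scopes
        t := std::chrono::steady_clock::now();     // more than one nesting level
    }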
d) template-argument
//G template-argument:
//G     expression
//G     id-expression
There should be a comment here that we disable '<'/'>'/'<<'/'>>' in the expressions until a new parenthesis is opened. In fact that causes some of the expression rules to be cloned until we reach the level below these operators. (In my implementation these are the rules with the suffix _no_cmp.)
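A short illustration of the ambiguity behind that restriction (mirroring the corresponding C++ rule): inside a template-argument list a bare '>' has to close the list, while inside a fresh pair of parentheses a comparison is unambiguous again.

    main: () = {
        v: std::vector<int> = (1, 2, 3);
        // Without the restriction, the '>' below could be taken as closing a
        // template-argument list rather than as a comparison operator:
        bigger := v.size() > 2;
        // Inside parentheses a comparison is fine even within a template-argument
        // list, e.g. std::array<int, (2 > 1)>.
    }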
e) id-expression from fundamental types
We want to accept builtin types like int as type ids. Currently this works by accident because the parser does not even recognize these as keywords. When enforcing that keywords are not identifiers, we need rules for these, too. I have added a fundamental-type alternative at the end of id-expression, and have defined it as follows:
fundamental-type:
    'void'
    fundamental-type-modifier-list? 'char'
    'char8_t'
    'char16_t'
    'char32_t'
    'wchar_t'
    fundamental-type-modifier-list? 'int'
    'bool'
    'float'
    'double'
    'long' 'double'

fundamental-type-modifier-list:
    fundamental-type-modifier+

fundamental-type-modifier:
    'unsigned'
    'signed'
    'long'
    'short'
Excellent experiment. It brings even more insights.
> We want to accept builtin types like int as type ids. Currently this works by accident because the parser does not even recognize these as keywords.
Isn't it desirable to leave the builtin types outside of the reserved words of the grammar? They are identifiers after all, just ones that the user didn't define somewhere else.
Officially they are keywords; we should not treat them as regular identifiers. Otherwise you can define a function called "int". And they behave a bit strangely anyway: for example, "unsigned long int" and "long unsigned int" are both valid (and in fact identical). I think that has to be handled explicitly; we cannot pretend that these behave like regular identifiers.
for example "unsigned long int" and "long unsigned int" are both valid (and in fact identical)
Ok yeah this sold it to me. Even if only one of those two combinations were allowed, unsigned long UserType is senseless. Unless one were to remove all those keywords anyway and offer only one-word builtins: uint8_t, int32_t, etc.
Giving a function the same name as a type is equally undesirable for builtin or user-defined types. It's a problem of name collision / shadowing, which in my mind is a whole different category from if: () -> void = { }.
> Even if only one of those two combinations were allowed, unsigned long UserType is senseless. Unless one were to remove all those keywords anyway and offer only one-word builtins, uint8_t, int32_t, etc.
I would vote for that. The same rule for all types. No special cases. So either unsigned long int is a type and therefore unsigned long UserType is too. Or we have, like you said, just uint32_t and no unsigned long int.
Or we treat unsigned and long as some form of concept or template parameter with which you can constrain or widen the type. As an example I can currently only think of strings, which often need to be constrained: having a size from 10 to 12, having no whitespace, being trimmed, or allowing just the letters A-F. These constraints could then be molded into a stricter type, e.g. using hex_string = no_whitespace trimmed size<4> allowed_chars<HEX> string; (Just a spontaneous thought.)
> I would vote for that. The same rule for all types. No special cases. So either unsigned long int is a type and therefore unsigned long UserType is too. Or we have, like you said, just uint32_t and no unsigned long int.
But that breaks compatibility with existing C++ code. Note that, e.g., long and long long are two distinct types, with distinct overloads, even if both map to 64-bit integers. Silly, I know, but if we only support int64_t we cannot seamlessly interact with existing code bases.
It wouldn't be hard to convince me that leaving this craziness behind is a worthwhile goal, even without the issue of multi-word types we were discussing. I've been bitten by char being signed or not depending on the platform.
We could still provide them for compatibility without making them the path of least resistance, e.g. by offering compat::unsigned_long_long etc. that the transpiler would special-case to produce the equivalent C++ type. This would increase the complexity of one part of the transpiler, with the hope of simplifying another part (the grammar). But I don't know if the tradeoff is worthwhile. It does seem to me to be simpler for the user.
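A minimal sketch of that suggestion (the compat names here are hypothetical, not part of cppfront; the idea is only that the transpiler would special-case them):

    main: () = {
        x: compat::unsigned_long_long = 0;   // would lower to the Cpp1 type unsigned long long
        c: compat::signed_char = 0;          // would lower to signed char
    }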
In what context is a multi-identifier type a problem? I think they might be OK in declarations, thanks to the : giving a heads-up to the parser. Perhaps those other contexts could require a :, when a single identifier is otherwise expected, to specify a multi-identifier type.
From a parsing perspective it is not a problem; it is just that the rules for fundamental types are a bit funny. For example signed short int, short int, short signed int, and short are all the same type. Thus the parser has to recognize that this is a fundamental type and has to normalize it into a certain form. You cannot pretend that a fundamental type is just like a regular user-defined type.
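For illustration, the permutations mentioned above written as Cpp2 declarations (whether cppfront accepts all of these spellings depends on the multi-word handling discussed later in this thread); all four denote the same Cpp1 type, short int, so a parser that models fundamental types has to normalize them:

    main: () = {
        a: short            = 0;
        b: short int        = 0;
        c: signed short int = 0;
        d: short signed int = 0;
    }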
> Thus the parser has to recognize that this is a fundamental type and has to normalize it into a certain form.
Normalizing this is not a problem for cppfront since Herb wants the Cpp1 source to resemble what the user wrote for Cpp2.
>> I would vote for that. The same rule for all types. No special cases. So either unsigned long int is a type and therefore unsigned long UserType is too. Or we have, like you said, just uint32_t and no unsigned long int.
>
> But that breaks compatibility with existing C++ code. Note that, e.g., long and long long are two distinct types, with distinct overloads, even if both map to 64-bit integers. Silly, I know, but if we only support int64_t we cannot seamlessly interact with existing code bases.
But this could be a nice feature when only Cpp2 code is allowed (i.e., behind a flag). Without the flag, backwards compatibility is not broken. With the flag, welcome to real modern C++!
> Normalizing this is not a problem for cppfront since Herb wants the Cpp1 source to resemble what the user wrote for Cpp2.
But I think in the long run that is not sufficient. cppfront is currently more like a preprocessor: it takes one syntax and transforms it into another. That has the advantage that the system does not have to understand the semantics of the program, but it has the disadvantage that the compiler cannot do error checking, and it also cannot do type resolution or name lookup. Therefore I have started implementing full semantic analysis in my little experiment, e.g., the compiler infers all data types and checks whether the program is valid. And for that you need proper handling of fundamental types; simple text replacement is not good enough.
I understand that. And you mentioned that parsing a multi-identifier type is not a problem. So I take it that this was a digression from the main issue.
This was all about one specific point of the many @neumannt discovered with his reimplementation of the parser: "We want to accept builtin types like int as type ids. Currently this works by accident because the parser does not even recognize these as keywords." I suggested that what can be considered an accident of the current implementation (builtin types aren't reserved keywords) feels to me like it should be a feature.
Thanks! I've adopted some of the changes, particularly production ordering.
I haven't changed the grammar to disallow relational operators inside < > (until a new ( ) is opened), because I view it as a semantic rule. I'm not convinced this needs to be in the grammar, even though I've implemented it in the parser as a parameter (IMO that's left-shifting a sema rule so we catch it early at parse time, not making it an actual parsing/grammar rule). Feel free to push back if you think I'm taking too many conceptual liberties here though.
Re fundamental types, I don't intend to support multi-token names like unsigned long int and such. IMO uint64 for the win... for now it's std::uint64_t but on my to-do list is to have nicer aliases for these.
Again, thanks!
Multi-token fundamental types are now supported: 966856f99ec859861dbee4383d20a3b9d14019cb.
Also I just checked in support for the C-style multi-token fundamental types... things like unsigned long long are now merged into a single Cpp2 token that contains internal whitespace, which I think is an elegant solution for those and that didn't disturb the rest of the code. With those fundamental types now all supported, and the multi-* pointers supported, I think we now can utter all the C/C++ types for full compatibility... please open an issue if I missed any. -- https://github.com/hsutter/cppfront/pull/106#issuecomment-1367672928
@JohelEGP Thanks for mentioning that on this thread too.
I wasn't going to support the multi-token names, but once I thought of a simple solution, I did support them because it's important to avoid any compatibility friction with today's syntax where it doesn't compromise Cpp2's design (and this doesn't, we don't have to encourage using them in Cpp2 even though they happen to work).
Speaking of which, if we don't encourage those, what should we encourage? I wrote above...
> IMO uint64 for the win... for now it's std::uint64_t but on my to-do list is to have nicer aliases for these.
... and I just pushed that support in commit 4d9d52d7b4c58c8800ef474d208637b6f2e585de. Quoting the commit:
> Add support for fixed-width integer type aliases (i32, u64, etc.)
>
> Note these are not reserved words, because that would conflict with existing code. Unqualified i32 means cpp2::i32, but it's fine for a program to have its own i32 identifier and use it via namespace-qualification disambiguation.
>
> The naming pattern is:
> - u or i (unsigned or signed)
> - 8, 16, 32, or 64 (bit width... I'm open to adding more as they can be portably supported)
> - (optional) _fast or _small (following <cstdint>'s _fast and _least)
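A short usage sketch, assuming the alias names follow the pattern quoted above (e.g. i32, u64, u32_fast, u8_small from cpp2util.h):

    main: () = {
        count: u64 = 0;
        delta: i32 = -1;
        index: u32_fast = 0;   // "fast" variant, maps to std::uint_fast32_t
        tiny:  u8_small = 0;   // "small" variant, maps to std::uint_least8_t
    }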
On second thought, the important thing is to support a way of uttering the types that are spelled multi-word in C/C++ today, but I can do that just by providing type aliases. Then Cpp2 code visibly uses normal single-token names throughout, while still having full interop compatibility with all the non-fixed-width C/C++ types.
So I've tweaked the approach: I'm leaving in the code that handles the multi-word names in case a programmer does try to write them (which many will because of familiarity), but instead of merging them and making them work as I did last week, I'll reject them with a hopefully-nice diagnostic directing the programmer to use the nicer names instead (and giving extra discouragement for the really pernicious ones like signed char).
Checked in yesterday in b972313d09224f2747f1f4c03c53167b03ce4e40, improved diagnostic today with 98ae2b9d1734e7e5bf667ca10962d95abf7e5fb8. Here's an example diagnostic:
test.cpp2(7,8): error: 'signed char' - did you mean 'i8' (usually best) or '__schar'?
test.cpp2(7,8): error: 'signed char' is an old-style C/C++ multi-word keyword type
- most such types should be used only for interoperability with older code
- using those when you need them is fine, but name them with these short names instead:
short, ushort, int, uint, long, ulong, longlong, ulonglong, longdouble, __schar, __uchar
- see also cpp2util.h > "Convenience names for integer types"
I decided to make the names for explicitly-signed/unsigned char extra ugly, because a character is not an integer (even though C conflates them). In Cpp2 I'm aiming toward making "character" separate from "numeric", including eventually allowing mathematical operations only on numeric types, and character operations only on character types. It's just a category error to do math with letters (insert joke here about how the Romans tried).