pcpp

SyntaxError on use in expression of symbol with leading decimal digits

Open willwray opened this issue 3 years ago • 13 comments

Here's a reduced reproducer:

#define Ox 0x
#if Ox
#endif

then pcpp test.h gives

test.h:3 error: Could not evaluate expression
 due to SyntaxError("around token 'x' type CPP_ID") (passed to evaluator: '0x')

It looks like leading decimal digits are eagerly stripped when parsed for the expression.

willwray avatar Nov 22 '22 10:11 willwray

debugpy/launcher 37201 -- -m pcmd test.h

PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
test.h:3 error: Could not evaluate expression due to SyntaxError("around token 'x' type CPP_ID") (passed to evaluator: '0x')
PyInt_FromLong not found.


willwray avatar Nov 22 '22 11:11 willwray

That's invalid input, and it did give a fairly good hint as to what's invalid about it.

ned14 avatar Nov 22 '22 15:11 ned14

Oops, I was overzealous in reducing the reproducer to less than minimal... Here's a reproducer that actually preprocesses:

#define CAT_(A,B)A##B
#define CAT(A,B)CAT_(A,B)

#define Ox 0x
#if CAT(Ox,0)
#endif

willwray avatar Nov 22 '22 15:11 willwray

It appears that (passed to evaluator: '0x0') is somehow lexed as CPP_INTEGER followed by CPP_ID, where it should remain a single preprocessing token.

willwray avatar Nov 22 '22 15:11 willwray

FYI, the error was hit using pcpp to do codegen with this preprocessing library https://github.com/willwray/IREPEAT in processing 'vertical' repetitions - here's one of the many problematic lines https://github.com/willwray/IREPEAT/blob/master/VREPEATx10.hpp#L11

(it works with gcc, clang, and the new conforming msvc preprocessor)

willwray avatar Nov 22 '22 15:11 willwray

Also FYI, I'm looking at using pcpp to create an amalgamated header (convenient for use on Compiler Explorer via a single #include<url>)

I'm also evaluating if it can create nicer codegen than the native cpp's. It seems to create more empty lines than gcc and clang, but far fewer than msvc.

willwray avatar Nov 22 '22 15:11 willwray

The "PyInt_FromLong not found." spam seems to be coming from the debugger; it is a red herring.

willwray avatar Nov 22 '22 15:11 willwray

pcpp lacks a pp-number token (defined in the C++ standard; the same holds for C11 and C99), so the tokenization wrongly chooses CPP_INTEGER:

> import re
> ppint = r'(((((0x)|(0X))[0-9a-fA-F]+)|(\d+))([uU][lL]|[lL][uU]|[uU]|[lL])?)'
> match = re.search(ppint,"0x")
> match.group()
: '0'

when it should choose pp-number as the max-munch

> ppnum = r"\.?[0-9]([A-Za-z_][\w_]*|[eEpP][-+]|'[a-zA-Z0-9_])*"
> match = re.search(ppnum,"0x")
> match.group()
: '0x'
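As a rough sketch of the max-munch difference, the snippet below runs both patterns over a few spellings. PPINT is the integer pattern quoted above; PPNUM is a hypothetical pattern written here from the pp-number grammar (it also allows '.' and digits as continuation characters) and is not taken from pcpp. The helper name munch is likewise made up for illustration.

```python
import re

# Integer pattern as quoted earlier in this thread.
PPINT = re.compile(
    r"(((((0x)|(0X))[0-9a-fA-F]+)|(\d+))([uU][lL]|[lL][uU]|[uU]|[lL])?)"
)

# Hypothetical pp-number pattern (an assumption, not pcpp's code):
# optional '.', one digit, then any run of: exponent letter plus sign,
# digit separator plus alphanumeric, or a single alphanumeric, '_' or '.'.
PPNUM = re.compile(r"\.?[0-9](?:[eEpP][+-]|'[0-9A-Za-z_]|[0-9A-Za-z_.])*")


def munch(pattern, text):
    """Return the longest prefix of text that pattern accepts, or None."""
    m = pattern.match(text)
    return m.group() if m else None


for text in ("0x", "0x0", "1e-1", "1.5f", "0x1p+3", "1'000", "1+2"):
    print(f"{text!r:10} ppint={munch(PPINT, text)!r:8} ppnum={munch(PPNUM, text)!r}")
```

With the integer pattern, '0x' and '1e-1' are cut after the first digit, which is what leaves a stray CPP_ID behind; the pp-number pattern keeps each spelling whole and still stops before the '+' in '1+2'.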

In phase 3 input is decomposed into preprocessing tokens, then phase 4 executes # directives and recurses back through 1,2,3...

Only in phase 7 are preprocessing tokens converted into tokens for translation.

pcpp only has one set of tokens (I'm trying to hack in a CPP_NUMBER token, no luck yet)
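To illustrate the phase 3 / phase 4 split, here is a minimal sketch of a pp-token lexer with a ## paste on top. It is heavily simplified (no strings, comments, or multi-character punctuators), and every name in it (PP_TOKEN, pp_tokens, paste) is hypothetical rather than part of pcpp.

```python
import re

# Hypothetical, simplified phase-3 lexer: pp-numbers, identifiers, and
# single-character punctuators only. Not pcpp's token set.
PP_TOKEN = re.compile(
    r"""
    (?P<PP_NUMBER>\.?[0-9](?:[eEpP][+-]|'[0-9A-Za-z_]|[0-9A-Za-z_.])*)
  | (?P<IDENTIFIER>[A-Za-z_]\w*)
  | (?P<PUNCT>[^\sA-Za-z_0-9])
    """,
    re.VERBOSE,
)


def pp_tokens(text):
    """Split text into (kind, spelling) preprocessing tokens."""
    return [(m.lastgroup, m.group()) for m in PP_TOKEN.finditer(text)]


def paste(left, right):
    """Model A##B: join the spellings and require a single pp-token."""
    spelling = left + right
    tokens = pp_tokens(spelling)
    if len(tokens) != 1 or tokens[0][1] != spelling:
        raise ValueError(f"pasting {left!r} and {right!r} is not one pp-token")
    return tokens[0]


print(pp_tokens("CAT(Ox,0)"))
print(paste("0x", "0"))
```

Under these assumptions the macro body '0x' stays one PP_NUMBER token, so pasting it with '0' gives the single token '0x0'. Deciding whether that spelling is an integer can then wait until the #if expression is evaluated.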

willwray avatar Nov 22 '22 19:11 willwray

Help! Can't work out how to hack it.

Do the lextab.py and parsetab.py tables have to be regenerated? If so, how?

There's a comment on the in_production variable:

in_production = 1  # Set to 0 if editing pcpp implementation!

When it is set to zero, my edits are still ignored: PLY introspects the new CPP_NUMBER token, but then it seems to get lost at some point (maybe because the table files are still used).

willwray avatar Nov 22 '22 23:11 willwray

Related issue #71 also notes the incorrect parse as glued CPP_INTEGER and CPP_ID.

willwray avatar Nov 23 '22 10:11 willwray

This could be a straightforward fix (still can't work out how to test it).

The current gcc lex.cc only processes CPP_NUMBER.

This 2001 bugfix commit to the C preprocessor, "c-lex.c (c_lex): Remove CPP_INT, CPP_FLOAT cases", with the note

Don't use CPP_INT, CPP_FLOAT; CPP_NUMBER is enough

shows that pp-number is sufficient for preprocessor lexing.

Then, for evaluator.py processing of #if conditionals, only CPP_INTEGER is needed: "After all macro expansion and evaluation of ... ." "Then the expression is evaluated as an integral constant expression".

The current evaluator should correctly interpret any CPP_INTEGER.

In other words, CPP_INTEGER should be needed only for the evaluator (and where the CPP_INTEGER##CPP_ID combo is a UDL user-defined literal)
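A minimal sketch of that late classification step, assuming the lexer hands the evaluator a pp-number spelling: the function below accepts it only if it is a C integer constant and returns its value. The names INTEGER and integer_value are hypothetical, this is not how pcpp's evaluator.py is written, and binary literals and digit separators are left out.

```python
import re

# Hypothetical check (not pcpp's evaluator): hex, octal, or decimal digits
# followed by an optional unsigned/long suffix.
INTEGER = re.compile(
    r"""
    (?: 0[xX](?P<hex>[0-9a-fA-F]+)
      | (?P<oct>0[0-7]*)
      | (?P<dec>[1-9][0-9]*) )
    (?:[uU](?:ll|LL|[lL])?|(?:ll|LL|[lL])[uU]?)?
    """,
    re.VERBOSE,
)


def integer_value(spelling):
    """Return the value of a pp-number that is an integer constant."""
    m = INTEGER.fullmatch(spelling)
    if m is None:
        raise ValueError(f"{spelling!r} is a pp-number but not an integer constant")
    if m.group("hex"):
        return int(m.group("hex"), 16)
    if m.group("oct"):
        return int(m.group("oct"), 8)
    return int(m.group("dec"), 10)


print(integer_value("0x0"), integer_value("017"), integer_value("42UL"))
```

A spelling such as '0x' or '1e-1' is still a valid pp-number, so it only becomes an error here, if it actually reaches a #if expression without having been pasted into something valid first.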

Possible issues

  • pp-number is a broad superset that also matches invalid numbers
  • see lex.cc cpp_avoid_paste "avoid an accidental token paste"
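The accidental-paste concern can be sketched by re-lexing: if two adjacent tokens were printed without a space, would the first token read back with a different spelling? This is only an illustration of the idea, not gcc's actual cpp_avoid_paste logic. The simplified lexer and the name needs_space are assumptions, and multi-character punctuators such as '++' are not modelled.

```python
import re

# Hypothetical simplified pp-token pattern: pp-number, identifier, or any
# other single non-space character.
PP_TOKEN = re.compile(
    r"\.?[0-9](?:[eEpP][+-]|'[0-9A-Za-z_]|[0-9A-Za-z_.])*|[A-Za-z_]\w*|\S"
)


def needs_space(left, right):
    """True if printing left and right adjacently would change the first token."""
    first = PP_TOKEN.match(left + right)
    return first is None or first.group() != left


for pair in (("0", "x"), ("1e", "+"), ("1", "+"), ("x", "0"), ("+", "x")):
    print(pair, needs_space(*pair))
```

Because pp-number is so greedy, pairs like '0' 'x' or '1e' '+' would fuse into one token on output, so an output stage that switches to pp-number lexing has to keep a separating space between them.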

willwray avatar Nov 23 '22 12:11 willwray

You may find the ply parser docs at https://www.dabeaz.com/ply/ of use on how it works and generates the precalculated table files.

ned14 avatar Nov 23 '22 16:11 ned14

Related issue in Boost.Wave 👋: the BOOST_PP_CAT(1e, -1) pp-token bug, fixed in early 2006.

willwray avatar Nov 23 '22 21:11 willwray