OrangeC Modify the tokenizer to be reusable with different keywords for the same project.

I'm mostly creating this issue as a reminder for myself and as a heads-up. The tokenizer needs to be reworked so that it can support what I want to do with the embed code. For embed we are only given a single line, and the rest of that line isn't lexed, so the line has to be properly lexed and parsed before we can do any syntax checking on it.

I'm thinking about making it a series of statically compiled templates, header-only, so that the code is generated straight from the header and any specialization of the unsigned_map gets its tokenizer generated automatically.

Thoughts on this redesign before I go ahead and do it?

Jul 23 '23 14:07 chuggafan

well it looks like the intent of the current design is you would just instantiate a new object of type Tokenizer and give the constructor whatever keyword table you want to give it... ocpp uses several different keyword tables that way... for example in expressions, in #define processing and the keywords for all the preprocessor directives...

the main problem with the current approach though is there is a single enumeration for all the keywords across all the different keyword tables so I could see maybe using a template to make that better lol...
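As a rough sketch of the template idea (not the actual ocpp code), the keyword enumeration could become a template parameter so each table gets its own enum instead of sharing one. The names `BasicTokenizer`, `KeywordTable`, `DirectiveKw` and `ExprKw` are made up for illustration:

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical per-table enumerations; each keyword table gets its own type
// instead of every table sharing one project-wide enumeration.
enum class DirectiveKw { define_, include_, embed_ };
enum class ExprKw { defined_ };

template <class Keyword>
using KeywordTable = std::unordered_map<std::string, Keyword>;

// Minimal stand-in for a tokenizer: only the keyword lookup is shown.
template <class Keyword>
class BasicTokenizer
{
  public:
    explicit BasicTokenizer(KeywordTable<Keyword> table) : table_(std::move(table)) {}

    // Returns the keyword for 'word', or std::nullopt if it is not in this table.
    std::optional<Keyword> Lookup(const std::string& word) const
    {
        auto it = table_.find(word);
        if (it == table_.end())
            return std::nullopt;
        return it->second;
    }

  private:
    KeywordTable<Keyword> table_;  // owned copy, so temporaries are safe to pass in
};
```

With this shape, a `BasicTokenizer<DirectiveKw>` and a `BasicTokenizer<ExprKw>` are distinct types, so mixing up keywords from different tables becomes a compile error rather than a runtime surprise.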

is there something else that you think needs to be addressed with this?

Jul 23 '23 21:07 LADSoft

Nah, not really, and you summarized it pretty well, I'm just being extremely lazy and need to give myself a permanent reminder since doing this would mean I would also need to transition the entire file to being header-only...

Jul 23 '23 21:07 chuggafan

hm just remembered there is an added complication; the assembler and rc compiler also use variations of the tokenizer. I don't think it is used elsewhere but not 100% sure at this point... I think they did a wholesale copy of Token.cpp, probably so they could use a different enumeration for the keywords... so this change could be a very good thing for the project overall :smile:

Jul 23 '23 21:07 LADSoft

whenever you get around to doing this I've got a request: the current design is very awkward about the functions that decide which characters can start an identifier and which characters can appear inside one... maybe another template parameter to specify those would be good instead of the hack I've got going now... and along with that the main tokenizer should probably be passed in to ocpp, as it differs across the different projects... so maybe inheritance or something similar to make them look the same is in order... sorry for piling all this on, just thinking of what might make it better....
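One way the extra template parameter could look, as a sketch only: a small policy struct supplies the "start" and "continue" tests, and the scanner is templated on it. `CIdentifierChars`, `AsmIdentifierChars` and `ScanIdentifier` are hypothetical names, and the assembler rule shown (allowing `.` and `$`) is an assumption for illustration, not what oasm actually accepts:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Policy for C-like identifiers: letters or '_' to start, plus digits afterwards.
struct CIdentifierChars
{
    static bool IsStart(char c) { return std::isalpha(static_cast<unsigned char>(c)) || c == '_'; }
    static bool IsContinue(char c) { return std::isalnum(static_cast<unsigned char>(c)) || c == '_'; }
};

// Assumed assembler-style policy: same as C, but '.' and '$' are also allowed.
struct AsmIdentifierChars
{
    static bool IsExtra(char c) { return c == '.' || c == '$'; }
    static bool IsStart(char c) { return CIdentifierChars::IsStart(c) || IsExtra(c); }
    static bool IsContinue(char c) { return CIdentifierChars::IsContinue(c) || IsExtra(c); }
};

// Returns the identifier beginning at 'pos', or an empty string if none starts there.
template <class Chars>
std::string ScanIdentifier(const std::string& line, std::size_t pos = 0)
{
    if (pos >= line.size() || !Chars::IsStart(line[pos]))
        return {};
    std::size_t end = pos + 1;
    while (end < line.size() && Chars::IsContinue(line[end]))
        ++end;
    return line.substr(pos, end - pos);
}
```

The same line then scans differently depending on the policy: under the C policy `.text` does not start an identifier, while under the assumed assembler policy it does.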

Jul 23 '23 23:07 LADSoft

So I've actually begun to work on this and it seems to be a far more invasive change than I first thought, especially because I am also templating the identifier start and the characters inside an identifier. Fundamentally this means large portions of ocpp need to be rewritten as a header-only library, and anything that relies on it needs to be header-only as well... things like ppCond, ppDefine, ppMacro, ppPragma, etc. all construct new Tokenizers, so this change will bleed into the entirety of preprocessor creation, and I might transition large portions of the library to being header-only.

Bad news: this is going to cause extremely bad compilation penalties for ocpplib and anything that uses it.

Jul 30 '23 19:07 chuggafan

@chuggafan

This seems worthwhile to do. There is no problem about slowing compiles, hopefully I will eventually get it compiling faster and this will give me impetus to work on it lol... there isn't anything that is obvious to me at this point other than new/delete, as most functions seem to take small amounts of time in their own local processing....

Just so you are aware I've got a couple of branches going.

One is a working branch for delayload, which is limited to dlpe and olink and some library stuff, as well as some changes to the test makefiles.

Another is a working branch for changes for C23. But it has some other stuff as well, for example I switched a lot of enumerations to 'scoped' (in the compiler only) and I took your suggestion and added a new enumeration Dialect to get rid of all those bools for all the different language versions.

Dialect is in a file ppCommon.h in ocpp, and the changes to get rid of bools are handled in oasm, ocpp, occ, and orc... so that might affect any massive rewrite you are doing to ocpp... maybe I should merge what I've got so far as I think I'm done with preprocessor changes anyway... but I need to test more first. Let me know what you want me to do...

here is the Dialect enumeration:

enum class Dialect
{
    none,
    oasm,
    orc,
    c89,
    c99,
    c11,
    c2x,
    cpp11,
    cpp14,
    cpp17
};
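To illustrate how a single Dialect value can stand in for a set of per-version bools, here is a minimal sketch. The helper names `IsC`, `IsCpp` and `AtLeastC` are hypothetical, and the range checks assume the enumerators stay in the order listed above:

```cpp
// Same enumeration as above, repeated so the sketch is self-contained.
enum class Dialect { none, oasm, orc, c89, c99, c11, c2x, cpp11, cpp14, cpp17 };

// Hypothetical helpers; they rely on the declaration order of the enumerators.
constexpr bool IsC(Dialect d) { return d >= Dialect::c89 && d <= Dialect::c2x; }
constexpr bool IsCpp(Dialect d) { return d >= Dialect::cpp11; }

// True if 'd' is a C dialect at or after 'minimum' (for example c99 or later).
constexpr bool AtLeastC(Dialect d, Dialect minimum) { return IsC(d) && d >= minimum; }

// Checked at compile time, so a reordering of the enumeration is caught early.
static_assert(AtLeastC(Dialect::c11, Dialect::c99), "c11 is c99 or later");
static_assert(!IsC(Dialect::oasm), "oasm is not a C dialect");
```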

Jul 31 '23 01:07 LADSoft

i thought i should push it... I'm compiling the libcxx tests, if that passes I will push the current work in a bit...

Aug 01 '23 00:08 LADSoft

Works for me, if you have it on a separate branch you can always try to push the separate branch as well.

Aug 01 '23 00:08 chuggafan

I figured I would just push main to keep us from getting way out of sync...

Aug 01 '23 01:08 LADSoft

Work is still progressing with this, slowly. It's just a ton of very annoying changes that mean I have to basically obliterate any static variables and move them elsewhere if possible. It even means a (slight) redesign of my #embed system so that the preprocessor will automatically generate the correct embedder when we compile....

sigh, this is sadly more complex than I originally anticipated and is a very invasive change. Nothing on my end can compile yet because of how long this is taking (partly because I am being lazy about it, I will be honest).

Aug 23 '23 23:08 chuggafan

Thank you for taking this on. I think cleaning up the source code is as important as anything else we do :)

Aug 24 '23 02:08 LADSoft