mir Decouple the C parser?

Hello!

I am looking for a C parser so that I can generate wrappers for the V project - and since it itself generates C, it'd be interesting to see if I could implement V as a scripting language by utilizing MIR itself. Bit of a back and forth, really. But mainly, I would like to know this: Can I decouple the C parser/AST generator to then read definitions like symbols and structs from the input - and what source files would I need? I was looking around in the c2mir subfolder to find a way to get started.

Thank you in advance :)

Kind regards, Ingwie

Apr 19 '21 17:04 IngwiePhoenix

Thank you for your interest in the MIR project.

If you want to generate MIR directly from the V compiler, you need only the mir.c and mir-gen.c files (and the files they include).

If you need a C front-end without MIR generation, it is possible to decouple the C front-end. I don't think it is a big job, as c2mir is a multipass compiler (prepro, parser, syntax analysis, MIR generation). The front-end is in the file c2mir.c. To separate the front-end, the code for MIR generation should be removed, and the (augmented) AST would be the final result. You just need to split off the code that is not executed when -fsyntax-only is used.
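
As a rough illustration of that split, here is a minimal sketch of a pass driver that stops before code generation when a syntax-only flag is set. All names here (`struct pipeline`, `run_pass`, `run_pipeline`, the `PASS_*` constants) are hypothetical and are not taken from c2mir.c; the pass bodies are placeholders.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pass identifiers; c2mir's internal names differ. */
enum pass { PASS_PREPRO, PASS_PARSE, PASS_CHECK, PASS_GEN, PASS_COUNT };

struct pipeline {
  bool syntax_only;           /* plays the role of -fsyntax-only */
  int passes_run[PASS_COUNT]; /* records which passes were executed */
};

/* Placeholder pass body: only records that the pass ran. */
static bool run_pass(struct pipeline *p, enum pass id) {
  p->passes_run[id] = 1;
  return true;
}

/* Run the passes in order, skipping generation when syntax_only is set.
   Returns the last pass that was executed. */
static enum pass run_pipeline(struct pipeline *p) {
  enum pass last = PASS_PREPRO;
  for (int id = PASS_PREPRO; id < PASS_COUNT; id++) {
    if (id == PASS_GEN && p->syntax_only) break; /* front-end only */
    if (!run_pass(p, (enum pass) id)) break;     /* stop on first failure */
    last = (enum pass) id;
  }
  return last;
}
```

With `syntax_only` set, `run_pipeline` returns `PASS_CHECK` and the generation pass is never entered, which is the boundary a stand-alone front-end would be cut along.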

The AST is pretty simple: it is nested lists with attributes attached to some list nodes. The lists for a correct C program are described by regular expressions in the comment at https://github.com/vnmakarov/mir/blob/803846b7b7294f3b2ae2bc8eb2e506f68b4b7c4c/c2mir/c2mir.c#L455.
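
To make the "nested lists with attributes" shape concrete, here is a minimal sketch of such a tree and of a walk that collects the names of top-level declarations. The types and functions (`struct node`, `enum kind`, `add_child`, `collect_names`) are hypothetical and do not mirror c2mir's actual `node_t` layout; the sketch also assumes that the first child of a declaration is its identifier.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical node kinds, far fewer than a real C front-end has. */
enum kind { K_MODULE, K_FUNC, K_STRUCT, K_ID };

struct node {
  enum kind kind;
  const char *name;   /* set for K_ID leaves only */
  struct node *first; /* first child: a node's operands form a list */
  struct node *next;  /* next sibling in the parent's list */
  void *attr;         /* optional attribute attached by a later pass */
};

/* Append child to the end of parent's operand list. */
static void add_child(struct node *parent, struct node *child) {
  struct node **link = &parent->first;
  while (*link != NULL) link = &(*link)->next;
  child->next = NULL;
  *link = child;
}

/* Store the names of top-level declarations of the wanted kind in out
   (at most cap entries) and return how many were stored. */
static size_t collect_names(const struct node *module, enum kind wanted,
                            const char **out, size_t cap) {
  size_t n = 0;
  for (const struct node *d = module->first; d != NULL; d = d->next)
    if (d->kind == wanted && d->first != NULL && d->first->kind == K_ID && n < cap)
      out[n++] = d->first->name;
  return n;
}
```

A wrapper generator would run a walk like `collect_names(&module, K_STRUCT, names, cap)` once per declaration kind it cares about.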

I think that separating the front-end makes sense, and it is a much easier task for me. I'll evaluate how much time it will take and will probably start working on it in May.

I wrote a blog post devoted to the C2MIR compiler, and I hope it will be published on the Red Hat Developer blog soon (although when that happens does not depend on me). When it is published, I'll add a reference to it from README.md.

Apr 20 '21 21:04 vnmakarov

Hello.

Thank you very much for this information - it will be very helpful. So in short, anything that is related to -fsyntax-only is what I need? Good enough for me. All I am after is an AST from which I can extract symbol information (functions, structs, unions, enums, typedefs, ...), so this should do perfectly.

As for V to MIR, that might end up being a side product as I go through the MIR source. Since the aforementioned files are quite self-contained, it'll be interesting to see how far this can go. A while ago, V split its backend, so working off their C generator and turning it into a MIR generator would probably not be too hard, and would probably help me in learning MIR itself, too.

Again, thanks for all the information!

Apr 21 '21 11:04 IngwiePhoenix

Alright so I have been reading across c2mir.c a lot now and I noticed a few key things:

* As the front-end is - aside from MIR generation - very stand-alone, it should be possible to export some of its API through `c2mir.h`.

* Namely, `node_t parse(...)` and its `pre(...)` counterpart. Suggestion: Wrap or rename these to `c2mir_preprocess()` and `c2mir_parse()` respectively, where a c2mir context is the required argument.

This implementation of a C preprocessor, lexer/parser and AST generator is extremely powerful and very small. Making it accessible without MIR might have other useful benefits (e.g. walking the AST to drop unused functions and other optimizations, a much smaller implementation of what clang-tidy does, or simply gathering stats). I am quite impressed by that implementation, honestly. I wasn't expecting it to be so loaded with capabilities. Great work there! :)

I'll keep on reading though but I just thought I'd leave my thoughts here as I go along.

Apr 21 '21 12:04 IngwiePhoenix

> Alright so I have been reading across c2mir.c a lot now and I noticed a few key things:
>
> * As the front-end is - aside from MIR generation - very stand-alone, it should be possible to export some of its API through `c2mir.h`.
>
> * Namely, `node_t parse(...)` and its `pre(...)` counterpart. Suggestion: Wrap or rename these to `c2mir_preprocess()` and `c2mir_parse()` respectively, where a c2mir context is the required argument.

The separation might be straightforward, but I need to think about how best to do it. Besides the AST, there is a symbol table and some machine-dependent info used by the c2m front-end. I think splitting the c2mir.c file to reflect the separation of prepro and front-end would be nice too.

> This implementation of a C preprocessor, lexer/parser and AST generator is extremely powerful and very small. Making it accessible without MIR might have other useful benefits (e.g. walking the AST to drop unused functions and other optimizations, a much smaller implementation of what clang-tidy does, or simply gathering stats). I am quite impressed by that implementation, honestly. I wasn't expecting it to be so loaded with capabilities. Great work there! :)

Thank you. I had no requirements for the C front-end besides making it simple and structured. Therefore the speed of the parser is far from the best.

The bulk single-threaded speed of the c2m front-end (e.g. on a C file of 100K lines) is about 1/10 that of tcc and about 10 times that of GCC. The startup time (compiling a small program, e.g. 20-50 lines of preprocessed C code) is the same as tcc's and 10 times faster than GCC's, which matters more for JIT compilation than bulk speed. Therefore some people use c2m as the interface for their JITs instead of generating MIR directly.

To match tcc's bulk speed, the c2m design would have to change radically. It would have to be a one-pass (two-pass at most) compiler with a very compact IR. An AST is too big for this; e.g. even the source location (which includes file, line, and position in line) is too big in comparison with tcc's (which reports only the source line number). The parser should not backtrack and should use an operator precedence parser. And so on.
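
For readers unfamiliar with the technique, here is a minimal sketch of an operator precedence parser in the precedence climbing style: it handles `+ - * /` and parentheses over non-negative integers, never backtracks, and computes the value directly in one pass instead of building an AST. This is an illustration only; it is not how c2m or tcc parse expressions, and all names (`struct lexer`, `prec`, `parse_expr`, `eval`) are made up for the example.

```c
#include <ctype.h>

/* Cursor over the input text. */
struct lexer { const char *p; };

static void skip_ws(struct lexer *lx) {
  while (isspace((unsigned char) *lx->p)) lx->p++;
}

/* Binding power of a binary operator, or 0 if c is not one. */
static int prec(char c) {
  switch (c) {
  case '+': case '-': return 1;
  case '*': case '/': return 2;
  default: return 0;
  }
}

static long parse_expr(struct lexer *lx, int min_prec);

/* primary := number | '(' expr ')' */
static long parse_primary(struct lexer *lx) {
  skip_ws(lx);
  if (*lx->p == '(') {
    lx->p++;
    long inner = parse_expr(lx, 1);
    skip_ws(lx);
    if (*lx->p == ')') lx->p++;
    return inner;
  }
  long v = 0;
  while (isdigit((unsigned char) *lx->p)) v = v * 10 + (*lx->p++ - '0');
  return v;
}

/* Consume operators whose precedence is at least min_prec. */
static long parse_expr(struct lexer *lx, int min_prec) {
  long lhs = parse_primary(lx);
  for (;;) {
    skip_ws(lx);
    char op = *lx->p;
    int p = prec(op);
    if (p == 0 || p < min_prec) return lhs;
    lx->p++;
    long rhs = parse_expr(lx, p + 1); /* p + 1 gives left associativity */
    switch (op) {
    case '+': lhs += rhs; break;
    case '-': lhs -= rhs; break;
    case '*': lhs *= rhs; break;
    case '/': lhs = rhs != 0 ? lhs / rhs : 0; break; /* sketch: x/0 yields 0 */
    }
  }
}

/* Evaluate a whole expression string. */
static long eval(const char *text) {
  struct lexer lx = { text };
  return parse_expr(&lx, 1);
}
```

Each token is examined once and the only state is the recursion stack, which is why this style suits a one-pass design.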

The c2m preprocessor is a classical one that uses markers and works as defined in the C standard. It processes the text several times for macro substitution. There is a faster preprocessor design based on hide sets, described in https://www.spinellis.gr/blog/20060626/x3J11-86-196.pdf

Overall, the front-end is not fast for big programs but is fast enough for a typical dynamically typed language JIT.
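
The core idea of the hide-set approach is that every token carries the set of macro names that may no longer expand it, which stops recursive expansion without marker tokens. Below is a minimal sketch restricted to object-like macros, with the hide set stored as a bitmask (so at most 32 macros). It is a simplification for illustration, not the algorithm from the linked document in full and not c2m's implementation; `struct macro`, `struct token`, `find_macro`, and `expand` are hypothetical names.

```c
#include <string.h>

#define MAX_TOKS 64

struct macro { const char *name; const char *body[4]; int body_len; };

struct token {
  const char *text;
  unsigned hide; /* bit i set: macro i must not expand this token */
};

/* Return the index of the macro called name, or -1 if there is none. */
static int find_macro(const struct macro *ms, int nm, const char *name) {
  for (int i = 0; i < nm; i++)
    if (strcmp(ms[i].name, name) == 0) return i;
  return -1;
}

/* Expand object-like macros in toks[0..len) in place. The array must have
   room for MAX_TOKS tokens. Returns the new length, or -1 on overflow. */
static int expand(const struct macro *ms, int nm, struct token *toks, int len) {
  int i = 0;
  while (i < len) {
    int m = find_macro(ms, nm, toks[i].text);
    if (m < 0 || (toks[i].hide & (1u << m))) { i++; continue; }
    unsigned hide = toks[i].hide | (1u << m); /* body inherits set plus m */
    int blen = ms[m].body_len;
    if (len - 1 + blen > MAX_TOKS) return -1;
    memmove(&toks[i + blen], &toks[i + 1],
            (size_t) (len - i - 1) * sizeof toks[0]);
    for (int k = 0; k < blen; k++) {
      toks[i + k].text = ms[m].body[k];
      toks[i + k].hide = hide;
    }
    len += blen - 1;
    /* i is not advanced: the replacement is rescanned from its start */
  }
  return len;
}
```

With the macros `A` defined as `B + 1` and `B` defined as `A`, expanding the single token `A` terminates with `A + 1`, because the final `A` carries both names in its hide set.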

> I'll keep on reading though but I just thought I'd leave my thoughts here as I go along.

Thank you. I always appreciate any feedback, even if it is not positive. It helps to improve the project.

Unfortunately, the project is moving slowly, as I have little time for it. But I'll soon have some more time for it over the next 5 months.

Apr 21 '21 13:04 vnmakarov