SemanticDiff Support for C/C++ languages

Support for C and C++ languages. Refactoring sometimes moves code around quite a bit, and a semantically aware diff tool could be very useful.

Feb 09 '23 11:02 jornj

Thank you for your feature request.

We have already thought about supporting C / C++ ourselves. C in particular seems quite difficult to implement, though, because macros are used so often that a preprocessor step is necessary before the source code can be parsed. This would mean it's not just a diff between two files: you also need all header files (including system headers) to generate the diff. So far we haven't come up with a good solution. We will keep you updated in this issue. :)

Regarding your use case, is it a C or C++ project, and in particular, how relevant is support for macro expansion to ensure proper parsing?

Feb 09 '23 14:02 slackner

It is mostly C projects, but some use frameworks that are define-heavy.

I was hoping that you can just treat the macros as 'yet another function call' and avoid expanding them, although I see some issues regarding #ifdefs.

Feb 09 '23 16:02 jornj

In some cases just treating them as function calls works. But not if there is too much ~~template~~ define magic. E.g.,

```c
LIST_FOR_EACH(cursor, &list)
{
    // do something with cursor
}
```

This wouldn't be valid syntax without the #define to map it to a for loop.
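To make the point concrete, here is a minimal sketch of how such a macro might be defined. The `struct node` and `struct list` types, the macro body, and `sum_list` are all hypothetical and not taken from any particular framework; real list macros differ in detail.

```c
#include <stddef.h>

/* Hypothetical list types, for illustration only. */
struct node {
    int value;
    struct node *next;
};

struct list {
    struct node *head;
};

/* Assumed definition: the macro expands to the header of a for loop,
   so the braces that follow the call site become the loop body. */
#define LIST_FOR_EACH(cursor, list) \
    for ((cursor) = (list)->head; (cursor) != NULL; (cursor) = (cursor)->next)

/* Sums all values in the list using the macro. */
int sum_list(struct list *list)
{
    struct node *cursor;
    int total = 0;

    LIST_FOR_EACH(cursor, list)
    {
        total += cursor->value;
    }
    return total;
}
```

Without the `#define`, a parser sees what looks like a function call with no semicolon, followed by an unrelated block, and has no way to know the block is a loop body.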

Feb 09 '23 16:02 slackner

Well.. if you don't go looking for the semi-colon, this is valid syntax. It's a line, possibly with a syntax error and then a scope.

The diff tool will quite often see just this, moving from broken code to fixed code.

Feb 10 '23 12:02 jornj

I would love support for C++ in SemanticDiff! C++ has a much more complex syntax than C, with templates all over, but macros are typically used much less, so maybe they would not be as much of a problem as in old C. In addition to special macros, there are #if, #ifdef, etc., which are challenging for semantic parsing. My proposal would be:

- consider macros as plain identifiers
- have some configuration file where users can #define their macros, like LIST_FOR_EACH
- SemanticDiff would preprocess those user-defined macros in memory before comparing
- the preprocessor evaluates #if conditions based on the user's macro list (assuming all others are undefined)
- the preprocessor converts all code in discarded #if blocks to comments. Those are assumed to be less important and not guaranteed to be valid syntax, but we should not ignore them totally.
- the preprocessor also converts all #if, #else, #endif, etc. lines into comments, as they are not to be considered in the syntax parsing, but cannot be completely ignored in the diff.
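The last three steps of this proposal could be sketched roughly as follows. This is a simplified illustration, not anything SemanticDiff implements: the function name `neutralize_conditionals` and its interface are hypothetical, and it only understands `#ifdef`, `#ifndef`, `#else` and `#endif` (no `#if` expressions, no `#elif`). Any macro not in the user-supplied list is treated as undefined.

```c
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 32

/* Returns 1 if name is in the user-supplied macro list, else 0. */
static int is_defined(const char *name, const char *const *macros, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(name, macros[i]) == 0)
            return 1;
    return 0;
}

/* Copies src to out, turning conditional directives and discarded lines
   into "// " comments. Returns 0 on success, -1 on overflow, overly long
   lines, or unbalanced directives. */
int neutralize_conditionals(const char *src, const char *const *macros,
                            size_t n_macros, char *out, size_t out_size)
{
    int active[MAX_DEPTH] = {1}; /* active[d]: is code at depth d kept? */
    int taken[MAX_DEPTH] = {1};  /* taken[d]: was the #ifdef/#ifndef at depth d true? */
    int depth = 0;
    size_t used = 0;

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    while (*src != '\0') {
        char line[256], word[16] = "", name[64] = "";
        size_t len = strcspn(src, "\n");
        int is_cond = 0;

        if (len >= sizeof line)
            return -1;
        memcpy(line, src, len);
        line[len] = '\0';

        const char *p = line + strspn(line, " \t");
        if (*p == '#') {
            sscanf(p + 1, " %15[a-z] %63[A-Za-z0-9_]", word, name);
            if (strcmp(word, "ifdef") == 0 || strcmp(word, "ifndef") == 0) {
                if (depth + 1 >= MAX_DEPTH)
                    return -1;
                depth++;
                /* true for "#ifdef X" with X defined, or "#ifndef X" with X undefined */
                taken[depth] = is_defined(name, macros, n_macros) == (word[2] == 'd');
                active[depth] = active[depth - 1] && taken[depth];
                is_cond = 1;
            } else if (strcmp(word, "else") == 0) {
                if (depth == 0)
                    return -1;
                active[depth] = active[depth - 1] && !taken[depth];
                is_cond = 1;
            } else if (strcmp(word, "endif") == 0) {
                if (depth == 0)
                    return -1;
                depth--;
                is_cond = 1;
            }
        }

        /* Directive lines and lines in discarded branches become comments. */
        int n = snprintf(out + used, out_size - used, "%s%s\n",
                         (is_cond || !active[depth]) ? "// " : "", line);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;

        src += len;
        if (*src == '\n')
            src++;
    }
    return depth == 0 ? 0 : -1;
}
```

With `USE_FAST` (a made-up name) in the macro list, the `#else` branch of an `#ifdef USE_FAST` block comes out commented, while the kept branch passes through unchanged. Both versions of a file would be run through this step before parsing, so changes inside discarded branches still show up in the diff as comment changes.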

Jul 11 '23 14:07 prapin

@prapin Thanks for your idea. The main reason we haven't implemented C++ support yet is that no generic parser framework can parse the language correctly. As explained in this StackOverflow answer, macros aren't the only issue. C and C++ are ambiguous in various ways, and you can't disambiguate these cases without tracking variable and type declarations. So you either need a specialized parser that does this on the fly, or one that creates all possible parse trees, with the correct one selected in a post-processing step. To make matters worse, you can't do this reliably without also parsing all header files and implementing the preprocessor.
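A classic illustration of this kind of ambiguity (not necessarily the exact case from the linked answer) is the token sequence `T * x`. The function names below are made up for the example; whether the tokens form a declaration or a multiplication depends entirely on what `T` was declared as earlier.

```c
typedef int T;

/* Here T names a type, so "T * x;" declares x as a pointer to T. */
int as_declaration(void)
{
    int value = 5;
    T * x;
    x = &value;
    return *x;
}

/* Here a local variable T shadows the typedef, so "T * x" multiplies. */
int as_expression(void)
{
    int T = 6, x = 7;
    return T * x;
}
```

A parser that does not track the `typedef` and the shadowing local declaration cannot tell these two readings apart, and in real code the `typedef` usually lives in a header file.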

We thought about using Clang or GCC directly to get correct ASTs, but that would require a fully functional build environment. This might work to some extent with the VS Code extension, but wouldn't be an option for our GitHub App. As you pointed out yourself, it would still be tricky to handle code parts that get removed by the preprocessor. The build environment might also not be compatible with the displayed diff.

The other option would be to use a best effort approach instead and accept that some parse results will be incorrect (and maybe relax the grammar rules to handle macros better). This could lead to actual changes being classified as an invariance which is something we try very hard to avoid.

Personally, I don't think either approach is really great, but I would love to hear your opinions :-).

Jul 11 '23 16:07 mmueller2012

I'd really like this. I don't think there is an alternative to having a compiler parse the code, especially with C++. Even Eclipse/CDT is switching to clangd, because templates got too difficult to maintain. (For me, their killer feature is build-system- and macro-aware code completion.)

As for GitHub: as we are all developers, I don't think this needs to work out of the box without user support. One option would be to generate the information needed for the diff in CI, publish it as an artifact, and then use that.

Sep 20 '23 07:09 cmorty

I'm not sure of the best way to upvote this enhancement request, but I certainly use C++ a lot and a semantic diff would be fantastic. (Yes, I'm sure clangd or similar would be required.)

Aug 01 '24 22:08 dewilcox