Add a script to clean up pseudocode generated by common decompilers to improve Semgrep parsing
Suggested improvements from https://cc-sw.com/semgrep-guide-for-a-security-engineer-part-5-of-6/:
Remove IDA decorators that have been shown to cause issues with the Semgrep parser:
virtual thunk to
non-virtual thunk to
vtable for
typeinfo for
guard variable for
VTT for
See also: semgrep-vs-decompiler.zip
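The IDA cleanup step could look roughly like the sketch below. It is a minimal example, not a tested implementation: the names `IDA_DECORATORS` and `strip_ida_decorators` are hypothetical, and it assumes the decorators appear as the literal phrases listed above, followed by whitespace. Any quoting IDA places around these names is left untouched and may need extra handling.

```python
import re

# Decorator phrases from the list above (assumed to appear verbatim in the pseudocode).
IDA_DECORATORS = [
    "non-virtual thunk to",
    "virtual thunk to",
    "vtable for",
    "typeinfo for",
    "guard variable for",
    "VTT for",
]

# Single alternation over the escaped phrases; trailing whitespace is consumed
# so that removing a decorator does not leave a double space behind.
_IDA_RE = re.compile(
    r"\b(?:%s)\s*" % "|".join(re.escape(d) for d in IDA_DECORATORS)
)


def strip_ida_decorators(text: str) -> str:
    """Return text with every listed IDA decorator phrase removed."""
    return _IDA_RE.sub("", text)
```

For example, `strip_ida_decorators("void non-virtual thunk to Foo::bar()")` yields `"void Foo::bar()"`.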
Ghidra emits some decorators (calling-convention and attribute keywords) that we want to remove from our pseudocode files:
__thiscall
__cdecl
__noreturn
__fastcall
[TODO: check whether they're indeed problematic and whether there are others; consider opening an issue/PR with Semgrep]
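A possible shape for the Ghidra step, assuming the four keywords above are the complete set and always appear as standalone tokens. `GHIDRA_KEYWORDS` and `strip_ghidra_keywords` are hypothetical names used only for illustration.

```python
import re

# Keywords from the list above; extend this once the TODO is resolved.
GHIDRA_KEYWORDS = ["__thiscall", "__cdecl", "__noreturn", "__fastcall"]

# Word boundaries keep identifiers such as "__cdecl_helper" intact;
# trailing spaces/tabs are consumed to avoid leaving a double space.
_GHIDRA_RE = re.compile(
    r"\b(?:%s)\b[ \t]*" % "|".join(re.escape(k) for k in GHIDRA_KEYWORDS)
)


def strip_ghidra_keywords(text: str) -> str:
    """Return text with the listed Ghidra keywords removed."""
    return _GHIDRA_RE.sub("", text)
```

With this sketch, `void __thiscall Foo::bar(Foo *this)` becomes `void Foo::bar(Foo *this)`.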
Also, handle try/catch/throw constructs (IDA) and possibly other C++ features (Ghidra, other decompilers) by changing the pseudocode file extension to .cpp where appropriate.
EDIT: try scanning the same pseudocode with .c and .cpp extensions with my ruleset, and see what changes. It might be a practical solution to export everything as .cpp regardless of its content, based on how my rules and Semgrep (seem to) work.
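Both options (content-based detection and exporting everything as .cpp) can be sketched with one helper. This is a rough heuristic under stated assumptions: it only looks for the try/catch/throw keywords, it can be fooled by those words inside comments or string literals, and `pick_extension`, `renamed_path`, and the `force_cpp` flag are hypothetical names.

```python
import re
from pathlib import Path

# Keywords that suggest the pseudocode should be parsed as C++.
_CPP_RE = re.compile(r"\b(?:try|catch|throw)\b")


def pick_extension(text: str, force_cpp: bool = False) -> str:
    """Return ".cpp" if forced or if C++ exception keywords are present, else ".c"."""
    return ".cpp" if force_cpp or _CPP_RE.search(text) else ".c"


def renamed_path(path: str, text: str, force_cpp: bool = False) -> Path:
    """Return path with its suffix replaced by the chosen extension."""
    return Path(path).with_suffix(pick_extension(text, force_cpp))
```

Setting `force_cpp=True` corresponds to exporting everything as .cpp regardless of content.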