
Parsing very big C headers

Open Filip150 opened this issue 2 years ago • 3 comments

Hello, I am currently using Ghidra to mod a game (library size 120 MB). It is much easier when you can see the game's data types, so I chose File > Parse C Source and imported il2cpp.h. Unfortunately this file has over 2 million lines (65+ MB), and for a whole 8 hours Ghidra just showed "PreProcessing". I realised the file might be too big, so I wrote a small program that split il2cpp.h into 20 separate headers. The first 2 parsed without problems: each took about 1 minute to preprocess and 1 minute to parse. The problem starts with the 3rd and later headers, because they use types that are defined only in earlier headers, for example "VirtualInvokeData", which is defined in header 1 but not in header 3. Including headers 1 and 2 in header 3 works, it just takes longer. But header 4 or 5 needs to include all of headers 1, 2 and 3, which ends up much like parsing the whole il2cpp.h at once, and the parse bogs down again. I attach a screenshot of a parse that failed because "VirtualInvokeData" is undefined.

I am wondering whether the parse settings have an option to reuse data types that were already imported and parsed, so that Ghidra would not report this error. Or maybe there is a way to successfully import the whole il2cpp.h header at once; right now it shows nothing but "PreProcessing" for 8 hours, so it will probably never finish.

Filip150, Jun 26 '22 19:06

Are you running ghidra in eclipse in debug mode, which will be very slow because of the exceptions used as part of normal parsing?

There should not be an issue with parsing the whole file. When multiple header files are given it produces a single file from the Preprocessor includes and macro expansion. These files can get rather large. Parsing the windows header files produces essentially a single file of about 72Meg that is then Parsed by the CParser.

My suspicion is there is some sort of infinite loop going on in the parse, most likely in the PreProcessor phase of parsing.

We will be adding a parse option that uses already parsed data types, but that won't help if the problem is in the C-Preprocessor phase. However, if all the data types are defined within the single parse, it will find them anyway.

If you try parsing the whole il2cpp.h file again, let it hang, then from the command line run jps (to get the Java pid of Ghidra), followed by jstack <pid> (to dump the stack).

If you can post the stack trace for the thread that is running the CParsing, I can take a look to get an idea what construct could be causing the issue.
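The jps and jstack steps above can be scripted. The following is a minimal Python sketch, not part of Ghidra. It assumes that jps and jstack from a JDK are on the PATH, that the Ghidra process has "ghidra" in the name reported by jps -l, and that the parser's stack frames contain "cparser" (an assumption based on Ghidra's ghidra.app.util.cparser package). All function names are hypothetical.

```python
import subprocess


def find_java_pids(jps_output, needle="ghidra"):
    """Return PIDs from `jps -l` output whose process name contains `needle`."""
    pids = []
    for line in jps_output.splitlines():
        parts = line.split(None, 1)  # "<pid> <main class or jar path>"
        if len(parts) == 2 and parts[0].isdigit() and needle.lower() in parts[1].lower():
            pids.append(int(parts[0]))
    return pids


def extract_threads(jstack_output, keyword="cparser"):
    """Return the thread blocks of a jstack dump that mention `keyword`.

    jstack separates thread blocks with blank lines, so splitting on an
    empty line is enough for a rough filter.
    """
    blocks = jstack_output.split("\n\n")
    return [block for block in blocks if keyword in block]


def dump_parser_threads(keyword="cparser"):
    """Run jps and jstack, then print only the threads that mention `keyword`."""
    jps_out = subprocess.run(
        ["jps", "-l"], capture_output=True, text=True, check=True
    ).stdout
    for pid in find_java_pids(jps_out):
        dump = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, check=True
        ).stdout
        for block in extract_threads(dump, keyword):
            print(block)
            print()
```

Calling dump_parser_threads() while the parse is hung prints only the matching threads, which keeps the trace short enough to paste into a comment. If nothing matches, posting the full jstack output is the safer fallback.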

Are there any strange constructs in the header file? Is any output produced in the CParserPlugin.out file? This is the expanded file that will be sent to the CParser. If you can post the .h file, we can take a look.

(Parsing can be painful)

emteere, Jun 29 '22 14:06

I had a similar issue, and while in version 10.1.2 the parsing completed after a couple of minutes, in version 10.1.4 it needs several hours (but eventually, it is done).

RoppaClown, Jul 11 '22 05:07

I am noticing similar behavior for the ~73 MB header file that I am trying to parse as well. When I used version 10.1.2, it took roughly a minute before it finished preprocessing and gave me the first error during the parsing phase. But when I used version 10.1.5, it took nearly 12 hours for the preprocessing to complete before giving me an error.

Here's the Java stack after letting the 10.1.5 preprocessor run for a couple of minutes: jstack.txt

RainbowUnicorn7297, Sep 08 '22 05:09

This is fixed in an upcoming set of changes. There was a very bad inefficiency in parsing that has been corrected. Parsing is fairly quick now. Visual Studio 2022 header files parse in approximately 3 minutes. An example il2cpp.h file parses in approximately 40 seconds.

Also added is support for the C++ structure extension that I found in the file. Not full C++, just structure extension. It is implemented by placing the parent at the top of the child structure, which will hopefully work for you. Only one "parent" is supported currently. In the future, when full OO structures are supported, we'll change to more of an inheritance model.
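To illustrate what "adding the parent to the top of the child structure" means for the layout, here is a small Python ctypes sketch. It is not Ghidra code, and the Parent and Child types and their fields are made up for illustration. A C++ declaration such as struct Child : Parent { int32_t value; }; is modelled as a plain structure whose first member is the whole parent.

```python
import ctypes


class Parent(ctypes.Structure):
    # Hypothetical parent type with two pointer-sized fields.
    _fields_ = [
        ("klass", ctypes.c_void_p),
        ("monitor", ctypes.c_void_p),
    ]


class Child(ctypes.Structure):
    # Flattened form of "struct Child : Parent { int32_t value; };":
    # the parent is embedded as the first member, and the child's own
    # fields follow it.
    _fields_ = [
        ("parent", Parent),
        ("value", ctypes.c_int32),
    ]


# The parent sits at offset 0, so a pointer to Child can also be read as a
# pointer to Parent, as with single inheritance and no virtual members.
print("parent offset:", Child.parent.offset)
print("value offset: ", Child.value.offset)
print("sizeof(Parent):", ctypes.sizeof(Parent))
```

With this layout the child's first own field starts right after the parent's bytes, which is why a single parent can be handled by plain embedding, while multiple parents would need more than that.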

emteere, Oct 13 '22 02:10

Hi @emteere I'm probably running into the same issue with a 52MiB il2cpp_ghidra.h file as well. Is there something I could try to make it quicker?

Update: Tried the current code from master and it works, thanks :)

DieHertz, Oct 15 '22 16:10