
llm4decompile-ref dataset

Open kleinercubs opened this issue 1 year ago • 3 comments

Hi,

I am working with the llm4decompile-ref family of models (pseudo code -> source code) and have two questions about the dataset used for training.

  1. Are these models trained solely using the LLM4Binary/decompile-ghidra-100k dataset?
  2. Upon examining this dataset, it appears to contain a significant amount of duplicated data. Could you confirm whether this is expected, or whether it points to an error in how the data was processed?

Any clarification on this would be greatly appreciated. Thanks!

kleinercubs avatar Dec 02 '24 20:12 kleinercubs

The LLM4Binary/decompile-ghidra-100k dataset is a sample of the training data for the v2 series models. For training the v2 series, we use a larger dataset of roughly 1 billion tokens (approximately 1.6 million samples) and train for 2 epochs.

Regarding the duplicated data: it comes from the different optimization levels (O0 through O3) applied during compilation. The same source function is compiled once per optimization level, and each level can produce a slightly different pseudo code representation, so the source side of those pairs appears multiple times in the dataset (see the sketch below).
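
To make the duplication mechanism concrete, here is a minimal sketch (not the authors' actual build scripts) that compiles one single-function C file at each optimization level; the `func0.c` file name and output layout are assumptions for illustration.

```python
# Minimal sketch, not the authors' build scripts: compiling the same C
# source at O0-O3 yields four distinct objects, and hence four different
# pseudo-code variants that all pair with the same source function.
import subprocess

SRC = "func0.c"  # hypothetical single-function C file

for opt in ("O0", "O1", "O2", "O3"):
    obj = f"func0_{opt}.o"
    # Compile to an object file (-c) so a standalone function without
    # main() still builds; Ghidra can import object files directly.
    subprocess.run(["gcc", "-c", f"-{opt}", SRC, "-o", obj], check=True)
```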

albertan017 avatar Dec 03 '24 04:12 albertan017

For the larger dataset, do you mean compiling AnghaBench first and then decompiling with Ghidra?

kleinercubs avatar Dec 03 '24 04:12 kleinercubs

We're using the first 400K functions of ExeBench, which contains AnghaBench. Yes: compile the benchmark, then decompile it with Ghidra (see the sketch below).
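
For the decompilation side, here is a minimal sketch of driving Ghidra's headless analyzer from Python. The `GHIDRA_HOME` location and the `dump_pseudo.py` post-script are assumptions; `analyzeHeadless` and its `-import`/`-postScript`/`-deleteProject` flags are standard Ghidra, but the exact invocation used to build the dataset may differ.

```python
# Minimal sketch, assuming Ghidra is installed at GHIDRA_HOME and a
# post-script (dump_pseudo.py, hypothetical) writes the decompiler's
# pseudo code for each function to disk.
import os
import subprocess

GHIDRA_HOME = os.environ.get("GHIDRA_HOME", "/opt/ghidra")  # assumed install path
HEADLESS = os.path.join(GHIDRA_HOME, "support", "analyzeHeadless")

def decompile(binary_path: str, project_dir: str = "/tmp/ghidra_proj") -> None:
    """Import a binary into a throwaway Ghidra project and run a
    post-analysis script that dumps the decompiled pseudo code."""
    subprocess.run(
        [
            HEADLESS,
            project_dir, "proj",              # project location and name
            "-import", binary_path,
            "-postScript", "dump_pseudo.py",  # hypothetical dump script
            "-deleteProject",                 # discard project state afterwards
        ],
        check=True,
    )

# Example: decompile one of the object files produced above.
# decompile("func0_O0.o")
```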

albertan017 avatar Dec 03 '24 04:12 albertan017