Data filtering codes for training dataset construction?
Hi.
Thanks for your great work! I am very interested in the condensation you conducted to reduce Decompile-Bench-Raw to the final Decompile-Bench. I wonder how the deduplication process is implemented, and whether it would be possible for you to open-source the corresponding code to facilitate reproduction. Thank you.
Thank you for your interest. The filtering process is fairly straightforward; please see our paper for full details. In brief, we do the following (rough sketches of each step follow the list):
- Exclude any ASM functions not originating from the current project by matching their source-file locations against the GitHub repository name.
- Within each binary, identify ASM functions derived from the same source function and retain only the most similar one, based on the intersection between the DWARF-tracked compiled lines and the original source. (You can simplify this step by keeping the longest ASM function if tracking the DWARF info is too complex.)
- Deduplicate the remaining functions using MinHash.
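A minimal sketch of the first step, assuming each sample carries the DWARF source-file path of the function and the GitHub repository it was built from (field names such as `source_file` and `repo_name` are hypothetical, not the released schema):

```python
def belongs_to_project(source_file: str, repo_name: str) -> bool:
    """Keep a function only if its source file lives under the project itself,
    not under a system header or a vendored third-party dependency."""
    # Repo names look like "owner/project"; the build path usually contains "project".
    project = repo_name.split("/")[-1].lower()
    return project in source_file.lower()


def filter_foreign_functions(samples: list[dict]) -> list[dict]:
    # Drop ASM functions whose source location does not match the repo name.
    return [s for s in samples
            if belongs_to_project(s["source_file"], s["repo_name"])]
```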
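A sketch of the second step under the assumption that each sample records the binary it came from, the source function it maps to, the set of source line numbers DWARF attributes to it, and the line numbers of the original function body (again, all field names are hypothetical). The second function shows the simplified "keep the longest ASM" fallback mentioned above.

```python
from collections import defaultdict

def select_best_per_source(samples: list[dict]) -> list[dict]:
    # Group candidates that come from the same source function within the same binary.
    groups = defaultdict(list)
    for s in samples:
        groups[(s["binary_id"], s["source_func_id"])].append(s)

    kept = []
    for candidates in groups.values():
        # Score = how many original source lines the compiled code actually covers.
        best = max(candidates,
                   key=lambda s: len(set(s["dwarf_lines"]) & set(s["source_lines"])))
        kept.append(best)
    return kept


def select_longest_per_source(samples: list[dict]) -> list[dict]:
    # Simplified variant: skip DWARF line tracking and keep the longest ASM per group.
    groups = defaultdict(list)
    for s in samples:
        groups[(s["binary_id"], s["source_func_id"])].append(s)
    return [max(g, key=lambda s: len(s["asm"])) for g in groups.values()]
```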
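For the final MinHash deduplication, a sketch using the `datasketch` library (one common MinHash/LSH implementation; the exact shingling and similarity threshold used for the released dataset may differ):

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def minhash_of(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for token in text.split():           # token-level shingles; an assumption
        m.update(token.encode("utf8"))
    return m


def dedup_minhash(samples: list[dict], threshold: float = 0.9) -> list[dict]:
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    kept = []
    for i, s in enumerate(samples):
        m = minhash_of(s["asm"])
        if lsh.query(m):                 # a near-duplicate has already been kept
            continue
        lsh.insert(str(i), m)
        kept.append(s)
    return kept
```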