How to choose the base model?
How should one choose a base model? Is a larger model always better? And is there a difference between using a reasoning model versus a regular large model as the base?
Yes, larger models (if properly trained) are always better. In our experience, if you only have a small amount of data, chat/reasoning models work better than base models.
Thanks for sharing. I have another question: have you considered splitting the assembly into LLM-friendly structured text as pre-training data? Would that help the LLM understand assembly better?
Yes, you can use Ghidra or IDA’s disassembly output directly—they’ll generate jump labels and even recover strings or variable values. We use objdump simply because it’s much simpler and around 100× faster than Ghidra or IDA. We’ll try to add an IDA assembly version later. Since our emphasis is on decompilation, pseudocode from Ghidra or IDA is preferable to raw assembly.
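For concreteness, here is a minimal sketch of the kind of normalization this implies, assuming a plain ELF binary. The objdump flags are standard, but the regex and cleanup rules are illustrative, not the project's actual pipeline:

```python
import re
import subprocess

def disassemble(binary_path: str) -> str:
    """Disassemble with objdump and strip addresses and raw offsets,
    leaving instruction text the model can tokenize cleanly."""
    out = subprocess.run(
        ["objdump", "-d", "--no-show-raw-insn", binary_path],
        capture_output=True, text=True, check=True,
    ).stdout
    cleaned = []
    for line in out.splitlines():
        if line.endswith(">:"):  # function label, e.g. "0000...401000 <main>:"
            cleaned.append(line.split()[-1])
        else:
            # instruction lines look like "  401000:\tpush   %rbp"
            m = re.match(r"\s*[0-9a-f]+:\s*(.+)", line)
            if m:
                cleaned.append(m.group(1).strip())
    return "\n".join(cleaned)

if __name__ == "__main__":
    print(disassemble("./a.out"))  # path is illustrative
```

Dropping addresses and raw instruction bytes keeps the token budget on the mnemonics, operands, and labels, which is the part the model can actually learn from.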
This is a good suggestion. By the way, I am currently working on decompiling and deobfuscating ARMv8 assembly, and IDA does not work properly in my scenario. I think IDA's disassembly output loses a lot of information, so I am trying to provide more context to the LLM when fine-tuning. I'm not sure whether this will make the LLM perform better.
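If it helps, one way to picture "more context" is a fine-tuning record that carries recovered artifacts alongside the raw assembly. Every field name below is hypothetical, not from any existing dataset format:

```python
import json

# Hypothetical fine-tuning record: recovered strings and resolved call
# targets travel with the assembly so the model sees them at train time.
# All field names and values here are illustrative.
sample = {
    "instruction": "Decompile the following ARMv8 assembly to C.",
    "context": {
        "strings": ["error: %s\n"],     # string literals the function references
        "callees": ["printf", "exit"],  # resolved call targets
    },
    "input": "stp x29, x30, [sp, #-16]!\n...",  # disassembled function body
    "output": "/* ground-truth C source of the function */",
}
print(json.dumps(sample, indent=2))
```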