How to choose the base model?
How should one choose a base model? Is a larger model always better? And is there a difference between using a reasoning model versus a regular large model as the base?
Yes, larger models (if properly trained) are always better. In our experience, if you only have a small amount of data, chat/reasoning models work better than base models.
Thanks for sharing. I have another question: have you considered splitting the assembly into LLM-friendly structured text as pre-training data? Would that help the LLM understand assembly better?
Yes, you can use Ghidra or IDA’s disassembly output directly—they’ll generate jump labels and even recover strings or variable values. We use objdump simply because it’s much simpler and around 100× faster than Ghidra or IDA. We’ll try to add an IDA assembly version later. Since our emphasis is on decompilation, pseudocode from Ghidra or IDA is preferable to raw assembly.
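For concreteness, here is a minimal sketch of the kind of normalization this implies, assuming a plain ELF binary. The objdump flags are standard, but the regex and cleanup rules are illustrative, not the project's actual pipeline:

```python
import re
import subprocess

def disassemble(binary_path: str) -> str:
    """Disassemble with objdump and strip addresses and raw offsets,
    leaving instruction text the model can tokenize cleanly."""
    out = subprocess.run(
        ["objdump", "-d", "--no-show-raw-insn", binary_path],
        capture_output=True, text=True, check=True,
    ).stdout
    cleaned = []
    for line in out.splitlines():
        if line.endswith(">:"):  # function label, e.g. "0000...401000 <main>:"
            cleaned.append(line.split()[-1])
        else:
            # instruction lines look like "  401000:\tpush   %rbp"
            m = re.match(r"\s*[0-9a-f]+:\s*(.+)", line)
            if m:
                cleaned.append(m.group(1).strip())
    return "\n".join(cleaned)

if __name__ == "__main__":
    print(disassemble("./a.out"))  # path is illustrative
```

Dropping addresses and raw instruction bytes keeps the token budget on the mnemonics, operands, and labels, which is the part the model can actually learn from.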
This is a good suggestion. By the way, I am currently working on decompiling and deobfuscating ARMv8 assembly, and IDA does not work properly in my scenario. I think IDA's disassembly output loses a lot of information, so I am trying to provide more context to the LLM when fine-tuning. I'm not sure whether this will make the LLM perform better.
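If it helps, one way to picture "more context" is a fine-tuning record that carries recovered artifacts alongside the raw assembly. Every field name below is hypothetical, not from any existing dataset format:

```python
import json

# Hypothetical fine-tuning record: recovered strings and resolved call
# targets travel with the assembly so the model sees them at train time.
# All field names and values here are illustrative.
sample = {
    "instruction": "Decompile the following ARMv8 assembly to C.",
    "context": {
        "strings": ["error: %s\n"],     # string literals the function references
        "callees": ["printf", "exit"],  # resolved call targets
    },
    "input": "stp x29, x30, [sp, #-16]!\n...",  # disassembled function body
    "output": "/* ground-truth C source of the function */",
}
print(json.dumps(sample, indent=2))
```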