LLM4Decompile

How to train?

alphaonex86 opened this issue 1 year ago • 2 comments

Hi, I'd like to train on a larger base set (like Gentoo) and on multiple architectures (RISC-V, ARM, MIPS v2, x86, ...). How would I do that? Does your code support other architectures? I'd also like to train with old compilers; that would help with analyzing old, unmaintained code on the MIPS architecture.

alphaonex86 · Mar 18 '24 19:03

Due to the sequence length constraints of most large language models (LLMs), which typically range from 1,000 to 16,000 tokens, processing extensive inputs directly isn't feasible. It's better to segment your dataset into smaller, function-level chunks that pair each binary function with its corresponding source code. Once the data is prepared, it can be fed into the LLM for fine-tuning.
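As a rough illustration of that preparation step, here is a minimal sketch in Python. It is not our actual pipeline: the file names, the JSONL field names, and the assumption that source functions have already been extracted (e.g. with ctags) are all hypothetical; only the standard `objdump -d` invocation is real.

```python
import json
import re
import subprocess

# Hypothetical inputs: a compiled object file, and a map from function
# name to its C source text (extracted separately, e.g. with ctags).
BINARY = "example.o"
SOURCE_FUNCS = {"add": "int add(int a, int b) { return a + b; }"}

# Disassemble with objdump; "-d" prints one "<name>:" label per function,
# with functions separated by blank lines.
asm = subprocess.run(
    ["objdump", "-d", BINARY], capture_output=True, text=True, check=True
).stdout

# Split the disassembly into per-function chunks on the "<name>:" labels.
chunks = {}
for m in re.finditer(r"^[0-9a-f]+ <(\w+)>:\n((?:.+\n)+)", asm, re.MULTILINE):
    chunks[m.group(1)] = m.group(2)

# Emit one JSONL record per function that has both assembly and source,
# so every training example stays well under the context window.
with open("train.jsonl", "w") as out:
    for name, src in SOURCE_FUNCS.items():
        if name in chunks:
            out.write(json.dumps({"asm": chunks[name], "source": src}) + "\n")
```

Each record then becomes one prompt/completion pair for fine-tuning.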

Currently, our model is trained to support C language decompilation on the Linux x86_64 architecture. Regarding your interest in older compilers: the LLM generally treats input from different compilers similarly, without significant differentiation.

albertan017 · Mar 19 '24 03:03

I totally understand. But in the real world, where everyone has some blocking, unmaintained binary, we deal with a lot of cruft and large binaries. I also don't see how to decompile in parts: that implies fitting the previous/next chunk into the token window and being able to rewrite the previously written file. Maybe auto-chunk by function.
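For what it's worth, here is a minimal sketch of that kind of function-level auto-chunking. The `nm --defined-only` and `objdump --disassemble=<symbol>` calls are standard GNU binutils (the latter needs binutils ≥ 2.32); `decompile_function` and the binary name are purely hypothetical stand-ins.

```python
import subprocess

def decompile_function(asm_chunk: str) -> str:
    """Hypothetical stand-in for the call into the decompilation model."""
    raise NotImplementedError

BINARY = "module.ko"  # hypothetical example binary

# List the defined text-section symbols: one per function.
nm_out = subprocess.run(
    ["nm", "--defined-only", BINARY],
    capture_output=True, text=True, check=True,
).stdout
functions = [line.split()[-1] for line in nm_out.splitlines()
             if line.split()[1] in ("T", "t")]

# Decompile each function independently, so no single prompt has to
# carry the whole binary (at the cost of losing cross-function context).
for name in functions:
    disasm = subprocess.run(
        ["objdump", "-d", f"--disassemble={name}", BINARY],
        capture_output=True, text=True, check=True,
    ).stdout
    print(decompile_function(disasm))
```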

> Currently, our model is trained to support C language decompilation on the Linux x86_64 architecture

Yes, I'd like to train on more architectures, because I have some router code to study (MIPS), built with gcc 4.6 (kernel modules), obfuscated across multiple .ko files and .so files for userspace.

alphaonex86 · Mar 19 '24 11:03