Line map between output and input text

Open gratianlup opened this issue 1 year ago • 1 comment

Hi,

Would it be possible to also compute a line map between the LLM output and the Ghidra decompiler input? Something like LLM_OUT_LINES[line_number] = {one or more line numbers from the Ghidra input}.

In your Colab example, the output line: if (fabs(a[i] - a[j]) < eps)

would be mapped to the 3 input lines:

if ((float)(DAT_001020d0 &
                 (uint)(*(float *)(param_2 + (long)local_10 * 4) -
                       *(float *)(param_2 + (long)local_c * 4))) < param_1) {
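
A minimal sketch of what such a line map could look like as a data structure, assuming it is a plain dictionary from output line numbers to sets of input line numbers. The line numbers and the helper names `input_lines_for` and `invert` are hypothetical, chosen only for illustration; nothing here is produced by the model or taken from the Colab example.

```python
from typing import Dict, Set

# Hypothetical line map: LLM output line number -> Ghidra input line numbers.
# The numbers are made up; they only mirror the "one output line, three
# input lines" shape of the fabs() example.
LLM_OUT_LINES: Dict[int, Set[int]] = {
    5: {12, 13, 14},
}


def input_lines_for(out_line: int) -> Set[int]:
    """Return the input lines mapped to an output line (empty set if unmapped)."""
    return LLM_OUT_LINES.get(out_line, set())


def invert(line_map: Dict[int, Set[int]]) -> Dict[int, Set[int]]:
    """Build the reverse map: Ghidra input line -> LLM output lines."""
    inverse: Dict[int, Set[int]] = {}
    for out_line, in_lines in line_map.items():
        for in_line in in_lines:
            inverse.setdefault(in_line, set()).add(out_line)
    return inverse
```

Storing sets rather than single numbers keeps the one-to-many case explicit, and the inverted map answers the opposite question: which output lines a given input line contributed to.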

I'm not sure whether something like this can be done with LLMs at all. If it is doable, though, this project would be really useful for tools like profilers, where one could mark the source lines where most time is spent by mapping assembly instructions to lines with the help of debug info.
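
To illustrate the profiler use case, here is a rough sketch of how sample counts could be projected onto output lines once a line map exists. It assumes the profiler has already attributed samples to Ghidra input lines, and that an input line shared by several output lines splits its samples evenly among them. Both the function name `project_samples` and the even-split rule are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def project_samples(
    line_map: Dict[int, Set[int]],
    samples_per_input_line: Dict[int, int],
) -> Dict[int, float]:
    """Attribute profiler samples on input lines to LLM output lines.

    line_map: output line -> set of input lines (hypothetical format).
    samples_per_input_line: input line -> sample count from a profiler.
    """
    # Find which output lines claim each input line.
    owners: Dict[int, List[int]] = defaultdict(list)
    for out_line, in_lines in line_map.items():
        for in_line in in_lines:
            owners[in_line].append(out_line)

    # Split each input line's samples evenly among its output lines.
    # Input lines with no mapped output line are dropped.
    hot: Dict[int, float] = defaultdict(float)
    for in_line, count in samples_per_input_line.items():
        outs = owners.get(in_line, [])
        for out_line in outs:
            hot[out_line] += count / len(outs)
    return dict(hot)
```

Sorting the result by value in descending order would give the output lines where most time is spent.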

Sep 29 '24 07:09 gratianlup

Aligning the input and output of a large language model line by line isn't achievable unless we tailor the training process for it (similar to how objdump -d -S pairs one line of source code with a few lines of assembly). We plan to explore this line-by-line training approach (asm-src, not Ghidra) in future updates for a more versatile chat model. It might take a few months to develop, but we hope it will be beneficial.

We've also noticed that another group of researchers has done some work that may help in your situation; you might want to explore their models.

https://arxiv.org/pdf/2406.17233

Sep 29 '24 07:09 albertan017