albertan017 comments

61 comments by albertan017

Training budget estimation

> Are you training on a single node or multiple nodes, out of interest?

For the 1B model, we use a single node. For larger models, they are typically trained...

Ghidra extension

Thanks for your interest! Integration with Ghidra and IDA Pro is definitely on our roadmap. Currently, we are concentrating on training a new large language model designed for binary analysis....

I wonder if you could share some experience on collecting datasets

We've only found AnghaBench and ExeBench, which cover nearly all available C libraries. If you have specific requirements, you might need to manually compile larger projects like Linux. While it's...

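You can refer to the following issue: https://github.com/albertan017/LLM4Decompile/issues/33. There is currently no unified evaluation method, and we are also exploring different approaches, such as evaluating with GPT: https://github.com/albertan017/LLM4Decompile/blob/main/samples/readability_template.txt. If you have other evaluation methods to recommend, you are welcome to discuss them~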

How to Obtain O0-O3 Assembly Code from ExeBench Dataset?

For compilable data, you can follow the [compilation script for AnghaBench](https://github.com/albertan017/LLM4Decompile/blob/main/train/compile.py), with small modifications to handle the function source (exebench_data['func_def']) and its dependencies (exebench_data['synth_deps']). For executable data, it's quite...
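As a rough illustration of the idea above, the sketch below joins a row's `synth_deps` and `func_def` into one C file and compiles it at O0 through O3. It is only a minimal sketch under stated assumptions: the helper names (`build_source`, `compile_commands`, `compile_row`) are hypothetical, it emits assembly directly with `gcc -S`, and the linked `compile.py` may take a different route (for example, disassembling a compiled binary).

```python
import subprocess
from pathlib import Path

OPT_LEVELS = ["O0", "O1", "O2", "O3"]


def build_source(row: dict) -> str:
    """Join the dependency declarations and the function definition into one C file."""
    deps = row.get("synth_deps") or ""
    return deps.rstrip() + "\n\n" + row["func_def"].rstrip() + "\n"


def compile_commands(src: Path, out_dir: Path) -> dict:
    """Return one gcc command per optimization level, emitting assembly via -S."""
    return {
        opt: ["gcc", "-S", f"-{opt}", str(src), "-o", str(out_dir / f"{src.stem}_{opt}.s")]
        for opt in OPT_LEVELS
    }


def compile_row(row: dict, work_dir: Path, name: str = "func") -> dict:
    """Write the source to disk and compile it at each level; returns {opt: asm_text}.

    Levels that fail to compile are skipped rather than raising.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    src = work_dir / f"{name}.c"
    src.write_text(build_source(row))
    asm = {}
    for opt, cmd in compile_commands(src, work_dir).items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            # The output path is the last element of the command.
            asm[opt] = Path(cmd[-1]).read_text()
    return asm
```

Calling `compile_row(row, Path("build"))` on a single dataset row would return a dictionary keyed by optimization level, provided `gcc` is installed; rows whose dependencies do not compile simply yield fewer entries.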

How to Obtain O0-O3 Assembly Code from ExeBench Dataset?

Yes, in theory, it should be effective. However, we encounter difficulties in generating the appropriate assembly for execution. As a result, we adjust the input to the Wrapper and alter...

How to Obtain O0-O3 Assembly Code from ExeBench Dataset?

As highlighted in our paper, we first filter out functions that cannot be executed, by testing the executability of the original function (i.e., **use the dataset_row['func_def']**, not the 'decompiled_c_func' in step...

Consider reconstructing highly readable decompiled code based on project context?

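Current LLMs do not have project-level code understanding (an LLM can translate a single passage easily, but clearly starts forgetting when translating a whole chapter), and training and inference costs are extremely high (without optimizations, attention computation scales quadratically with input length), so project-level reconstruction is too costly and too difficult to train. We prefer to reconstruct functions individually and then integrate them: use each function's own information to reconstruct it, then feed the reconstructed functions together into a stronger model (GPT-o1, Deepseek-R1) to refine. llm4decompile is responsible for getting individual functions right, while GPT and similar models are better at integrating data at a higher level.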
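The "reconstruct individually, then integrate" workflow described above could be sketched roughly as follows. This is an illustrative assumption, not the project's actual pipeline: `build_refine_prompt` and the prompt wording are hypothetical, and the call that sends the prompt to a stronger model is deliberately left out.

```python
def build_refine_prompt(functions: dict) -> str:
    """Combine independently decompiled functions into one refinement prompt.

    `functions` maps a function name to the C source produced for it
    by a per-function decompilation model.
    """
    header = (
        "The following C functions were decompiled one at a time. "
        "Refine them together: make names and types consistent across "
        "functions without changing behavior.\n"
    )
    parts = [header]
    for name, source in functions.items():
        # Label each function so the refining model can refer to it by name.
        parts.append(f"// --- {name} ---\n{source.strip()}\n")
    return "\n".join(parts)
```

Keeping the per-function step separate means each decompilation request stays short, and only the final refinement step needs a model with a long context window.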

Prediction becomes empty, so the loss becomes NaN.

The 9B model is based on [Yi-Coder](https://github.com/01-ai/Yi-Coder), while the training script is from [Deepseek-Coder](https://github.com/deepseek-ai/DeepSeek-Coder). We did not test the 9B model with this script; we recommend using llama factory...

Concern Regarding Dataset Integrity

2024.5.10 Update: All the evaluations and models are now based on executables! Enjoy~ ~~Thanks for your interest in our project! Indeed, we're utilizing object files instead of executables, as our training...