Also, we recommend trying these projects: [Machine Language Model](https://mlm.lingyiwanwu.com/) and [BinaryAI](https://www.binaryai.cn/single-file). They're fascinating and very powerful!
Thank you for your interest in our project! The dataset we've provided is meant for evaluation purposes. For training material, please refer to [AnghaBench](https://github.com/brenocfg/AnghaBench), which provides a substantial resource...
Interesting! Good luck with the submission!
Aligning the input and output of a large language model isn't achievable unless we tailor the training process (similar to how objdump -d -S pairs one line of source code...
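To make the objdump -d -S pairing concrete, here is a hypothetical helper (not the project's pipeline) that dumps the source-interleaved disassembly; it assumes the binary was compiled with debug info (e.g. gcc -g):

```
# Hypothetical helper: source/assembly interleaving via objdump -d -S.
import subprocess

def source_interleaved_disasm(binary_path):
    """Return objdump's disassembly with source lines interleaved.

    Requires the binary to be built with -g so objdump can recover
    the source-to-assembly mapping."""
    result = subprocess.run(
        ['objdump', '-d', '-S', binary_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(source_interleaved_disasm('./sample'))
```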
You can use the model with HF's model.generate, run inference with vLLM (see the [evaluation script](https://github.com/albertan017/LLM4Decompile/blob/main/evaluation/run_evaluation_llm4decompile_vllm.py)), or convert it to gguf format (someone on HF has already converted it; we haven't tried that yet) and run it with ollama/lmstudio. The demo simply shows the most intuitive and straightforward workflow: preprocess, feed the model, read the output. The incomplete output is most likely due to the outputs = model.generate(**inputs, max_new_tokens=2048) setting. Note, however, that most training samples are around 2K tokens long, so 4K-token functions will probably fall short of expectations. We are preparing longer, stronger models.
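For concreteness, here is a minimal sketch of the HF model.generate path described above, assuming the input has already been preprocessed as in the demo; the file name and token limits are placeholders:

```
# Minimal sketch of HF inference; paths and limits are placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-6.7b-v2'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16).cuda()

with open('sample.pseudo', 'r') as f:  # preprocessed input, as in the demo
    asm_func = f.read()

inputs = tokenizer(asm_func, return_tensors='pt').to(model.device)
with torch.no_grad():
    # max_new_tokens caps the output length; raising it helps with cut-off
    # output, but note the ~2K-token training length discussed above.
    outputs = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][len(inputs['input_ids'][0]):],
                       skip_special_tokens=True))
```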
> > You can use the model with HF's model.generate, run inference with vLLM (see the [evaluation script](https://github.com/albertan017/LLM4Decompile/blob/main/evaluation/run_evaluation_llm4decompile_vllm.py)), or convert it to gguf format (someone on HF has already converted it; we haven't tried that yet) and run it with ollama/lmstudio. The demo simply shows the most intuitive and straightforward workflow: preprocess, feed the model, read the output. The incomplete output is most likely due to the outputs = model.generate(**inputs, max_new_tokens=2048) setting. Note, however, that most training samples are around 2K tokens long, so 4K-token functions will probably fall short of expectations. We are preparing longer, stronger models.
>
> Changing max_new_tokens to 4096 still gives incomplete output.

I can't quite reproduce your situation; I tried increasing max_length, but got the same result.

Model used: llm4decompile-9b-v2

Code:

```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

fileName = "sample"
model_path = 'LLM4Binary/llm4decompile-6.7b-v2'
# ...
```
> @albertan017 Running the evaluation [script](https://github.com/albertan017/LLM4Decompile/blob/main/evaluation/run_evaluation_llm4decompile_vllm.py) gives the following. Command: python run_evaluation_llm4decompile_vllm.py --model_path ../../llm4decompile-9b-v2 --testset_path ../decompile-eval/decompile-eval-executable-gcc-ghidra.json --gpus 4 --max_total_tokens 8192 --max_new_tokens 512 --repeat 1 --num_workers 16 --gpu_memory_utilization 0.82 --temperature 0
>
> Is there a problem with my script parameters?

What is the problem here? The screenshot essentially reproduces the llm4decompile-9b-v2 results from the paper. The various errors come from the compilation and execution tests (they correspond to the portion of the data that cannot be executed; we did not suppress errors raised during compilation and execution).
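For reference, here is a hypothetical sketch of the kind of vLLM call the evaluation script makes; the parameters mirror the CLI flags above, but this is not the script's exact code:

```
# Illustrative vLLM inference; the real evaluation script also handles
# preprocessing plus the compilation and execution tests.
from vllm import LLM, SamplingParams

llm = LLM(model='LLM4Binary/llm4decompile-9b-v2',
          tensor_parallel_size=4,        # --gpus 4
          gpu_memory_utilization=0.82,   # --gpu_memory_utilization 0.82
          max_model_len=8192)            # --max_total_tokens 8192
params = SamplingParams(temperature=0, max_tokens=512)  # --temperature, --max_new_tokens

prompts = ['...preprocessed assembly/pseudocode of one function...']
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```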
Thanks! We're working on Ghidra now, as it's widely employed in RE. Rizin also looks very interesting, and we will study it!
No, at the moment we only support x64-Linux. We plan to add ARM-Linux and x64-Windows soon, but we don’t yet have the build-and-test environments for other platforms.
Most large language models (LLMs) have sequence length limits of roughly 1,000 to 16,000 tokens, so processing very long inputs directly isn't feasible. It's better to segment...
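As a rough illustration of such segmentation, here is a hypothetical helper (not part of the project) that greedily packs lines into chunks under a token budget, using the model's own tokenizer to count tokens:

```
# Hypothetical segmentation helper: split long input at line boundaries
# so each chunk stays under the model's context budget.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('LLM4Binary/llm4decompile-6.7b-v2')

def segment_by_tokens(text, max_tokens=2048):
    chunks, current, count = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(tokenizer(line, add_special_tokens=False)['input_ids'])
        if current and count + n > max_tokens:
            chunks.append(''.join(current))
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        chunks.append(''.join(current))
    return chunks
```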