Also, we recommend trying these projects: [Machine Language Model](https://mlm.lingyiwanwu.com/) and [BinaryAI](https://www.binaryai.cn/single-file). They're fascinating and very powerful!
Thank you for your interest in our project! The dataset we've provided is meant for evaluation purposes. For training material, please refer to [AnghaBench](https://github.com/brenocfg/AnghaBench), which provides a substantial resource...
Interesting! Good luck with the submission!
Aligning the input and output of a large language model isn't achievable unless we tailor the training process (similar to how objdump -d -S pairs one line of source code...
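To make the objdump -d -S pairing concrete, here is a hypothetical helper (not the project's pipeline) that dumps the source-interleaved disassembly; it assumes the binary was compiled with debug info (e.g. gcc -g):

```
# Hypothetical helper: source/assembly interleaving via objdump -d -S.
import subprocess

def source_interleaved_disasm(binary_path):
    """Return objdump's disassembly with source lines interleaved.

    Requires the binary to be built with -g so objdump can recover
    the source-to-assembly mapping."""
    result = subprocess.run(
        ['objdump', '-d', '-S', binary_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(source_interleaved_disasm('./sample'))
```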
You can use the model with HF's model.generate, run inference with vLLM (see the [evaluation script](https://github.com/albertan017/LLM4Decompile/blob/main/evaluation/run_evaluation_llm4decompile_vllm.py)), or convert it to gguf format (someone on HF has already converted it; we haven't tried that yet) and run it with ollama/lmstudio. The demo simply shows the most intuitive and straightforward workflow: preprocess, feed the model, read the output. The incomplete output is most likely due to the outputs = model.generate(**inputs, max_new_tokens=2048) setting. Note, however, that most training samples are around 2K tokens long, so 4K-token functions will probably fall short of expectations. We are preparing longer, stronger models.
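For concreteness, here is a minimal sketch of the HF model.generate path described above, assuming the input has already been preprocessed as in the demo; the file name and token limits are placeholders:

```
# Minimal sketch of HF inference; paths and limits are placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-6.7b-v2'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16).cuda()

with open('sample.pseudo', 'r') as f:  # preprocessed input, as in the demo
    asm_func = f.read()

inputs = tokenizer(asm_func, return_tensors='pt').to(model.device)
with torch.no_grad():
    # max_new_tokens caps the output length; raising it helps with cut-off
    # output, but note the ~2K-token training length discussed above.
    outputs = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][len(inputs['input_ids'][0]):],
                       skip_special_tokens=True))
```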
> > You can use the model with HF's model.generate, run inference with vLLM (see the [evaluation script](https://github.com/albertan017/LLM4Decompile/blob/main/evaluation/run_evaluation_llm4decompile_vllm.py)), or convert it to gguf format (someone on HF has already converted it; we haven't tried that yet) and run it with ollama/lmstudio. The demo simply shows the most intuitive and straightforward workflow: preprocess, feed the model, read the output. The incomplete output is most likely due to the outputs = model.generate(**inputs, max_new_tokens=2048) setting. Note, however, that most training samples are around 2K tokens long, so 4K-token functions will probably fall short of expectations. We are preparing longer, stronger models.
>
> Changing max_new_tokens to 4096 still gives incomplete output.

I can't quite reproduce your situation; I tried increasing max_length, but got the same result.

Model used: llm4decompile-9b-v2

Code:

```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

fileName = "sample"
model_path = 'LLM4Binary/llm4decompile-6.7b-v2'
# ...
```
> @albertan017 Running the evaluation [script](https://github.com/albertan017/LLM4Decompile/blob/main/evaluation/run_evaluation_llm4decompile_vllm.py) gives the following. Command: python run_evaluation_llm4decompile_vllm.py --model_path ../../llm4decompile-9b-v2 --testset_path ../decompile-eval/decompile-eval-executable-gcc-ghidra.json --gpus 4 --max_total_tokens 8192 --max_new_tokens 512 --repeat 1 --num_workers 16 --gpu_memory_utilization 0.82 --temperature 0
>
> Is there a problem with my script parameters?

What is the problem here? The screenshot essentially reproduces the llm4decompile-9b-v2 results from the paper. The various errors come from the compilation and execution tests (they correspond to the portion of the data that cannot be executed; we did not suppress errors raised during compilation and execution).
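For reference, here is a hypothetical sketch of the kind of vLLM call the evaluation script makes; the parameters mirror the CLI flags above, but this is not the script's exact code:

```
# Illustrative vLLM inference; the real evaluation script also handles
# preprocessing plus the compilation and execution tests.
from vllm import LLM, SamplingParams

llm = LLM(model='LLM4Binary/llm4decompile-9b-v2',
          tensor_parallel_size=4,        # --gpus 4
          gpu_memory_utilization=0.82,   # --gpu_memory_utilization 0.82
          max_model_len=8192)            # --max_total_tokens 8192
params = SamplingParams(temperature=0, max_tokens=512)  # --temperature, --max_new_tokens

prompts = ['...preprocessed assembly/pseudocode of one function...']
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```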
Thanks! We're working on Ghidra now, as it's widely employed in RE. Rizin also looks very interesting, and we will study it!
No, at the moment we only support x64-Linux. We plan to add ARM-Linux and x64-Windows soon, but we don’t yet have the build-and-test environments for other platforms.
Most large language models (LLMs) have sequence length limits of roughly 1,000 to 16,000 tokens, so processing very long inputs directly isn't feasible. It's better to segment...
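As a rough illustration of such segmentation, here is a hypothetical helper (not part of the project) that greedily packs lines into chunks under a token budget, using the model's own tokenizer to count tokens:

```
# Hypothetical segmentation helper: split long input at line boundaries
# so each chunk stays under the model's context budget.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('LLM4Binary/llm4decompile-6.7b-v2')

def segment_by_tokens(text, max_tokens=2048):
    chunks, current, count = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(tokenizer(line, add_special_tokens=False)['input_ids'])
        if current and count + n > max_tokens:
            chunks.append(''.join(current))
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        chunks.append(''.join(current))
    return chunks
```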