LLaDA
Hello authors, and thanks for open-sourcing this. I came across Mercury Coder today, which is also a diffusion model, and it prompted a thought.
Mercury Coder seems to outperform traditional large language models on code, and I noticed that LLaDA likewise beats the similarly sized LLaMA 2 on both code and math. Did you specifically optimize for these capabilities, or do diffusion models have this advantage "by nature"?
We have previously discussed diffusion as a reasoning model in a series of works; feel free to explore and discuss them.
- Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models performs chain-of-thought (CoT) reasoning on a text diffusion model.
- Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning demonstrates the advantage of small-scale diffusion models over autoregressive (AR) models on tasks like the 24 Game and Sudoku, mainly because diffusion models have access to bidirectional information (see the sketch after this list).
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models shows advantages on tasks such as GSM8K and HumanEval-infilling at a 7B parameter scale.
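
To make the bidirectional-information point concrete, here is a minimal toy sketch (my own NumPy illustration, not code from any of the papers above): an AR model's causal attention mask lets a position condition only on its left context, while a masked diffusion denoiser attends over the full sequence, so a masked position is predicted from both sides.

```python
import numpy as np

seq_len = 6
masked_pos = 2  # the token we want to predict, e.g. a blank cell in a puzzle

# AR model: causal mask -- position i may attend only to positions j <= i.
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Masked diffusion denoiser: full attention -- every position sees every
# other position, so the prediction at `masked_pos` conditions on tokens
# both before AND after the mask.
full = np.ones((seq_len, seq_len), dtype=bool)

print("AR context for position", masked_pos, ":",
      np.nonzero(causal[masked_pos])[0])   # -> [0 1 2]  (left context only)
print("Diffusion context for position", masked_pos, ":",
      np.nonzero(full[masked_pos])[0])     # -> [0 1 2 3 4 5]  (bidirectional)
```

For constraint-satisfaction tasks like Sudoku, the value of a cell depends on cells both before and after it in the serialized sequence, which is exactly what the full attention pattern provides.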
Thanks for your interest, @leonardoshen-ch, and thanks to @summmeer for pointing to this series of related works.
LLaDA was not specifically optimized for math or code, and the community has not yet reached a consensus on whether diffusion models are inherently advantaged here. Below are some personal, unsubstantiated guesses that may well be wrong.
Autoregressive code-generation models such as Qwen and DeepSeek are often trained with a fill-in-the-middle (FIM) loss, and diffusion models support this kind of training natively, so diffusion models may have some unique advantages in code generation (a sketch of this point follows below). As @summmeer mentioned, bidirectional modeling helps on tasks like the 24 Game and Sudoku, so the distinctive strengths of diffusion models in mathematical reasoning may still be waiting to be uncovered.
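
As a rough illustration of the FIM point (a simplified sketch with hypothetical helper names, not how any particular model is actually trained): FIM corrupts the input by masking one contiguous middle span, while the masked-diffusion forward process masks token positions at random, and a contiguous middle span is just one of the mask patterns that process can produce.

```python
import random

MASK = "<mask>"

def fim_corrupt(tokens, span_start, span_end):
    """FIM-style corruption: mask one contiguous middle span (hypothetical helper)."""
    return [MASK if span_start <= i < span_end else t for i, t in enumerate(tokens)]

def diffusion_corrupt(tokens, mask_ratio):
    """Simplified masked-diffusion forward process: mask each token i.i.d."""
    return [MASK if random.random() < mask_ratio else t for t in tokens]

tokens = list("def add(a, b): return a + b")
print("".join(fim_corrupt(tokens, 10, 18)))     # infilling: middle span masked
print("".join(diffusion_corrupt(tokens, 0.4)))  # diffusion: random positions masked
# A contiguous middle span is one possible random mask pattern, so a model
# trained on the diffusion objective gets infilling-style supervision
# without needing a dedicated FIM loss.
```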