
Hello authors, and thanks for open-sourcing this. I saw Mercury Coder today, which is also a diffusion model, and have a few thoughts

Open leonardoshen-ch opened this issue 8 months ago • 2 comments

Mercury Coder seems to outperform traditional large language models at code, and I noticed that LLaDA likewise outperforms a similarly sized LLaMA 2 on both code and math. Did you specifically optimize for these capabilities, or do diffusion models have this advantage "by nature"?

leonardoshen-ch · Feb 27 '25 09:02

We have a series of earlier works discussing diffusion as a reasoning model; feel free to take a look and discuss.

  1. Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models performs chain-of-thought (CoT) reasoning with text diffusion models.
  2. Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning demonstrates the advantage of small-scale diffusion models over autoregressive (AR) models on tasks like the 24 game and Sudoku, mainly because diffusion models exploit bidirectional information (see the sketch after this list).
  3. Scaling Diffusion Language Models via Adaptation from Autoregressive Models shows advantages at the 7B parameter scale on tasks such as GSM8K and HumanEval-infilling.
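
To make the bidirectional point in item 2 concrete, here is a tiny, self-contained sketch (my own illustration, not code from any of the papers above) contrasting the causal attention mask an AR model decodes under with the full mask a diffusion denoiser sees. Under the full mask, constraints to the *right* of a blank (say, the target value in the 24 game or a later Sudoku cell) can directly inform the prediction at that blank:

```python
import torch

L = 5  # toy sequence length

# Causal mask: position i may attend only to positions j <= i,
# so an AR model predicts each token from its left context alone.
causal = torch.tril(torch.ones(L, L, dtype=torch.bool))

# Full (bidirectional) mask: every position attends to every other,
# so a diffusion denoiser can condition a blank on both sides at once.
bidirectional = torch.ones(L, L, dtype=torch.bool)

print(causal.int())
print(bidirectional.int())
```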

summmeer · Mar 01 '25 09:03

Thanks for your interest @leonardoshen-ch, and thanks @summmeer for pointing to this series of related works.

LLaDA was not specifically optimized for math or code. Whether diffusion models are inherently advantaged in these areas is still an open question in the community; what follows is my own unsubstantiated speculation, which may well be wrong.

Autoregressive code-generation models such as Qwen and DeepSeek are typically trained with a fill-in-the-middle (FIM) objective, whereas diffusion models support this kind of training natively, so diffusion models may hold a distinctive edge in code generation. As @summmeer noted, bidirectional modeling helps on tasks like the 24 game and Sudoku; the distinctive strengths of diffusion models in mathematical reasoning may still remain to be uncovered.
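
To illustrate why infilling is "native" to diffusion LMs, here is a minimal, self-contained sketch of the mask-then-unmask decoding pattern. It is a toy, not LLaDA's actual sampler: `model` is a random stand-in for a real bidirectional denoiser, and the confidence-based unmasking schedule is just one common choice. The structural point is that the prefix and suffix sit untouched in the context while the masked middle is filled in by the same procedure used for ordinary generation, with no special FIM objective:

```python
import torch

MASK_ID = 0
VOCAB_SIZE = 100

def model(tokens):
    # Stand-in for a bidirectional denoiser: per-position logits over the
    # vocabulary, conditioned on the full (partially masked) sequence.
    # A real model would actually use `tokens`; random logits suffice here.
    return torch.randn(tokens.shape[0], VOCAB_SIZE)

def infill(tokens, steps=4):
    """Iteratively replace MASK_ID positions, most confident first."""
    tokens = tokens.clone()
    n_masked = int((tokens == MASK_ID).sum())
    per_step = max(1, -(-n_masked // steps))  # ceil(n_masked / steps)
    while (tokens == MASK_ID).any():
        masked = (tokens == MASK_ID).nonzero(as_tuple=True)[0]
        logits = model(tokens)
        logits[:, MASK_ID] = float("-inf")  # never predict the mask itself
        conf, pred = logits.softmax(-1)[masked].max(-1)
        # Commit only the highest-confidence predictions this step; the
        # rest stay masked and are revisited with more revealed context.
        keep = conf.topk(min(per_step, masked.numel())).indices
        tokens[masked[keep]] = pred[keep]
    return tokens

# FIM setup: fixed prefix and suffix, masked middle. For a diffusion LM
# this is just ordinary decoding under a particular mask pattern.
prefix = torch.tensor([12, 7, 42])
suffix = torch.tensor([9, 33])
middle = torch.full((4,), MASK_ID)
print(infill(torch.cat([prefix, middle, suffix])))
```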

nieshenx · Mar 03 '25 10:03