
Is there a way to speed up testing?

Kairong-Han opened this issue 6 months ago · 3 comments

Great job, but I have a few small questions I'd like to ask.

I used the provided eval.sh script for evaluation. Taking GSM8K as an example, I ran the following evaluation command:

accelerate launch --main_process_port 29501 eval_llada.py --tasks gsm8k --model llada_dist --model_args model_path='LLaDA',gen_length=256,steps=256,block_length=256

However, I found the evaluation process to be extremely slow. Even with gen_length and related parameters set to 256, evaluating the model on two A100 GPUs takes around 6 hours. When gen_length is set to the default value of 1024, it takes up to 44 hours. Is this behavior normal? If so, are there any existing optimization methods that could help?

Kairong-Han avatar Jun 25 '25 05:06 Kairong-Han

Good question. I am also interested in accelerating the inference process of the diffusion model.

A paper published two weeks ago compared the decoding speeds of various diffusion models on GSM8K and other datasets. According to their report, a single NVIDIA RTX 4090 GPU achieves 4.55 TPS on GSM8K. With a test set of 1.32k instances at 256 tokens each, the theoretical runtime on one 4090 is about 20.7 hours. You mentioned running on two A100 GPUs in around 6 hours, which is roughly consistent with their findings.
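The back-of-envelope arithmetic above can be checked in a few lines (the instance count and TPS figure are the ones quoted from the thread, not re-measured):

```python
# Theoretical decoding runtime for GSM8K on a single GPU,
# using the throughput figure reported in the cited paper.
num_instances = 1320        # GSM8K test set size (~1.32k)
tokens_per_instance = 256   # gen_length used in the eval command
tps = 4.55                  # reported tokens/s on one RTX 4090

total_tokens = num_instances * tokens_per_instance
runtime_hours = total_tokens / tps / 3600
print(f"{runtime_hours:.1f} hours")  # ~20.6 hours on one GPU
```

This lines up with the ~20.7 hours figure above (the small difference comes from rounding the test-set size), and halving it for two GPUs is in the same ballpark as the 6-hour observation once the shorter effective generation lengths per sample are accounted for.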

The LLaDA work did not include a quantitative analysis of inference speed. I think this line of work would be valuable.

Rachum-thu avatar Jun 25 '25 23:06 Rachum-thu

Thanks for your interest!

A lot of papers focus on fast sampling for LLaDA. Please refer to Slow-Fast-Sampling, dLLM-cache, and Fast-dLLM for details.

nieshenx avatar Jun 30 '25 02:06 nieshenx

Do we need to calculate the TPS ourselves? I did not find this logic in the repo.
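Since the repo does not appear to report TPS, a minimal wall-clock measurement is easy to add yourself. This is a sketch, assuming a hypothetical `generate_fn` that takes a prompt and returns the list of generated token ids (adapt it to however you call the model):

```python
import time

def measure_tps(generate_fn, prompts):
    """Rough tokens-per-second: generated tokens over wall-clock time.

    generate_fn is a hypothetical callable: prompt -> list of token ids.
    """
    total_tokens = 0
    start = time.perf_counter()
    for p in prompts:
        out = generate_fn(p)
        total_tokens += len(out)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```

Note this counts end-to-end wall-clock time, so with multiple GPUs under `accelerate` you would measure per-process TPS and sum across processes.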

WillWu111 avatar Aug 01 '25 04:08 WillWu111