mlc-llm
[Question] Why read generation config in every decode step?
❓ General Questions
In every DecodeStep(), SampleTokenFromLogits() is called to sample from the logits, and it reads the generation config each time, which can become a bottleneck on devices with poor CPU performance. On my Jetson AGX Orin 64GB, reading out just 6 variables takes about 20 ms, while a forward pass of a 3B model takes only 6 ms. Since this happens on every decode step, it becomes a fixed per-token cost. My question is: in which cases would we actually need different generation parameters between steps? I always use the same generation parameters for a single request.
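To make the cost pattern concrete, here is a minimal sketch of the caching alternative (all names here are illustrative, not the actual mlc-llm code): parse the generation parameters once per request, so the expensive read never runs inside the decode loop.

```cpp
#include <string>

// Hypothetical generation parameters, mirroring the handful of
// sampling variables read on each decode step.
struct GenerationConfig {
  double temperature = 1.0;
  double top_p = 1.0;
  double repetition_penalty = 1.0;
  double presence_penalty = 0.0;
  double frequency_penalty = 0.0;
  int max_gen_len = 512;
};

// Counts how many times the (expensive) config read would run.
static int g_parse_calls = 0;

// Stand-in for parsing the JSON config; in the scenario above this
// is where the ~20 ms per call is spent on a weak CPU.
GenerationConfig ParseGenerationConfig(const std::string& /*json*/) {
  ++g_parse_calls;
  GenerationConfig cfg;
  cfg.temperature = 0.7;
  cfg.top_p = 0.95;
  return cfg;
}

// Decode loop that caches the config once instead of re-reading
// it on every step.
int RunDecode(const std::string& json, int num_steps) {
  const GenerationConfig cfg = ParseGenerationConfig(json);  // read once
  int tokens = 0;
  for (int i = 0; i < num_steps; ++i) {
    // SampleTokenFromLogits(cfg, ...) would use the cached values here.
    (void)cfg;
    ++tokens;
  }
  return tokens;
}
```

With this shape, the per-token cost of the config read drops to zero after prefill, regardless of sequence length.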
BTW, according to nativejson-benchmark, picojson is not the state-of-the-art C++ JSON library, and not even a first-class one.
Thank you @gesanqiu for reporting this finding. My first impression is that this is likely a bug and needs a fix. We will discuss and see how we can address it. We do not need to read the config every time.
Update: my MLC-LLM checkout is still from February 21st. My colleague told me that MLC-LLM now already supports concurrent generation requests in the server, so I will spend some time on that module.
Thanks for your reply @MasterJH5574. After further investigation, I found that mlc-llm doesn't have a scheduler to manage concurrent generation requests (assuming these requests have different generation configs), so if we want the correct sampling parameters for every separate generation request, the llm_chat object has to read the generation config in every forward pass.
I have already put all the generation parameters into ReadGenerationConfig() and call it only once, in PrefillStep(), to set them up.
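The read-once-at-prefill approach can be sketched roughly like this (class and member names are assumptions for illustration, not the actual llm_chat API): PrefillStep loads the config a single time, and every DecodeStep reuses the cached copy.

```cpp
#include <cassert>
#include <string>

// Illustrative cached generation parameters.
struct GenerationConfig {
  double temperature = 0.7;
  double top_p = 0.95;
};

// Hypothetical request lifecycle: PrefillStep reads the config once,
// and DecodeStep never touches the JSON again.
class ChatSession {
 public:
  void PrefillStep(const std::string& /*config_json*/) {
    // The real code would call something like ReadGenerationConfig()
    // here to populate cfg_ from JSON; we just mark it as loaded.
    config_loaded_ = true;
  }

  int DecodeStep() {
    assert(config_loaded_ && "PrefillStep must run first");
    // The cached cfg_ would feed SampleTokenFromLogits here.
    (void)cfg_.temperature;
    return ++tokens_emitted_;
  }

 private:
  GenerationConfig cfg_;
  bool config_loaded_ = false;
  int tokens_emitted_ = 0;
};
```

The trade-off, as noted in the discussion above, is that this assumes the parameters cannot change mid-request; per-request parameters need to live somewhere request-scoped once concurrent requests are involved.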
@MasterJH5574 Any progress on this issue? I'd love to fix it, but I think we need a more robust design. Like other frameworks, we probably need a SamplingParams class. If you have any ideas, please let me know.
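A per-request SamplingParams design like the one suggested here might look roughly as follows (a sketch only; the class, member, and method names are assumptions, not mlc-llm's actual API): each concurrent request registers its own parameters once, and decode steps look them up by request id instead of re-reading a shared config.

```cpp
#include <map>

// Hypothetical per-request sampling parameters, similar in spirit to
// the SamplingParams classes found in other serving frameworks.
struct SamplingParams {
  double temperature = 1.0;
  double top_p = 1.0;
  double repetition_penalty = 1.0;
  int max_tokens = 512;
};

// Minimal scheduler sketch: parameters are stored once per request,
// so the sampling path never parses the generation config again.
class Scheduler {
 public:
  // Register a request's parameters once, at admission/prefill time.
  void AddRequest(int request_id, SamplingParams params) {
    params_.emplace(request_id, params);
  }

  // Every decode step is a cheap map lookup, not a JSON read.
  const SamplingParams& ParamsFor(int request_id) const {
    return params_.at(request_id);
  }

  void FinishRequest(int request_id) { params_.erase(request_id); }

 private:
  std::map<int, SamplingParams> params_;
};
```

This also answers the earlier question of when different parameters are needed: different concurrent requests may carry different SamplingParams, but within one request the values stay fixed.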
The latest MLCEngine supports concurrent generation and reads the config only once; see #2217.