mlc-llm
[Question] Why read generation config in every decode step?
❓ General Questions
In every DecodeStep(), SampleTokenFromLogits() is called to sample from the logits, and it reads the generation config each time, which can become a bottleneck on devices with poor CPU performance. On my Jetson AGX Orin 64GB, reading out just 6 variables takes about 20 ms, while a forward pass of a 3B model takes only 6 ms. Since this happens on every decode step, it becomes a fixed per-token cost. My question is: in which cases would we actually need different generation parameters between steps? I always use the same generation parameters for a single request.
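To make the cost pattern concrete, here is a minimal sketch of the caching alternative (all names here are illustrative, not the actual mlc-llm code): parse the generation parameters once per request, so the expensive read never runs inside the decode loop.

```cpp
#include <string>

// Hypothetical generation parameters, mirroring the handful of
// sampling variables read on each decode step.
struct GenerationConfig {
  double temperature = 1.0;
  double top_p = 1.0;
  double repetition_penalty = 1.0;
  double presence_penalty = 0.0;
  double frequency_penalty = 0.0;
  int max_gen_len = 512;
};

// Counts how many times the (expensive) config read would run.
static int g_parse_calls = 0;

// Stand-in for parsing the JSON config; in the scenario above this
// is where the ~20 ms per call is spent on a weak CPU.
GenerationConfig ParseGenerationConfig(const std::string& /*json*/) {
  ++g_parse_calls;
  GenerationConfig cfg;
  cfg.temperature = 0.7;
  cfg.top_p = 0.95;
  return cfg;
}

// Decode loop that caches the config once instead of re-reading
// it on every step.
int RunDecode(const std::string& json, int num_steps) {
  const GenerationConfig cfg = ParseGenerationConfig(json);  // read once
  int tokens = 0;
  for (int i = 0; i < num_steps; ++i) {
    // SampleTokenFromLogits(cfg, ...) would use the cached values here.
    (void)cfg;
    ++tokens;
  }
  return tokens;
}
```

With this shape, the per-token cost of the config read drops to zero after prefill, regardless of sequence length.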
BTW, according to nativejson-benchmark, picojson is not the state-of-the-art C++ JSON library, and not even a first-class one.
Thank you @gesanqiu for reporting this finding. My first impression is that this is likely a bug and needs a fix. We will discuss and see how we can address it. We do not need to read the config every time.
Update: my MLC-LLM checkout is still from February 21st. My colleague told me that MLC-LLM now already supports concurrent generation requests in the server, so I will spend some time on that module.
Thanks for your reply @MasterJH5574. After further investigation, I found that mlc-llm doesn't have a scheduler to manage concurrent generation requests (assuming these requests have different generation configs), so if we want the correct sampling parameters for every separate generation request, the llm_chat object has to read the generation config in every forward pass.
I have already put all the generation parameters into ReadGenerationConfig() and call it only once, in PrefillStep(), to set them up.
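The read-once-at-prefill approach can be sketched roughly like this (class and member names are assumptions for illustration, not the actual llm_chat API): PrefillStep loads the config a single time, and every DecodeStep reuses the cached copy.

```cpp
#include <cassert>
#include <string>

// Illustrative cached generation parameters.
struct GenerationConfig {
  double temperature = 0.7;
  double top_p = 0.95;
};

// Hypothetical request lifecycle: PrefillStep reads the config once,
// and DecodeStep never touches the JSON again.
class ChatSession {
 public:
  void PrefillStep(const std::string& /*config_json*/) {
    // The real code would call something like ReadGenerationConfig()
    // here to populate cfg_ from JSON; we just mark it as loaded.
    config_loaded_ = true;
  }

  int DecodeStep() {
    assert(config_loaded_ && "PrefillStep must run first");
    // The cached cfg_ would feed SampleTokenFromLogits here.
    (void)cfg_.temperature;
    return ++tokens_emitted_;
  }

 private:
  GenerationConfig cfg_;
  bool config_loaded_ = false;
  int tokens_emitted_ = 0;
};
```

The trade-off, as noted in the discussion above, is that this assumes the parameters cannot change mid-request; per-request parameters need to live somewhere request-scoped once concurrent requests are involved.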
@MasterJH5574 Any progress on this issue? I'd love to fix it, but I think we need a more robust design. Like other frameworks, we probably need a SamplingParams class. If you have any ideas, please let me know.
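A per-request SamplingParams design like the one suggested here might look roughly as follows (a sketch only; the class, member, and method names are assumptions, not mlc-llm's actual API): each concurrent request registers its own parameters once, and decode steps look them up by request id instead of re-reading a shared config.

```cpp
#include <map>

// Hypothetical per-request sampling parameters, similar in spirit to
// the SamplingParams classes found in other serving frameworks.
struct SamplingParams {
  double temperature = 1.0;
  double top_p = 1.0;
  double repetition_penalty = 1.0;
  int max_tokens = 512;
};

// Minimal scheduler sketch: parameters are stored once per request,
// so the sampling path never parses the generation config again.
class Scheduler {
 public:
  // Register a request's parameters once, at admission/prefill time.
  void AddRequest(int request_id, SamplingParams params) {
    params_.emplace(request_id, params);
  }

  // Every decode step is a cheap map lookup, not a JSON read.
  const SamplingParams& ParamsFor(int request_id) const {
    return params_.at(request_id);
  }

  void FinishRequest(int request_id) { params_.erase(request_id); }

 private:
  std::map<int, SamplingParams> params_;
};
```

This also answers the earlier question of when different parameters are needed: different concurrent requests may carry different SamplingParams, but within one request the values stay fixed.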
The latest MLCEngine supports concurrent generation and reads the config only once; see #2217.