Very slow time to first token on ROCm
❓ General Questions
When using MLC LLM with ROCm on a Radeon 7900 XTX, I am noticing a very large time to first token. With context lengths around 4k tokens, I'm seeing delays of 4-5 seconds before generation starts when running Llama 3.1 8B at q4f16_1 quantization.
I've been using MLC LLM for quite a while on CUDA with an Nvidia 3080 with great success. I never benchmarked the performance, but I was using it for a real-time assistant and never noticed large delays.
One more factor to consider: when running on the Nvidia GPU, I was running MLC in a Docker container (using the WSL2 integration). With the AMD GPU, however, I am running MLC directly within WSL, mostly because it doesn't currently seem possible to pass ROCm support through Docker's WSL2 integration.
I've tried tweaking the context size and prefill chunk size to no avail, and I've tried different models with similar results. My next course of action was going to be trying Vulkan directly on Windows, but based on everything I've read I would expect ROCm to outperform Vulkan.
Does anyone have any insight into what could cause a slowdown like this?
After some testing, I found that setting my server to "interactive" mode made it behave much more like I expected. It was about the same speed for the first response, but subsequent responses were much, much faster. This leads me to think that something is wrong with cache handling at larger batch sizes (>1) on ROCm.
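For anyone wanting to reproduce, this is roughly how the interactive server is launched; a sketch, assuming a standard MLC install (the model path below is a placeholder for whichever compiled model you're using, and I believe `--mode` and `--device` are the standard serve flags, but double-check against your version's `mlc_llm serve --help`):

```shell
# Sketch: "interactive" mode limits the engine to a single concurrent
# request, which is what made the KV cache get reused across turns
# in my testing. Model path is a placeholder.
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC \
    --mode interactive \
    --device rocm
```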
Based on this, I think it is likely to be the same underlying issue as: #2992
Another small update:
In addition to the server mode, the context window size also seems very important. In my testing, if I set the context window size to anything less than double the context I'm actually using, it exhibits the same issue, where it seems as though the input is not getting cached at all.
As an example, if I'm using a context of around 4.5k tokens:

Server in interactive mode with 8k context window size:
- First command: 5s delay
- Subsequent commands: 5s delay

Server in interactive mode with 10k context window size:
- First command: 5s delay
- Subsequent commands: 0.5s delay
@MasterJH5574 Do you have any thoughts on this? For now the increased context window size is an okay workaround, but it is taking up a lot of unnecessary VRAM and also limits me from using larger models.
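Concretely, the workaround amounts to overriding the context window at serve time to roughly double the working context; a sketch, assuming the `--overrides` flag accepts `context_window_size` as in the MLC docs (model path is a placeholder):

```shell
# Workaround sketch: for a ~4.5k-token working context, serve with a
# ~10k context window so prefix reuse keeps working. The larger KV
# cache costs extra VRAM. Model path is a placeholder.
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC \
    --mode interactive \
    --overrides "context_window_size=10240"
```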
I bit the bullet and switched over to Ubuntu 24.04, hoping WSL was the issue... it was not, unfortunately.
I am still experiencing the same behavior. As soon as the total context exceeds 50% of the context_window_size, the TTFT skyrockets.
I ran a quick test, using two prompts to bring the context length high enough to show the issue. For reference, the context window size here is set to 8192, which is why the issue begins once the context exceeds 4096 tokens. Here is the /metrics output after each prompt.
# /last_finished_request
last_finished_request_prompt_tokens 34
last_finished_request_completion_tokens 1796
last_finished_request_prefill_tokens 14
last_finished_request_decode_tokens 1795
last_finished_request_jump_forward_tokens 0
last_finished_request_prefill_tokens_per_s 122.04077703153521
last_finished_request_decode_tokens_per_s 96.42711830854311
last_finished_request_end_to_end_latency_s 18.729811089000002
last_finished_request_ttft_s 0.114715756
last_finished_request_inter_token_latency_s 0.010428625327951004
# /last_finished_request
last_finished_request_prompt_tokens 1849
last_finished_request_completion_tokens 4359
last_finished_request_prefill_tokens 20
last_finished_request_decode_tokens 4358
last_finished_request_jump_forward_tokens 0
last_finished_request_prefill_tokens_per_s 119.23480633787365
last_finished_request_decode_tokens_per_s 59.53352148080966
last_finished_request_end_to_end_latency_s 73.37019247900001
last_finished_request_ttft_s 0.167736256
last_finished_request_inter_token_latency_s 0.016831886322321635
# /last_finished_request
last_finished_request_prompt_tokens 6219
last_finished_request_completion_tokens 36
last_finished_request_prefill_tokens 6219
last_finished_request_decode_tokens 35
last_finished_request_jump_forward_tokens 0
last_finished_request_prefill_tokens_per_s 701.2731008962613
last_finished_request_decode_tokens_per_s 47.81840293122218
last_finished_request_end_to_end_latency_s 9.600092838
last_finished_request_ttft_s 8.868157059000001
last_finished_request_inter_token_latency_s 0.2666692455
This also seems to reveal the source of the long TTFT: once the 50% limit is reached, the entire history of tokens appears to be getting prefilled from scratch, whereas before that the existing chat history counted toward the decode_tokens. I don't know nearly enough about the inner workings of MLC to guess at the cause, but I'm hoping this info holds some clues for someone with a better understanding than I have.
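The pattern is easy to check mechanically from the /metrics text above: when prefill_tokens is close to prompt_tokens, nothing was served from the cache. A minimal sketch of that check, assuming only the plain-text "key value" format shown in the dumps (the 0.9 threshold is an arbitrary choice for illustration):

```python
# Sketch: parse plain-text /metrics output like the dumps above and
# flag requests where essentially the whole prompt was re-prefilled,
# i.e. the KV cache contributed almost nothing.

def parse_metrics(text: str) -> dict:
    """Parse 'key value' lines into a dict of floats, skipping '#' comments."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split()
        metrics[key] = float(value)
    return metrics

def cache_missed(metrics: dict, threshold: float = 0.9) -> bool:
    """True when nearly the entire prompt was prefilled from scratch."""
    prompt = metrics["last_finished_request_prompt_tokens"]
    prefill = metrics["last_finished_request_prefill_tokens"]
    return prompt > 0 and prefill / prompt >= threshold

# Third dump above: 6219 prompt tokens, all 6219 prefilled.
sample = """\
last_finished_request_prompt_tokens 6219
last_finished_request_prefill_tokens 6219
"""
print(cache_missed(parse_metrics(sample)))  # True: everything re-prefilled
```

Run against the second dump (1849 prompt tokens, only 20 prefilled) it returns False, matching the healthy behavior before the 50% limit is crossed.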
I also verified that this exact same issue occurs on Vulkan as well.