
Very slow time to first token on ROCm

Open · Jyers opened this issue · 5 comments

❓ General Questions

When using MLC LLM with ROCm on a Radeon 7900 XTX, I am noticing a very large time to first token. With context lengths around 4k tokens, I'm seeing delays of 4–5 seconds before generation starts when running Llama 3.1 8B at q4f16_1 quantization.

I've been using MLC LLM for quite a while on CUDA with an Nvidia 3080 with great success. I never benchmarked the performance, but I was using it for a real-time assistant and never noticed large delays.

One more factor to consider: when running on the Nvidia GPU, I was running MLC in a Docker container (using the WSL2 integration). With the AMD GPU, I am now running MLC directly within WSL, mostly because it does not currently seem possible to pass ROCm support through Docker's WSL2 integration.

I've tried tweaking the context size and prefill chunk size to no avail, and I've tried different models with similar results. My next course of action was going to be trying Vulkan directly in Windows, but based on everything I've read, I would expect ROCm to perform better than Vulkan.
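For reference, this is roughly how I've been adjusting those values when launching the server. Treat it as a sketch rather than my exact invocation: the model path is a placeholder, and I'm assuming the `--overrides` flag accepts semicolon-separated key=value pairs as it does in recent releases.

```bash
# Start the MLC LLM server with explicit context window and prefill chunk sizes.
# Model path and the override values are placeholders.
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC \
  --device rocm \
  --overrides "context_window_size=8192;prefill_chunk_size=2048"
```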

Does anyone have any insight into what could cause a slowdown like this?

Jyers · Feb 05 '25

After some testing, I found that setting my server to "interactive" mode made it behave much more like I expected. The first response was about the same speed, but subsequent responses were much faster. This leads me to think that something is wrong with cache handling at larger batch sizes (>1) on ROCm.
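For anyone else hitting this, "interactive" here refers to the engine mode flag on the serve command. A minimal sketch of how I'm launching it (model path is a placeholder, and I'm assuming the usual local/interactive/server choices for `--mode`):

```bash
# Run the server in interactive mode, which limits the engine to a single
# concurrent sequence and is intended for one-user, chat-style usage.
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC \
  --device rocm \
  --mode interactive
```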

Based on this, I think it is likely to be the same underlying issue as: #2992

Jyers · Feb 09 '25

Another small update:

In addition to the impact of the server mode, the context window size also seems to be very important. Based on my testing, if I set the context window size to anything less than double the size of the context I'm actually using, it exhibits the same issue, where it seems as though the input is not getting cached at all.

As an example, with a context of around 4.5k tokens:

Server in interactive mode with an 8k context window size:
- First command: 5 s delay
- Subsequent commands: 5 s delay

Server in interactive mode with a 10k context window size:
- First command: 5 s delay
- Subsequent commands: 0.5 s delay

Jyers · Feb 10 '25

@MasterJH5574 Do you have any thoughts on this? For now the increased context window size is an okay workaround, but it takes up a lot of unnecessary VRAM and also keeps me from using larger models.

Jyers · Feb 13 '25

I bit the bullet and switched over to Ubuntu 24.04, hoping WSL was the issue... unfortunately, it was not.

I am still experiencing the same behavior. As soon as the total context exceeds 50% of the context_window_size, the TTFT skyrockets.

I ran a quick test, using two prompts to bring the context length high enough to show the issue. For reference, the context window size here is set to 8192, hence why the issue begins once the context exceeds 4096 tokens. Here is the /metrics output after each prompt.
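(Each snapshot below was pulled from the server's metrics endpoint after the request finished, roughly like this; the host and port here are just assumed local defaults:)

```bash
# Fetch the serve engine's metrics; host/port are assumed defaults for a local server.
curl http://127.0.0.1:8000/metrics
```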

```
# /last_finished_request
last_finished_request_prompt_tokens	34
last_finished_request_completion_tokens	1796
last_finished_request_prefill_tokens	14
last_finished_request_decode_tokens	1795
last_finished_request_jump_forward_tokens	0
last_finished_request_prefill_tokens_per_s	122.04077703153521
last_finished_request_decode_tokens_per_s	96.42711830854311
last_finished_request_end_to_end_latency_s	18.729811089000002
last_finished_request_ttft_s	0.114715756
last_finished_request_inter_token_latency_s	0.010428625327951004
# /last_finished_request
last_finished_request_prompt_tokens	1849
last_finished_request_completion_tokens	4359
last_finished_request_prefill_tokens	20
last_finished_request_decode_tokens	4358
last_finished_request_jump_forward_tokens	0
last_finished_request_prefill_tokens_per_s	119.23480633787365
last_finished_request_decode_tokens_per_s	59.53352148080966
last_finished_request_end_to_end_latency_s	73.37019247900001
last_finished_request_ttft_s	0.167736256
last_finished_request_inter_token_latency_s	0.016831886322321635
# /last_finished_request
last_finished_request_prompt_tokens	6219
last_finished_request_completion_tokens	36
last_finished_request_prefill_tokens	6219
last_finished_request_decode_tokens	35
last_finished_request_jump_forward_tokens	0
last_finished_request_prefill_tokens_per_s	701.2731008962613
last_finished_request_decode_tokens_per_s	47.81840293122218
last_finished_request_end_to_end_latency_s	9.600092838
last_finished_request_ttft_s	8.868157059000001
last_finished_request_inter_token_latency_s	0.2666692455
```

This also seems to at least reveal the source of the long TTFT: once the 50% limit is reached, the entire token history appears to be getting prefilled again, whereas before that point the existing chat history was counted under decode_tokens. The numbers back this up: the final request shows 6219 prefill tokens at roughly 700 tokens/s, which accounts for essentially all of the 8.87 s TTFT. I don't know nearly enough about the inner workings of MLC to guess at what could cause this, but I'm hoping this info has some clues for someone with a better understanding than I have.

Jyers · Feb 26 '25

I also verified that the exact same issue occurs on Vulkan.

Jyers · Feb 26 '25