
torch engine optimize prefill for long context

grimoire opened this issue 1 year ago · 5 comments

Long-context prompts move the full logits to the host, which is time-consuming. With this PR, the full logits are not output if no request requires return_logits.
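A minimal sketch of the idea (not lmdeploy's actual code; select_logits, lm_head and return_logits are illustrative names here): compute and copy the full [seq_len, vocab] logits only when some request asked for them, otherwise keep just the last-token logits needed for sampling.

# Minimal sketch, not lmdeploy's actual code: skip the huge
# device-to-host logits copy unless a request asked for full logits.
import torch

def select_logits(hidden_states: torch.Tensor,
                  lm_head: torch.nn.Linear,
                  return_logits: bool) -> torch.Tensor:
    """hidden_states: [seq_len, hidden_size] of one prefill request."""
    if return_logits:
        # a request wants the full logits: compute [seq_len, vocab]
        # and move everything to the host (expensive for long context)
        return lm_head(hidden_states).cpu()
    # nobody asked for full logits: only the last position is needed
    # to sample the next token, so compute and keep just that row
    return lm_head(hidden_states[-1:])  # [1, vocab], stays on GPU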


exp is expensive in CUDA (ex2.approx.f32); replace it with tl.math.fast_expf.
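A minimal Triton sketch of that swap (not the PR's actual kernel; it assumes a Triton build that exposes tl.math.fast_expf): a row-softmax kernel where the plain tl.exp call is replaced by the faster approximation.

import torch
import triton
import triton.language as tl

@triton.jit
def softmax_row_kernel(x_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    # one program per row; BLOCK must cover n_cols
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * n_cols + offs, mask=mask, other=-float('inf'))
    x = x - tl.max(x, axis=0)
    # was: num = tl.exp(x)
    num = tl.math.fast_expf(x)  # cheaper exp approximation
    denom = tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + offs, num / denom, mask=mask)

x = torch.randn(8, 1000, device='cuda', dtype=torch.float32)
out = torch.empty_like(x)
softmax_row_kernel[(x.shape[0],)](x, out, x.shape[1], BLOCK=1024)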

internlm2_5-7b, tp=2, context length 949840: 808.1619 s

grimoire · Jul 09 '24 03:07

Could you merge main so that I can use profile_generation.py to benchmark prefill? Here is the result from the main branch:

 python profile_generation.py  /mnt/140/InternLM/internlm2_5-7b-chat-1m/ --tp 4 -c 1 -ct 1 -pt 1000000 --session-len 1048576 --backend pytorch -w 0 -tr 1
profiling ... concurrency: 1, n_prompt_token: 1000000, n_completion_token: 1, test_round: 1, warmup_round: 0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [29:01<00:00, 1741.85s/it]

--------------------------------------------------
total time: 1741.85s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 1739.192s, 1739.192s, 1739.192s
total_token latency(min, max, ave): 1739.192s, 1739.192s, 1739.192s
token_latency percentiles(50%,75%,95%,99%)(s): [1739.192, 1739.192, 1739.192, 1739.192]
throughput(output): 0.0 token/s
throughput(total): 574.1 token/s
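For reference, the reported throughput(total) works out to test_round × (input + output tokens) divided by the total time; this is just an observation from the figures in this thread, not code from profile_generation.py:

def total_throughput(input_tokens, output_tokens, test_round, total_time_s):
    # reproduces the reported throughput(total) figures
    return test_round * (input_tokens + output_tokens) / total_time_s

print(total_throughput(1_000_000, 1, 1, 1741.85))  # ≈ 574.1  (this run)
print(total_throughput(1_000_000, 1, 3, 3925.31))  # ≈ 764.3  (the 3-round run below)
print(total_throughput(1_000_000, 1, 1, 431.54))   # ≈ 2317.3 (the turbomind run below)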

lvhan028 · Jul 23 '24 04:07

Testing 3 rounds:

total time: 3925.31s
concurrency: 1, test_round: 3
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 1284.825s, 1352.598s, 1307.773s
total_token latency(min, max, ave): 1284.825s, 1352.598s, 1307.773s
token_latency percentiles(50%,75%,95%,99%)(s): [1307.773, 1307.773, 1307.773, 1307.773]
throughput(output): 0.0 token/s
throughput(total): 764.27 token/s

The latency drops from 1741.85 s to (1284.825 s, 1352.598 s, 1307.773 s) for min, max, and ave respectively.

lvhan028 · Jul 23 '24 08:07

There is still room for improvement. Here is a test result from the turbomind engine:

total time: 431.54s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 431.534s, 431.534s, 431.534s
total_token latency(min, max, ave): 431.534s, 431.534s, 431.534s
token_latency percentiles(50%,75%,95%,99%)(s): [431.534, 431.534, 431.534, 431.534]
throughput(output): 0.0 token/s
throughput(total): 2317.27 token/s

lvhan028 · Jul 23 '24 09:07

Without warmup, the prefill performance of the turbomind engine is 658 s:

root@676b9b0ce151:/workspace/lmdeploy/benchmark# python profile_generation.py /workspace/models-140/InternLM/internlm2_5-7b-chat-1m --session-len 1048576 -c 1 -ct 1 -pt 1000000 --cache-max-entry-count 0.7 --tp 4 -w 0 -tr 1
profiling ... concurrency: 1, n_prompt_token: 1000000, n_completion_token: 1, test_round: 1, warmup_round: 0
  0%|                                                                                                                         | 0/1 [00:00<?, ?it/s][WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [10:58<00:00, 658.30s/it]

--------------------------------------------------
total time: 658.30s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 544.858s, 544.858s, 544.858s
total_token latency(min, max, ave): 544.858s, 544.858s, 544.858s
token_latency percentiles(50%,75%,95%,99%)(s): [544.858, 544.858, 544.858, 544.858]
throughput(output): 0.0 token/s
throughput(total): 1519.06 token/s

lvhan028 · Jul 23 '24 10:07

--------------------------------------------------
total time: 850.82s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 848.731s, 848.731s, 848.731s
total_token latency(min, max, ave): 848.731s, 848.731s, 848.731s
token_latency percentiles(50%,75%,95%,99%)(s): [848.731, 848.731, 848.731, 848.731]
throughput(output): 0.0 token/s
throughput(total): 1175.34 token/s
--------------------------------------------------

grimoire · Aug 19 '24 03:08

I tried the long-context test described in "long_context.md". It didn't generate any tokens when session_len is 800000, but the result shows finish_reason being 'stop':

41221 [Response(text='', generate_token_len=0, input_token_len=767201, session_id=0, finish_reason='stop', token_ids=[], logprobs=None, index=0)]

lvhan028 · Aug 22 '24 06:08

Forget about this case; I need to use tp. @AllentDan, EngineInstance.async_stream_infer returns an error, but AsyncEngine.generate doesn't handle it:

yield EngineOutput(ResponseType.INPUT_LENGTH_ERROR, [], 0)

I think we need to discuss how to handle this exceptional case.

lvhan028 · Aug 22 '24 06:08

We can either output the error message to the client or print the error to the console, since the server can't be aborted.
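For discussion, a rough, self-contained sketch of the first option (surfacing the error in the response); EngineOutput and ResponseType below are simplified stand-ins mirroring the quoted yield, not lmdeploy's actual classes.

import asyncio
from dataclasses import dataclass, field
from enum import Enum, auto

class ResponseType(Enum):
    SUCCESS = auto()
    INPUT_LENGTH_ERROR = auto()

@dataclass
class EngineOutput:
    status: ResponseType
    token_ids: list = field(default_factory=list)
    num_token: int = 0

async def async_stream_infer():
    # stand-in for EngineInstance.async_stream_infer: the prompt
    # exceeded session_len, so only an error output is yielded
    yield EngineOutput(ResponseType.INPUT_LENGTH_ERROR, [], 0)

async def generate():
    # stand-in for AsyncEngine.generate: check the status instead of
    # silently finishing with an empty text and finish_reason='stop'
    async for out in async_stream_infer():
        if out.status != ResponseType.SUCCESS:
            yield dict(text='', finish_reason='error', error=out.status.name)
            return
        yield dict(text='<decoded tokens>', finish_reason=None)

async def main():
    async for resp in generate():
        print(resp)
        # {'text': '', 'finish_reason': 'error', 'error': 'INPUT_LENGTH_ERROR'}

asyncio.run(main())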

AllentDan · Aug 22 '24 07:08