Torch engine: optimize prefill for long context
With long context, moving the full logits to the host is time-consuming. With this PR, full logits are not output unless a request requires return_logits.
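A minimal sketch of the idea (illustrative names and shapes, not the PR's actual code): when no request asks for return_logits, only the last position of each sequence is needed for sampling, so the hidden states can be sliced before the lm_head and before the device-to-host copy.

```python
import torch

def compute_logits(hidden_states: torch.Tensor,   # [num_tokens, hidden_dim]
                   lm_head: torch.nn.Linear,
                   last_token_idx: torch.Tensor,  # last-token index per sequence
                   return_logits: bool) -> torch.Tensor:
    if not return_logits:
        # For a long prefill, keep one row per sequence instead of the
        # full [num_tokens, vocab_size] logits matrix.
        hidden_states = hidden_states[last_token_idx]
    logits = lm_head(hidden_states)
    # The D2H copy is now tiny unless full logits were explicitly requested.
    return logits.cpu()

# Toy usage: one sequence of 8192 tokens, only its last token needs logits.
hidden = torch.randn(8192, 256)
head = torch.nn.Linear(256, 1024, bias=False)
logits = compute_logits(hidden, head, torch.tensor([8191]), return_logits=False)
print(logits.shape)  # torch.Size([1, 1024])
```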
exp is expensive in CUDA. Replace it with tl.math.fast_expf, which maps to the fast approximate ex2.approx.f32 instruction.
internlm2_5-7b, tp=2, 949840-token context: 808.1619 s
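As a hedged illustration (a toy Triton softmax, not the PR's actual kernel, assuming a Triton version that exposes tl.math.fast_expf), the substitution looks like this:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask, other=-float('inf'))
    x = x - tl.max(x, axis=0)
    # before: num = tl.exp(x)
    num = tl.math.fast_expf(x)  # fast approximate exponential
    denom = tl.sum(num, axis=0)
    tl.store(out_ptr + offs, num / denom, mask=mask)


x = torch.randn(1024, device='cuda', dtype=torch.float32)
out = torch.empty_like(x)
softmax_kernel[(1,)](x, out, x.numel(), BLOCK=1024)
```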
I may merge main so that I can use profile_generation.py to benchmark prefill.
Here is the result from the main branch:
python profile_generation.py /mnt/140/InternLM/internlm2_5-7b-chat-1m/ --tp 4 -c 1 -ct 1 -pt 1000000 --session-len 1048576 --backend pytorch -w 0 -tr 1
profiling ... concurrency: 1, n_prompt_token: 1000000, n_completion_token: 1, test_round: 1, warmup_round: 0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [29:01<00:00, 1741.85s/it]
--------------------------------------------------
total time: 1741.85s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 1739.192s, 1739.192s, 1739.192s
total_token latency(min, max, ave): 1739.192s, 1739.192s, 1739.192s
token_latency percentiles(50%,75%,95%,99%)(s): [1739.192, 1739.192, 1739.192, 1739.192]
throughput(output): 0.0 token/s
throughput(total): 574.1 token/s
Result for 3 test rounds:
total time: 3925.31s
concurrency: 1, test_round: 3
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 1284.825s, 1352.598s, 1307.773s
total_token latency(min, max, ave): 1284.825s, 1352.598s, 1307.773s
token_latency percentiles(50%,75%,95%,99%)(s): [1307.773, 1307.773, 1307.773, 1307.773]
throughput(output): 0.0 token/s
throughput(total): 764.27 token/s
The latency goes from 1741.85 s to 1284.825 s / 1352.598 s / 1307.773 s (min/max/avg), respectively.
There is still room for improvement. A test result from the turbomind engine:
total time: 431.54s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 431.534s, 431.534s, 431.534s
total_token latency(min, max, ave): 431.534s, 431.534s, 431.534s
token_latency percentiles(50%,75%,95%,99%)(s): [431.534, 431.534, 431.534, 431.534]
throughput(output): 0.0 token/s
throughput(total): 2317.27 token/s
Without warmup, the prefill time of the turbomind engine is 658 s:
root@676b9b0ce151:/workspace/lmdeploy/benchmark# python profile_generation.py /workspace/models-140/InternLM/internlm2_5-7b-chat-1m --session-len 1048576 -c 1 -ct 1 -pt 1000000 --cache-max-entry-count 0.7 --tp 4 -w 0 -tr 1
profiling ... concurrency: 1, n_prompt_token: 1000000, n_completion_token: 1, test_round: 1, warmup_round: 0
0%| | 0/1 [00:00<?, ?it/s][WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [10:58<00:00, 658.30s/it]
--------------------------------------------------
total time: 658.30s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 544.858s, 544.858s, 544.858s
total_token latency(min, max, ave): 544.858s, 544.858s, 544.858s
token_latency percentiles(50%,75%,95%,99%)(s): [544.858, 544.858, 544.858, 544.858]
throughput(output): 0.0 token/s
throughput(total): 1519.06 token/s
--------------------------------------------------
total time: 850.82s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 848.731s, 848.731s, 848.731s
total_token latency(min, max, ave): 848.731s, 848.731s, 848.731s
token_latency percentiles(50%,75%,95%,99%)(s): [848.731, 848.731, 848.731, 848.731]
throughput(output): 0.0 token/s
throughput(total): 1175.34 token/s
--------------------------------------------------
I tried the long-context test described in "long_context.md". It didn't generate any tokens when session_len is 800000, but the result shows finish_reason being stop:
41221 [Response(text='', generate_token_len=0, input_token_len=767201, session_id=0, finish_reason='stop', token_ids=[], logprobs=None, index=0)]
Forget about this case; I need to use tp. @AllentDan, EngineInstance.async_stream_infer returns an error, but AsyncEngine.generate doesn't handle it:
yield EngineOutput(ResponseType.INPUT_LENGTH_ERROR, [], 0)
I think we need to discuss how to handle this exceptional case.
We can either return the error message in the response or print the error to the console, since the server can't be aborted.
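A self-contained sketch of the kind of check I have in mind; ResponseType, EngineOutput, and the generate loop below are simplified stand-ins, not lmdeploy's actual classes:

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum, auto


class ResponseType(Enum):
    SUCCESS = auto()
    INPUT_LENGTH_ERROR = auto()


@dataclass
class EngineOutput:
    status: ResponseType
    token_ids: list = field(default_factory=list)
    num_token: int = 0


async def async_stream_infer(prompt_len, session_len):
    # Engine side: reject an over-long prompt instead of generating.
    if prompt_len >= session_len:
        yield EngineOutput(ResponseType.INPUT_LENGTH_ERROR, [], 0)
        return
    yield EngineOutput(ResponseType.SUCCESS, [42], 1)


async def generate(prompt_len, session_len=800000):
    # Server side: check the status of every EngineOutput so an error is
    # surfaced to the caller instead of being mapped to finish_reason='stop'.
    async for out in async_stream_infer(prompt_len, session_len):
        if out.status != ResponseType.SUCCESS:
            print(f'engine error: {out.status.name}')  # or return an error response
            return
        print(f'tokens: {out.token_ids}')


asyncio.run(generate(prompt_len=1000000))
```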