Torch engine: optimize prefill for long context
With long context, moving the full logits to the host is time-consuming. With this PR, full logits are not output unless a request requires return_logits.
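A minimal sketch of the idea (illustrative names and shapes, not the PR's actual code): when no request asks for return_logits, only the last position of each sequence is needed for sampling, so the hidden states can be sliced before the lm_head and before the device-to-host copy.

```python
import torch

def compute_logits(hidden_states: torch.Tensor,   # [num_tokens, hidden_dim]
                   lm_head: torch.nn.Linear,
                   last_token_idx: torch.Tensor,  # last-token index per sequence
                   return_logits: bool) -> torch.Tensor:
    if not return_logits:
        # For a long prefill, keep one row per sequence instead of the
        # full [num_tokens, vocab_size] logits matrix.
        hidden_states = hidden_states[last_token_idx]
    logits = lm_head(hidden_states)
    # The D2H copy is now tiny unless full logits were explicitly requested.
    return logits.cpu()

# Toy usage: one sequence of 8192 tokens, only its last token needs logits.
hidden = torch.randn(8192, 256)
head = torch.nn.Linear(256, 1024, bias=False)
logits = compute_logits(hidden, head, torch.tensor([8191]), return_logits=False)
print(logits.shape)  # torch.Size([1, 1024])
```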
exp is expensive in CUDA. Replace it with tl.math.fast_expf, which maps to the fast approximate ex2.approx.f32 instruction.
internlm2_5-7b, tp=2, 949840-token context: 808.1619 s
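As a hedged illustration (a toy Triton softmax, not the PR's actual kernel, assuming a Triton version that exposes tl.math.fast_expf), the substitution looks like this:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask, other=-float('inf'))
    x = x - tl.max(x, axis=0)
    # before: num = tl.exp(x)
    num = tl.math.fast_expf(x)  # fast approximate exponential
    denom = tl.sum(num, axis=0)
    tl.store(out_ptr + offs, num / denom, mask=mask)


x = torch.randn(1024, device='cuda', dtype=torch.float32)
out = torch.empty_like(x)
softmax_kernel[(1,)](x, out, x.numel(), BLOCK=1024)
```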
I may merge main so that I can use profile_generation.py to benchmark prefill.
Here is the result from the main branch:
python profile_generation.py /mnt/140/InternLM/internlm2_5-7b-chat-1m/ --tp 4 -c 1 -ct 1 -pt 1000000 --session-len 1048576 --backend pytorch -w 0 -tr 1
profiling ... concurrency: 1, n_prompt_token: 1000000, n_completion_token: 1, test_round: 1, warmup_round: 0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [29:01<00:00, 1741.85s/it]
--------------------------------------------------
total time: 1741.85s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 1739.192s, 1739.192s, 1739.192s
total_token latency(min, max, ave): 1739.192s, 1739.192s, 1739.192s
token_latency percentiles(50%,75%,95%,99%)(s): [1739.192, 1739.192, 1739.192, 1739.192]
throughput(output): 0.0 token/s
throughput(total): 574.1 token/s
Result for 3 test rounds:
total time: 3925.31s
concurrency: 1, test_round: 3
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 1284.825s, 1352.598s, 1307.773s
total_token latency(min, max, ave): 1284.825s, 1352.598s, 1307.773s
token_latency percentiles(50%,75%,95%,99%)(s): [1307.773, 1307.773, 1307.773, 1307.773]
throughput(output): 0.0 token/s
throughput(total): 764.27 token/s
The latency goes from 1741.85 s to 1284.825 s / 1352.598 s / 1307.773 s (min/max/avg), respectively.
There is still room for improvement. A test result from the turbomind engine:
total time: 431.54s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 431.534s, 431.534s, 431.534s
total_token latency(min, max, ave): 431.534s, 431.534s, 431.534s
token_latency percentiles(50%,75%,95%,99%)(s): [431.534, 431.534, 431.534, 431.534]
throughput(output): 0.0 token/s
throughput(total): 2317.27 token/s
Without warmup, the prefill time of the turbomind engine is 658 s:
root@676b9b0ce151:/workspace/lmdeploy/benchmark# python profile_generation.py /workspace/models-140/InternLM/internlm2_5-7b-chat-1m --session-len 1048576 -c 1 -ct 1 -pt 1000000 --cache-max-entry-count 0.7 --tp 4 -w 0 -tr 1
profiling ... concurrency: 1, n_prompt_token: 1000000, n_completion_token: 1, test_round: 1, warmup_round: 0
0%| | 0/1 [00:00<?, ?it/s][WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [10:58<00:00, 658.30s/it]
--------------------------------------------------
total time: 658.30s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 544.858s, 544.858s, 544.858s
total_token latency(min, max, ave): 544.858s, 544.858s, 544.858s
token_latency percentiles(50%,75%,95%,99%)(s): [544.858, 544.858, 544.858, 544.858]
throughput(output): 0.0 token/s
throughput(total): 1519.06 token/s
--------------------------------------------------
total time: 850.82s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 848.731s, 848.731s, 848.731s
total_token latency(min, max, ave): 848.731s, 848.731s, 848.731s
token_latency percentiles(50%,75%,95%,99%)(s): [848.731, 848.731, 848.731, 848.731]
throughput(output): 0.0 token/s
throughput(total): 1175.34 token/s
--------------------------------------------------
I tried the long-context test described in "long_context.md". It didn't generate any tokens when session_len is 800000, but the result shows finish_reason being stop:
41221 [Response(text='', generate_token_len=0, input_token_len=767201, session_id=0, finish_reason='stop', token_ids=[], logprobs=None, index=0)]
Forget about this case; I need to use tp. @AllentDan, EngineInstance.async_stream_infer returns an error, but AsyncEngine.generate doesn't handle it:
yield EngineOutput(ResponseType.INPUT_LENGTH_ERROR, [], 0)
I think we need to discuss how to handle this exceptional case.
We can either return the error message in the response or print the error to the console, since the server can't be aborted.
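A self-contained sketch of the kind of check I have in mind; ResponseType, EngineOutput, and the generate loop below are simplified stand-ins, not lmdeploy's actual classes:

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum, auto


class ResponseType(Enum):
    SUCCESS = auto()
    INPUT_LENGTH_ERROR = auto()


@dataclass
class EngineOutput:
    status: ResponseType
    token_ids: list = field(default_factory=list)
    num_token: int = 0


async def async_stream_infer(prompt_len, session_len):
    # Engine side: reject an over-long prompt instead of generating.
    if prompt_len >= session_len:
        yield EngineOutput(ResponseType.INPUT_LENGTH_ERROR, [], 0)
        return
    yield EngineOutput(ResponseType.SUCCESS, [42], 1)


async def generate(prompt_len, session_len=800000):
    # Server side: check the status of every EngineOutput so an error is
    # surfaced to the caller instead of being mapped to finish_reason='stop'.
    async for out in async_stream_infer(prompt_len, session_len):
        if out.status != ResponseType.SUCCESS:
            print(f'engine error: {out.status.name}')  # or return an error response
            return
        print(f'tokens: {out.token_ids}')


asyncio.run(generate(prompt_len=1000000))
```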