
[Langchain-Chatchat] Add time consumption message about first token and rest tokens


The current setup cannot report the time spent on the first token and the rest of the tokens. Could we add this message?

johnysh · Apr 02 '24

Hi @johnysh,

Currently, we do not natively support logging the first-token and rest-token latency for Langchain-Chatchat. However, you can obtain these timings with the help of the ipex-llm benchmark tool.
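
Under the hood, the benchmark tool provides a BenchmarkWrapper class that wraps a model and times each generate call. As a quick reference, here is a minimal standalone sketch of its usage (the model path, prompt, and loading code below are illustrative placeholders, not part of Langchain-Chatchat):

    # Minimal standalone sketch; assumes benchmark_util.py is importable from
    # the current directory. The model path and prompt are only examples.
    from transformers import AutoTokenizer
    from ipex_llm.transformers import AutoModelForCausalLM
    from benchmark_util import BenchmarkWrapper

    model_path = "meta-llama/Llama-2-7b-chat-hf"  # illustrative model
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model = BenchmarkWrapper(model)  # wrap once; generate() calls are now timed

    inputs = tokenizer("What is AI?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)

    # After a generate() call, the wrapper records the latencies:
    print(f"First token latency (s): {model.first_cost}")
    print(f"Rest token latency (s): {model.rest_cost_mean}")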

To use the benchmark tool in Langchain-Chatchat:

  1. Put benchmark_util.py into the conda environment you use for Langchain-Chatchat:

    The script should be placed at a path like the following (taking Linux as an example): /home/<user_name>/<anaconda3 or miniconda3>/envs/<your conda env name>/lib/python3.11/site-packages/ipex_llm/serving/fastchat/benchmark_util.py

  2. In /home/<user_name>/<anaconda3 or miniconda3>/envs/<your conda env name>/lib/python3.11/site-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py, wrap the model with BenchmarkWrapper:

    That is, change the model-loading code here to:

         self.model, self.tokenizer = load_model(
             model_path, device, self.load_in_low_bit, trust_remote_code
         )
    
         from .benchmark_util import BenchmarkWrapper
         self.model = BenchmarkWrapper(self.model)
    
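    BenchmarkWrapper is intended to be a drop-in wrapper around the model, so the rest of ipex_llm_worker.py should keep working unchanged while each generate call is timed.
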
  3. In /home/<user_name>/<anaconda3 or miniconda3>/envs/<your conda env name>/lib/python3.11/site-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py, add print messages for the first and rest token latency.

    That is, change the code here to:

         print(f"First token latency (s): {self.model.first_cost}", flush=True)
         print(f"Rest token latency (s): {self.model.rest_cost_mean}", flush=True)
    
         yield json.dumps(json_output).encode() + b"\0"
    

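With these changes, the two latency lines will be printed to the worker's log for each generation. Here, first_cost is the time to generate the first token (which includes the prompt prefill), and rest_cost_mean is the average per-token latency of the remaining tokens.
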
Please let us know if you have any further questions :)

Oscilloscope98 · Apr 03 '24