
[Langchain-Chatchat] Add time consumption message about first token and rest tokens


The current setup cannot report the time spent on the first token and the rest of the tokens. Could we add this message?

johnysh · Apr 02 '24

Hi @johnysh,

Currently, we do not natively support logging the first-token and rest-token latency for Langchain-Chatchat. However, you can obtain these timings with the help of the ipex-llm benchmark tool.
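
Under the hood, the benchmark tool provides a BenchmarkWrapper class that wraps a model and times each generate call. As a quick reference, here is a minimal standalone sketch of its usage (the model path, prompt, and loading code below are illustrative placeholders, not part of Langchain-Chatchat):

    # Minimal standalone sketch; assumes benchmark_util.py is importable from
    # the current directory. The model path and prompt are only examples.
    from transformers import AutoTokenizer
    from ipex_llm.transformers import AutoModelForCausalLM
    from benchmark_util import BenchmarkWrapper

    model_path = "meta-llama/Llama-2-7b-chat-hf"  # illustrative model
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model = BenchmarkWrapper(model)  # wrap once; generate() calls are now timed

    inputs = tokenizer("What is AI?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)

    # After a generate() call, the wrapper records the latencies:
    print(f"First token latency (s): {model.first_cost}")
    print(f"Rest token latency (s): {model.rest_cost_mean}")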

To use the benchmark tool in Langchain-Chatchat:

  1. Put benchmark_util.py into the conda environment you use for Langchain-Chatchat:

    The script should be placed at a path like the following (taking Linux as an example): /home/<user_name>/<anaconda3 or miniconda3>/envs/<your conda env name>/lib/python3.11/site-packages/ipex_llm/serving/fastchat/benchmark_util.py

  2. In /home/<user_name>/<anaconda3 or miniconda3>/envs/<your conda env name>/lib/python3.11/site-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py, wrap the model with BenchmarkWrapper:

    That is, change the model-loading code here to:

         self.model, self.tokenizer = load_model(
             model_path, device, self.load_in_low_bit, trust_remote_code
         )
    
         from .benchmark_util import BenchmarkWrapper
         self.model = BenchmarkWrapper(self.model)
    
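    BenchmarkWrapper is intended to be a drop-in wrapper around the model, so the rest of ipex_llm_worker.py should keep working unchanged while each generate call is timed.
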
  3. In /home/<user_name>/<anaconda3 or miniconda3>/envs/<your conda env name>/lib/python3.11/site-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py, add print messages for the first and rest token latency.

    That is, change the code here to:

         print(f"First token latency (s): {self.model.first_cost}", flush=True)
         print(f"Rest token latency (s): {self.model.rest_cost_mean}", flush=True)
    
         yield json.dumps(json_output).encode() + b"\0"
    

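With these changes, the two latency lines will be printed to the worker's log for each generation. Here, first_cost is the time to generate the first token (which includes the prompt prefill), and rest_cost_mean is the average per-token latency of the remaining tokens.
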
Please let us know if you have any further questions :)

Oscilloscope98 · Apr 03 '24