Error: cannot run benchmark for chatglm2 6b (The same script works for llama2 7b)
Describe the bug
(py310) ubuntu@820e48bcba8b:~/llm$ python run.py --benchmark -m /model/chatglm2_6b/ --dtype bfloat16 --input-tokens 64 --batch-size 1 --num-iter 5 --num-warmup 1 --token-latency
Namespace(model_id='/model/chatglm2_6b/', dtype='bfloat16', input_tokens='64', max_new_tokens=32, prompt=None, config_file=None, greedy=False, ipex=False, deployment_mode=True, torch_compile=False, backend='ipex', profile=False, benchmark=True, num_iter=5, num_warmup=1, batch_size=1, token_latency=True)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00, 6.52it/s]
---- Prompt size: 64
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:392: UserWarning: do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
warnings.warn(
Traceback (most recent call last):
File "/home/ubuntu/llm/single_instance/run_generation.py", line 219, in
Versions
Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.3.0+git401e950
[pip3] numpy==1.26.3
[pip3] torch==2.3.0.dev20240128+cpu
[conda] intel-extension-for-pytorch 2.3.0+git401e950 pypi_0 pypi
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.3.0.dev20240128+cpu pypi_0 pypi
@andyluo7 I will work on reproducing this issue and get back to you with findings.
Did you try passing in THUDM/chatglm2-6b directly as the model?
Issue reproduced. What version of transformers are you using? I have 4.37.0.
I will be working with the team to resolve your issue.
@alexsin368, I have the same version of transformers (4.37.0) in the Docker container.
@andyluo7 I found out what's causing the issue. It happens when you pass --token-latency as an input argument. Take a look at lines 211 and 215:
https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/inference/python/llm/single_instance/run_generation.py#L211-L215
For now, try running without --token-latency. I will escalate this and get it resolved in the next release.
@alexsin368, I want to measure the time to the first token, which is why I used --token-latency.
@andyluo7 As a workaround for now, modify line 211 in the run_generation.py script to gen_ids = output and it should work.
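For context, here is a minimal, self-contained sketch of the failure pattern and the workaround; the variable names, shapes, and flag below are my illustration, not the verbatim code at lines 211-215:

import torch

# Stand-in for what Hugging Face generate() returns on the plain (non --ipex)
# path: a [batch_size, seq_len] tensor of generated token ids.
output = torch.randint(0, 30000, (1, 96))

# With --token-latency, the script expects generate() to return a
# (generated_ids, per_token_latencies) tuple, which only the --ipex
# optimized path produces. On a plain id tensor, output[0] is just the
# first sequence in the batch, and output[1] fails when the batch size is 1.
try:
    gen_ids, latency_list = output[0], output[1]
except IndexError as err:
    print(f"breaks without --ipex: {err}")

# Workaround suggested above: make line 211 simply
#     gen_ids = output
# so the tensor returned by generate() is used as the generated ids directly.
gen_ids = output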
Hi @andyluo7 @alexsin368, I was able to run the chatglm2 model recently, and my library info is shown below:
[conda] intel-extension-for-pytorch 2.3.0+git6047b54 pypi_0 pypi
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.3.0.dev20240128+cpu pypi_0 pypi
I found that we may need to do the two steps below before running chatglm2 through IPEX LLM:
- upgrade the Hugging Face transformers library to the latest version
- make the following modification in the chatglm2-6B config.json file: "torch_dtype": "float32" (a sketch of this edit is at the end of this comment)

On my side, after doing the above steps, I could run bf16 and smoothquant successfully. Hope it can help you.
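For reference, a minimal sketch of that config.json edit, assuming the checkpoint path from the original benchmark command (/model/chatglm2_6b/); the transformers upgrade itself is just pip install -U transformers:

import json
from pathlib import Path

# Path taken from the original benchmark command; adjust it if your
# checkpoint lives somewhere else.
config_path = Path("/model/chatglm2_6b/config.json")

config = json.loads(config_path.read_text())
config["torch_dtype"] = "float32"  # the modification described above
config_path.write_text(json.dumps(config, indent=2))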
@andyluo7 When running with --token-latency, you also need to add the --ipex argument. Does it work for you when you run:
python run.py --benchmark -m /model/chatglm2_6b/ --dtype bfloat16 --input-tokens 64 --batch-size 1 --num-iter 5 --num-warmup 1 --token-latency --ipex
@andyluo7 We have merged a PR that will warn you if you try to use --token-latency without including the --ipex argument: https://github.com/intel-innersource/frameworks.ai.pytorch.ipex-cpu/pull/2639
If you have no other issues, we can close this GitHub issue.
Issue has been fixed. Closing issue.