intel-extension-for-pytorch

Error: cannot run benchmark for chatglm2 6b (The same script works for llama2 7b)

andyluo7 opened this issue 1 year ago · 9 comments

Describe the bug

(py310) ubuntu@820e48bcba8b:~/llm$ python run.py --benchmark -m /model/chatglm2_6b/ --dtype bfloat16 --input-tokens 64 --batch-size 1 --num-iter 5 --num-warmup 1 --token-latency

Namespace(model_id='/model/chatglm2_6b/', dtype='bfloat16', input_tokens='64', max_new_tokens=32, prompt=None, config_file=None, greedy=False, ipex=False, deployment_mode=True, torch_compile=False, backend='ipex', profile=False, benchmark=True, num_iter=5, num_warmup=1, batch_size=1, token_latency=True)
Loading checkpoint shards: 100%|██████████| 7/7 [00:01<00:00, 6.52it/s]
---- Prompt size: 64
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:392: UserWarning: do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
  warnings.warn(

Traceback (most recent call last):
  File "/home/ubuntu/llm/single_instance/run_generation.py", line 219
    output_tokens_lengths = [x.shape[0] for x in gen_ids]
  File "/home/ubuntu/llm/single_instance/run_generation.py", line 219
    output_tokens_lengths = [x.shape[0] for x in gen_ids]
IndexError: tuple index out of range
LLM RUNTIME ERROR: Running generation task failed. Quit.
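For context on the IndexError itself: x.shape[0] fails when x is a zero-dimensional tensor, which is what you get by iterating over a single 1-D sequence instead of a 2-D batch of sequences. A minimal sketch of that mechanism (independent of run_generation.py; the tensors here are stand-ins):

    import torch

    # Stand-in for generated ids: a [batch, seq] tensor, as the length computation expects.
    batch = torch.randint(0, 100, (1, 8))
    print([x.shape[0] for x in batch])   # fine: one length per sequence -> [8]

    # If the batch dimension is accidentally stripped, iteration yields 0-d scalars.
    single = batch[0]                    # 1-D tensor of 8 token ids
    try:
        [x.shape[0] for x in single]     # each x has shape torch.Size([])
    except IndexError as e:
        print("IndexError:", e)          # tuple index out of range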

Versions

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.3.0+git401e950
[pip3] numpy==1.26.3
[pip3] torch==2.3.0.dev20240128+cpu
[conda] intel-extension-for-pytorch  2.3.0+git401e950       pypi_0  pypi
[conda] mkl                          2023.1.0               h213fc3f_46344
[conda] numpy                        1.26.3                 pypi_0  pypi
[conda] torch                        2.3.0.dev20240128+cpu  pypi_0  pypi

andyluo7 avatar Feb 11 '24 01:02 andyluo7

@andyluo7 I will work on reproducing this issue and get back to you with findings.

Did you try passing in THUDM/chatglm2-6b directly as the model?

alexsin368 avatar Feb 12 '24 21:02 alexsin368

Issue reproduced. What version of transformers are you using? I have 4.37.0.

I will be working with the team to resolve your issue.

alexsin368 avatar Feb 13 '24 01:02 alexsin368

@alexsin368, I have the same version of transformers (4.37.0) in the Docker container.

andyluo7 avatar Feb 13 '24 01:02 andyluo7

@andyluo7 I found out what's causing the issue: it happens when you pass --token-latency as an input argument. Take a look at lines 211 and 215:

https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/inference/python/llm/single_instance/run_generation.py#L211-L215

For now, try running without --token-latency. I will escalate this and get it resolved in the next release.

alexsin368 avatar Feb 14 '24 00:02 alexsin368

@alexsin368, I want to get the time to the first token, so I used --token-latency.
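In the meantime, I can roughly estimate the time to the first token by timing a greedy generate() call capped at one new token. This is a sketch that loads the model directly with transformers (using the same local model path as in my benchmark command), not the benchmark script's own measurement:

    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "/model/chatglm2_6b/"  # same local path as in the benchmark command
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).eval()

    inputs = tokenizer("What is the capital of France?", return_tensors="pt")
    with torch.no_grad():
        start = time.time()
        model.generate(**inputs, max_new_tokens=1, do_sample=False)
        first_token_latency = time.time() - start
    print(f"approx. time to first token: {first_token_latency:.3f} s")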

andyluo7 avatar Feb 14 '24 01:02 andyluo7

@andyluo7 As a workaround for now, modify line 211 in the run_generation.py script to gen_ids = output and it should work.
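Roughly speaking (the surrounding code in run_generation.py may differ), without --ipex the generate() call appears to return the token ids directly as a [batch, seq] tensor rather than a tuple that also carries latencies, so keeping the whole tensor preserves the batch dimension that the length computation needs. A small stand-in illustration:

    import torch

    # Stand-in for the generate() return value without --ipex: just the ids, shape [batch, seq].
    output = torch.randint(0, 100, (1, 96))

    gen_ids = output                      # workaround: keep the full batch instead of output[0]
    output_tokens_lengths = [x.shape[0] for x in gen_ids]
    print(output_tokens_lengths)          # [96] -- one generated length per sequence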

alexsin368 avatar Feb 14 '24 02:02 alexsin368

Hi @andyluo7 @alexsin368, I was able to run the chatglm2 model recently; my library info is shown below:
[conda] intel-extension-for-pytorch  2.3.0+git6047b54       pypi_0  pypi
[conda] mkl                          2023.1.0               h213fc3f_46344
[conda] numpy                        1.26.4                 pypi_0  pypi
[conda] torch                        2.3.0.dev20240128+cpu  pypi_0  pypi

I found we may need to do the two steps below before running chatglm2 through IPEX LLM:

  1. Upgrade the Hugging Face transformers library to the latest version.
  2. Make the following modification in the chatglm2-6B config.json file: "torch_dtype": "float32" (a sketch of this edit is below).

On my side, after doing the above steps, I could run bf16 and smoothquant successfully.

Hope it can help you.
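For step 2, a minimal sketch of patching the local config.json (the model path is taken from the command earlier in this issue; adjust it to wherever your copy of chatglm2-6b lives):

    # Step 1 is simply: pip install --upgrade transformers
    import json
    from pathlib import Path

    config_path = Path("/model/chatglm2_6b") / "config.json"  # assumed local model path
    config = json.loads(config_path.read_text())
    config["torch_dtype"] = "float32"                          # step 2 from the list above
    config_path.write_text(json.dumps(config, indent=2))
    print("torch_dtype is now", config["torch_dtype"])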

zhangnju avatar Feb 19 '24 06:02 zhangnju

@andyluo7 When running with --token-latency, you also need to add the --ipex argument. Does it work for you when you run:

python run.py --benchmark -m /model/chatglm2_6b/ --dtype bfloat16 --input-tokens 64 --batch-size 1 --num-iter 5 --num-warmup 1 --token-latency --ipex

alexsin368 avatar Feb 21 '24 22:02 alexsin368

@andyluo7 We have merged a PR that will give you a warning if you try to use --token-latency without including the --ipex argument: https://github.com/intel-innersource/frameworks.ai.pytorch.ipex-cpu/pull/2639

If you have no other issues, we can close this GitHub issue.

alexsin368 avatar Mar 20 '24 22:03 alexsin368

Issue has been fixed. Closing issue.

alexsin368 avatar Apr 17 '24 22:04 alexsin368