GPTQ-for-LLaMa
Errors encountered when running benchmark FP16 baseline on multiple GPUs
Trying to run the FP16 baseline benchmark for the LLaMA 30B model on a server with 8 V100 32 GB GPUs:
CUDA_VISIBLE_DEVICES=0,1 python llama.py /dev/shm/ly/models/hf_converted_llama/30B/ wikitext2 --benchmark 2048 --check
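For context on why the command above spans two GPUs, a back-of-the-envelope sketch: a 30B-parameter model in FP16 takes about 2 bytes per parameter, so the weights alone exceed a single 32 GB V100 and must be sharded across at least two cards. The helper names below are mine, purely for illustration; the calculation ignores activations, the KV cache, and memory fragmentation.

```python
import math

def fp16_weight_gib(n_params: float) -> float:
    """Approximate weight memory in GiB at FP16 (2 bytes per parameter)."""
    return n_params * 2 / (1024 ** 3)

def min_gpus_needed(n_params: float, gpu_gib: float = 32.0) -> int:
    """Smallest number of GPUs whose combined memory holds the weights
    (weights only -- no activations, KV cache, or fragmentation)."""
    return math.ceil(fp16_weight_gib(n_params) / gpu_gib)

if __name__ == "__main__":
    print(f"30B FP16 weights: ~{fp16_weight_gib(30e9):.1f} GiB")  # ~55.9 GiB
    print(f"Minimum 32 GB GPUs: {min_gpus_needed(30e9)}")          # 2
```

This is why `CUDA_VISIBLE_DEVICES=0,1` is the minimum viable setting here; a single GPU would OOM while loading the checkpoint.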
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:49<00:00, 7.09s/it]
Using the latest cached version of the module from /home/ly/.cache/huggingface/modules/datasets_modules/datasets/wikitext/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126 (last modified on Tue Apr 11 15:29:08 2023) since it couldn't be found locally at wikitext., or remotely on the Hugging Face Hub.
Found cached dataset wikitext (/home/ly/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Using the latest cached version of the module from /home/ly/.cache/huggingface/modules/datasets_modules/datasets/wikitext/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126 (last modified on Tue Apr 11 15:29:08 2023) since it couldn't be found locally at wikitext., or remotely on the Hugging Face Hub.
Found cached dataset wikitext (/home/ly/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Benchmarking ...
Traceback (most recent call last):
File "/home/ly/GPTQ-for-LLaMa/llama.py", line 492, in
accelerate nailed it:
Still not working. After replacing the code as mentioned, I got the following error: