Describe the bug
With the demo run_llama_int8.py, setting generate_kwargs["do_sample"] to True, I get the following error:
command:
python run_llama_int8.py -m ${MODEL_ID} --quantized-model-path "/workspace/saved_results/best_model.pt" --benchmark --jit --int8-bf16-mixed --num-iter 5 --prompt "hello"
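For reference, this is roughly how the sampling path is invoked (a sketch, not the script verbatim; user_model, generate_kwargs and do_sample come from the script and the traceback below, the model loading and the other kwargs are placeholders I added):

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder; I pass my own model via ${MODEL_ID}
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
user_model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
input_ids = tokenizer("hello", return_tensors="pt").input_ids

generate_kwargs = dict(do_sample=True, num_beams=1)  # do_sample=True is what triggers the error
output = user_model.generate(input_ids, max_new_tokens=32, **generate_kwargs)
print(tokenizer.decode(output[0], skip_special_tokens=True))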
error log:
/opt/conda/lib/python3.9/site-packages/transformers/generation/utils.py:1405: UserWarning: You are calling .generate() with the input_ids being on a device type different than your model's device. input_ids is on cpu, whereas the model is on meta. You may experience unexpected behaviors or slower generation. Please make sure that you have put input_ids to the correct device by calling for example input_ids = input_ids.to('meta') before running .generate().
warnings.warn(
Traceback (most recent call last):
File "/lzw/run_llama_int8.py", line 378, in
output = user_model.generate(
File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/transformers/generation/utils.py", line 1485, in generate
return self.sample(
File "/opt/conda/lib/python3.9/site-packages/transformers/generation/utils.py", line 2524, in sample
outputs = self(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/models.py", line 624, in LlamaForCausalLM_forward
outputs = self.model(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/models.py", line 283, in LlamaModel_forward
attention_mask = self._prepare_decoder_attention_mask(
File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/attentions.py", line 65, in _prepare_decoder_attention_mask
combined_attention_mask = _make_causal_mask(
File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/attentions.py", line 18, in _make_causal_mask
mask = torch.full(
NotImplementedError: Could not run 'aten::_local_scalar_dense' with arguments from the 'Meta' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_local_scalar_dense' is only available for these backends: [CPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradMeta, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
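My guess at the root cause (an assumption on my side, not verified against the IPEX sources): the model has ended up on the meta device (see the warning above), and the causal-mask construction passes a 0-dim tensor as the fill_value of torch.full. Converting that tensor to a Python scalar requires aten::_local_scalar_dense, which is not implemented for the Meta backend. A minimal snippet that reproduces the same error:

import torch

# A 0-dim tensor on the meta device, standing in for the
# torch.tensor(torch.finfo(dtype).min, device=device) fill value used by the mask code.
dtype = torch.bfloat16
device = torch.device("meta")
fill = torch.full((), torch.finfo(dtype).min, dtype=dtype, device=device)

# torch.full() converts a tensor fill value to a Python scalar via .item(),
# which dispatches aten::_local_scalar_dense -- not implemented on the Meta backend.
mask = torch.full((4, 4), fill, dtype=dtype, device=device)  # raises NotImplementedError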
do_sample is an important feature for me.
Versions
[pip3] intel-extension-for-pytorch==2.1.0.dev0+cpu.llm
[pip3] numpy==1.24.1
[pip3] torch==2.1.0.dev20230711+cpu
[pip3] torchaudio==2.1.0.dev20230711+cpu
[pip3] torchvision==0.16.0.dev20230711+cpu
[conda] intel-extension-for-pytorch 2.1.0.dev0+cpu.llm pypi_0 pypi
[conda] numpy 1.24.1 pypi_0 pypi
[conda] torch 2.1.0.dev20230711+cpu pypi_0 pypi
[conda] torchaudio 2.1.0.dev20230711+cpu pypi_0 pypi
[conda] torchvision 0.16.0.dev20230711+cpu pypi_0 pypi