
[BUG] Phi3 Medium int4 Runtime Error: probability tensor contains either `inf`, `nan` or element < 0

Open fakezeta opened this issue 8 months ago • 6 comments

🐛 Describe the bug

Hi,

Running Phi3 Medium on LocalAI with the OpenVINO backend, I found that while the int8 quantization works correctly, the int4 quantization raises the following error after a few tokens have been generated:

12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr Exception in thread Thread-5 (generate):
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr Traceback (most recent call last):
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     self.run()
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/usr/lib/python3.10/threading.py", line 953, in run
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     self._target(*self._args, **self._kwargs)
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     return func(*args, **kwargs)
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/optimum/intel/openvino/modeling_decoder.py", line 651, in generate
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     result = super().generate(
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     return func(*args, **kwargs)
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     result = self._sample(
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2437, in _sample
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr   File "/build/backend/python/transformers/venv/lib/python3.10/site-packages/nncf/torch/dynamic_graph/wrappers.py", line 81, in wrapped
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr     op1 = operator(*args, **kwargs)
12:41PM DBG GRPC(fakezeta/Phi-3-medium-4k-instruct-ov-int4-127.0.0.1:43099): stderr RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

The models are https://huggingface.co/fakezeta/Phi-3-medium-4k-instruct-ov-int4 and https://huggingface.co/fakezeta/Phi-3-medium-4k-instruct-ov-int8.
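For context, a sketch of how an int4 OpenVINO checkpoint like this is produced via optimum-intel (NNCF does the weight compression under the hood); the exact compression options used for the uploaded models are an assumption here:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Sketch with default int4 weight-compression options (group size, ratio,
# etc. left at their defaults; the uploaded checkpoints may use other values).
model = OVModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-medium-4k-instruct",
    export=True,
    trust_remote_code=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained("Phi-3-medium-4k-instruct-ov-int4")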

I'm opening the issue here since int8 is working.
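For anyone reproducing outside LocalAI, here is a minimal standalone sketch that exercises the same code path; per the traceback, the failure is in the sampling branch, so do_sample=True is what reaches torch.multinomial. The prompt and sampling values are placeholders:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "fakezeta/Phi-3-medium-4k-instruct-ov-int4"
model = OVModelForCausalLM.from_pretrained(model_id)  # already exported, no export=True
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("<|user|>\nWhat is OpenVINO?<|end|>\n<|assistant|>\n",
                   return_tensors="pt")
outputs = model.generate(**inputs,
                         max_new_tokens=256,
                         do_sample=True,   # sampling path where the RuntimeError fires
                         temperature=0.7,
                         top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))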

Environment

about-time==4.2.1
accelerate==0.31.0
aiohttp==3.9.5
aiosignal==1.3.1
alive-progress==3.1.5
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
autograd==1.6.2
bitsandbytes==0.43.1
certifi==2024.6.2
charset-normalizer==3.3.2
cma==3.2.2
coloredlogs==15.0.1
contourpy==1.2.1
cycler==0.12.1
datasets==2.14.4
deprecated==1.2.14
dill==0.3.7
filelock==3.15.4
fonttools==4.53.0
frozenlist==1.4.1
fsspec==2024.6.0
future==1.0.0
grapheme==0.6.0
grpcio==1.64.0
huggingface-hub==0.23.4
humanfriendly==10.0
idna==3.7
inquirerpy==0.3.4
intel-extension-for-pytorch==2.1.30.post0
intel-extension-for-transformers==1.4.2
jinja2==3.1.4
joblib==1.4.2
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jstyleson==0.0.2
kiwisolver==1.4.5
markdown-it-py==3.0.0
markupsafe==2.1.5
matplotlib==3.9.0
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
natsort==8.4.0
networkx==3.3
neural-compressor==2.4.1
ninja==1.11.1.1
nncf==2.11.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
onnx==1.16.1
opencv-python-headless==4.10.0.84
openvino==2024.2.0
openvino-telemetry==2024.1.0
openvino-tokenizers==2024.2.0.0
optimum==1.20.0
optimum-intel==1.17.2
packaging==24.1
pandas==2.2.2
pfzy==0.3.4
pillow==10.3.0
prettytable==3.10.0
prompt-toolkit==3.0.47
protobuf==5.27.1
psutil==6.0.0
py-cpuinfo==9.0.0
pyarrow==16.1.0
pycocotools==2.0.8
pydantic==2.7.4
pydantic-core==2.18.4
pydot==2.0.0
pygments==2.18.0
pymoo==0.6.1.1
pyparsing==3.1.2
python-dateutil==2.9.0.post0
pytz==2024.1
pyyaml==6.0.1
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
rpds-py==0.18.1
safetensors==0.4.3
schema==0.7.7
scikit-learn==1.5.0
scipy==1.13.1
sentencepiece==0.2.0
setuptools==69.5.1
six==1.16.0
sympy==1.12.1
tabulate==0.9.0
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.19.1
torch==2.1.0.post2+cxx11.abi
tqdm==4.66.4
transformers==4.41.2
triton==2.3.1
typing-extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wcwidth==0.2.13
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.4

Python 3.10.12 in a Docker image based on Ubuntu 22.04.4, running in a Proxmox VM on an i5-12600 with 48 GB of RAM.

Minimal Reproducible Example

Model definition for LocalAI

name: phi3-medium
backend: transformers
parameters:
  model: fakezeta/Phi-3-medium-4k-instruct-ov-int4
context_size: 4096
type: OVModelForCausalLM
template:
  use_tokenizer_template: true
stopwords:
- "<|end|>"
- "<|endoftext|>"

Relevant code: the model is loaded with

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

ovconfig = {"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT",
            "GPU_DISABLE_WINOGRAD_CONVOLUTION": "YES"}
# class selected by `type: OVModelForCausalLM` in the model definition above
self.model = OVModelForCausalLM.from_pretrained(model_name,
                                                compile=True,
                                                trust_remote_code=request.TrustRemoteCode,
                                                ov_config=ovconfig,
                                                export=True,
                                                device=device_map)

self.tokenizer = AutoTokenizer.from_pretrained(model_name, use_safetensors=True)

Inference is done with:

from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(self.tokenizer,
                                skip_prompt=True,
                                skip_special_tokens=True)
config = dict(inputs,
              max_new_tokens=max_tokens,
              temperature=request.Temperature,
              top_p=request.TopP,
              top_k=request.TopK,
              do_sample=sample,
              attention_mask=inputs["attention_mask"],
              eos_token_id=self.tokenizer.eos_token_id,
              pad_token_id=self.tokenizer.eos_token_id,
              streamer=streamer,
              stopping_criteria=criteria,
              use_cache=True)
thread = Thread(target=self.model.generate, kwargs=config)
thread.start()
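For what it's worth, a workaround sketch rather than a fix: greedy decoding never reaches torch.multinomial, so it sidesteps the crash at the cost of deterministic output. It does not address the underlying int4 quantization problem:

# Drop the sampling parameters and disable sampling; generation then takes
# the greedy (argmax) branch and torch.multinomial() is never called.
config_greedy = {k: v for k, v in config.items()
                 if k not in ("temperature", "top_p", "top_k")}
config_greedy["do_sample"] = False
thread = Thread(target=self.model.generate, kwargs=config_greedy)
thread.start()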

Are you going to submit a PR?

  • [ ] Yes I'd like to help by submitting a PR!

fakezeta · Jun 24 '24 13:06