Llama4 inference encounters an unsupported op in dynamo?
System Info
transformers==4.51.2 torch==2.5.0 torchvision==0.20.0
Miscellaneous package versions (`pip list` output):

```
Package Version
accelerate 1.6.0
aiofiles 23.2.1
aiohappyeyeballs 2.6.1
aiohttp 3.11.16
aiosignal 1.3.2
altair 5.5.0
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.9.0
async-timeout 5.0.1
attrs 25.3.0
bitsandbytes 0.45.4
cachetools 5.5.2
certifi 2025.1.31
charset-normalizer 3.4.1
click 8.1.8
contourpy 1.3.1
cycler 0.12.1
datasets 3.5.0
decord 0.6.0
deepspeed 0.14.4
descartes 1.1.0
dill 0.3.8
distro 1.9.0
docker-pycreds 0.4.0
einops 0.6.1
einops-exts 0.0.4
et_xmlfile 2.0.0
exceptiongroup 1.2.2
fastapi 0.115.12
ffmpy 0.5.0
filelock 3.18.0
fire 0.7.0
flash-attn 2.7.3
fonttools 4.56.0
frozenlist 1.5.0
fsspec 2024.12.0
gitdb 4.0.12
GitPython 3.1.44
gradio_client 0.8.1
h11 0.14.0
hf-xet 1.0.3
hjson 3.1.0
httpcore 0.17.3
httpx 0.24.0
huggingface-hub 0.30.2
idna 3.10
imageio 2.37.0
importlib_resources 6.5.2
inquirerpy 0.3.4
Jinja2 3.1.6
jiter 0.9.0
joblib 1.4.2
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
kagglehub 0.3.11
kiwisolver 1.4.8
latex2mathml 3.77.0
markdown-it-py 3.0.0
markdown2 2.5.3
MarkupSafe 3.0.2
matplotlib 3.5.3
mdurl 0.1.2
mpmath 1.3.0
multidict 6.4.3
multiprocess 0.70.16
narwhals 1.32.0
networkx 3.4.2
ninja 1.11.1.4
numpy 1.26.4
nuscenes-devkit 1.1.11
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-ml-py 12.570.86
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
omegaconf 2.3.0
openai 1.74.0
opencv-python 4.11.0.86
opencv-python-headless 4.11.0.86
openpyxl 3.1.5
orjson 3.10.16
packaging 24.2
pandas 2.2.3
peft 0.10.0
pfzy 0.3.4
pillow 11.1.0
pip 25.0
platformdirs 4.3.7
portalocker 3.1.1
prompt_toolkit 3.0.51
propcache 0.3.1
protobuf 5.29.4
psutil 7.0.0
py-cpuinfo 9.0.0
pyarrow 19.0.1
pycocotools 2.0.8
pydantic 2.10.6
pydantic_core 2.27.2
pydub 0.25.1
Pygments 2.19.1
pynvml 12.0.0
pyparsing 3.2.3
pyquaternion 0.9.9
python-dateutil 2.9.0.post0
python-dotenv 1.1.0
python-multipart 0.0.20
pytz 2025.2
PyYAML 6.0.2
referencing 0.36.2
regex 2024.11.6
requests 2.32.3
rich 13.9.4
rpds-py 0.24.0
ruff 0.11.2
safetensors 0.5.3
scikit-learn 1.2.2
scipy 1.15.2
semantic-version 2.10.0
sentencepiece 0.1.99
sentry-sdk 2.24.1
setproctitle 1.3.5
setuptools 75.8.0
Shapely 1.8.5.post1
shellingham 1.5.4
shortuuid 1.0.13
six 1.17.0
smmap 5.0.2
sniffio 1.3.1
starlette 0.46.1
sty 1.0.6
svgwrite 1.4.3
sympy 1.13.1
tabulate 0.9.0
termcolor 3.0.1
threadpoolctl 3.6.0
tiktoken 0.8.0
timeout-decorator 0.5.0
timm 0.6.13
tokenizers 0.21.1
tomlkit 0.12.0
torch 2.5.0
torchvision 0.20.0
tqdm 4.67.1
transformers 4.51.2
triton 3.1.0
trl 0.16.1
typer 0.15.2
typing_extensions 4.12.2
tzdata 2025.2
urllib3 2.3.0
uvicorn 0.34.0
validators 0.34.0
wandb 0.19.8
wavedrom 2.0.3.post3
wcwidth 0.2.13
websockets 11.0.3
wheel 0.45.1
XlsxWriter 3.2.2
xxhash 3.5.0
yarl 1.19.0
```
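The three versions most relevant to this report can be double-checked with a quick snippet (plain sanity check, nothing issue-specific):

```python
import torch
import torchvision
import transformers

# Confirm the versions quoted in System Info above.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```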
Who can help?
No response
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch
import torch._dynamo

torch._dynamo.config.suppress_errors = True
torch._dynamo.config.capture_scalar_outputs = False


def get_modules():
    model_id = "/data/joseph/.cache/hub/models--meta-llama--Llama-4-Scout-17B-16E-Instruct/snapshots/7dab2f5f854fe665b6b2f1eccbd3c48e5f627ad8"  # "meta-llama/Llama-4-Scout-17B-16E-Instruct"
    # official setup args
    processor = AutoProcessor.from_pretrained(model_id)
    model = Llama4ForConditionalGeneration.from_pretrained(
        model_id,
        attn_implementation="flex_attention",
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
    return model, processor


if __name__ == '__main__':
    from huggingface_hub import login
    import os

    # login to HF to load llama4
    # login(os.environ.get("HF_TOKEN"))
    model, processor = get_modules()

    url1 = "/data/joseph/llama4_inference/playground/rabbit.jpg"
    url2 = "/data/joseph/llama4_inference/playground/cat_style_layout.png"
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "path": url1},
                {"type": "image", "path": url2},
                {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
            ]
        },
    ]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    # bug happens here!
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
    )
    response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
    print(response)
    print(outputs[0])
```
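To narrow down whether the failure comes from `flex_attention` itself rather than the Llama4 modeling code, here is a minimal isolation test I put together (a sketch; the tensor shapes are arbitrary placeholders, not the model's real dimensions):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Arbitrary placeholder shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(1, 2, 128, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 2, 128, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 2, 128, 64, device="cuda", dtype=torch.bfloat16)

# fullgraph=True makes dynamo raise on any graph break / unsupported op
# instead of silently falling back, so the offending op shows up directly.
compiled = torch.compile(flex_attention, fullgraph=True)
print(compiled(q, k, v).shape)
```

If this compiles cleanly, the unsupported op presumably comes from something in the Llama4 forward pass rather than from `flex_attention` itself.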
For the dynamo config, I tried several combinations, but none of them works:

- `torch._dynamo.config.suppress_errors = True` and `torch._dynamo.config.capture_scalar_outputs = False`
- Both `True` (this combination seems to raise the error at a very low level... orz)
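As a possible workaround rather than a fix (my own suggestion, not part of the official snippet): loading the model with the `sdpa` attention implementation should avoid the flex_attention/dynamo path entirely:

```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)

# "sdpa" routes attention through torch.nn.functional.scaled_dot_product_attention,
# which does not require a dynamo-compiled kernel the way flex_attention does.
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```

That only sidesteps the bug, though; flex_attention is the documented setup for Llama4, so it would be good to know why dynamo rejects it here.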
Expected behavior
The official code snippet should run to completion and print the model's response.