Llama4 inference encounters an unsupported op in dynamo?
System Info
transformers==4.51.2 torch==2.5.0 torchvision==0.20.0
Miscellaneous package versions (`pip list` output):

```
Package Version
accelerate 1.6.0
aiofiles 23.2.1
aiohappyeyeballs 2.6.1
aiohttp 3.11.16
aiosignal 1.3.2
altair 5.5.0
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.9.0
async-timeout 5.0.1
attrs 25.3.0
bitsandbytes 0.45.4
cachetools 5.5.2
certifi 2025.1.31
charset-normalizer 3.4.1
click 8.1.8
contourpy 1.3.1
cycler 0.12.1
datasets 3.5.0
decord 0.6.0
deepspeed 0.14.4
descartes 1.1.0
dill 0.3.8
distro 1.9.0
docker-pycreds 0.4.0
einops 0.6.1
einops-exts 0.0.4
et_xmlfile 2.0.0
exceptiongroup 1.2.2
fastapi 0.115.12
ffmpy 0.5.0
filelock 3.18.0
fire 0.7.0
flash-attn 2.7.3
fonttools 4.56.0
frozenlist 1.5.0
fsspec 2024.12.0
gitdb 4.0.12
GitPython 3.1.44
gradio_client 0.8.1
h11 0.14.0
hf-xet 1.0.3
hjson 3.1.0
httpcore 0.17.3
httpx 0.24.0
huggingface-hub 0.30.2
idna 3.10
imageio 2.37.0
importlib_resources 6.5.2
inquirerpy 0.3.4
Jinja2 3.1.6
jiter 0.9.0
joblib 1.4.2
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
kagglehub 0.3.11
kiwisolver 1.4.8
latex2mathml 3.77.0
markdown-it-py 3.0.0
markdown2 2.5.3
MarkupSafe 3.0.2
matplotlib 3.5.3
mdurl 0.1.2
mpmath 1.3.0
multidict 6.4.3
multiprocess 0.70.16
narwhals 1.32.0
networkx 3.4.2
ninja 1.11.1.4
numpy 1.26.4
nuscenes-devkit 1.1.11
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-ml-py 12.570.86
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
omegaconf 2.3.0
openai 1.74.0
opencv-python 4.11.0.86
opencv-python-headless 4.11.0.86
openpyxl 3.1.5
orjson 3.10.16
packaging 24.2
pandas 2.2.3
peft 0.10.0
pfzy 0.3.4
pillow 11.1.0
pip 25.0
platformdirs 4.3.7
portalocker 3.1.1
prompt_toolkit 3.0.51
propcache 0.3.1
protobuf 5.29.4
psutil 7.0.0
py-cpuinfo 9.0.0
pyarrow 19.0.1
pycocotools 2.0.8
pydantic 2.10.6
pydantic_core 2.27.2
pydub 0.25.1
Pygments 2.19.1
pynvml 12.0.0
pyparsing 3.2.3
pyquaternion 0.9.9
python-dateutil 2.9.0.post0
python-dotenv 1.1.0
python-multipart 0.0.20
pytz 2025.2
PyYAML 6.0.2
referencing 0.36.2
regex 2024.11.6
requests 2.32.3
rich 13.9.4
rpds-py 0.24.0
ruff 0.11.2
safetensors 0.5.3
scikit-learn 1.2.2
scipy 1.15.2
semantic-version 2.10.0
sentencepiece 0.1.99
sentry-sdk 2.24.1
setproctitle 1.3.5
setuptools 75.8.0
Shapely 1.8.5.post1
shellingham 1.5.4
shortuuid 1.0.13
six 1.17.0
smmap 5.0.2
sniffio 1.3.1
starlette 0.46.1
sty 1.0.6
svgwrite 1.4.3
sympy 1.13.1
tabulate 0.9.0
termcolor 3.0.1
threadpoolctl 3.6.0
tiktoken 0.8.0
timeout-decorator 0.5.0
timm 0.6.13
tokenizers 0.21.1
tomlkit 0.12.0
torch 2.5.0
torchvision 0.20.0
tqdm 4.67.1
transformers 4.51.2
triton 3.1.0
trl 0.16.1
typer 0.15.2
typing_extensions 4.12.2
tzdata 2025.2
urllib3 2.3.0
uvicorn 0.34.0
validators 0.34.0
wandb 0.19.8
wavedrom 2.0.3.post3
wcwidth 0.2.13
websockets 11.0.3
wheel 0.45.1
XlsxWriter 3.2.2
xxhash 3.5.0
yarl 1.19.0
```
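The three versions most relevant to this report can be double-checked with a quick snippet (plain sanity check, nothing issue-specific):

```python
import torch
import torchvision
import transformers

# Confirm the versions quoted in System Info above.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```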
Who can help?
No response
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch
import torch._dynamo

torch._dynamo.config.suppress_errors = True
torch._dynamo.config.capture_scalar_outputs = False


def get_modules():
    model_id = "/data/joseph/.cache/hub/models--meta-llama--Llama-4-Scout-17B-16E-Instruct/snapshots/7dab2f5f854fe665b6b2f1eccbd3c48e5f627ad8"  # "meta-llama/Llama-4-Scout-17B-16E-Instruct"
    # official setup args
    processor = AutoProcessor.from_pretrained(model_id)
    model = Llama4ForConditionalGeneration.from_pretrained(
        model_id,
        attn_implementation="flex_attention",
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
    return model, processor


if __name__ == '__main__':
    from huggingface_hub import login
    import os

    # login to HF to load llama4
    # login(os.environ.get("HF_TOKEN"))
    model, processor = get_modules()

    url1 = "/data/joseph/llama4_inference/playground/rabbit.jpg"
    url2 = "/data/joseph/llama4_inference/playground/cat_style_layout.png"
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "path": url1},
                {"type": "image", "path": url2},
                {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
            ]
        },
    ]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    # bug happens here!
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
    )
    response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
    print(response)
    print(outputs[0])
```
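To narrow down whether the failure comes from `flex_attention` itself rather than the Llama4 modeling code, here is a minimal isolation test I put together (a sketch; the tensor shapes are arbitrary placeholders, not the model's real dimensions):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Arbitrary placeholder shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(1, 2, 128, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 2, 128, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 2, 128, 64, device="cuda", dtype=torch.bfloat16)

# fullgraph=True makes dynamo raise on any graph break / unsupported op
# instead of silently falling back, so the offending op shows up directly.
compiled = torch.compile(flex_attention, fullgraph=True)
print(compiled(q, k, v).shape)
```

If this compiles cleanly, the unsupported op presumably comes from something in the Llama4 forward pass rather than from `flex_attention` itself.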
For the dynamo config, I tried several combinations, but none of them works:

- `torch._dynamo.config.suppress_errors = True` and `torch._dynamo.config.capture_scalar_outputs = False`
- Both `True` (this combination seems to raise the error at a very low level... orz)
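As a possible workaround rather than a fix (my own suggestion, not part of the official snippet): loading the model with the `sdpa` attention implementation should avoid the flex_attention/dynamo path entirely:

```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)

# "sdpa" routes attention through torch.nn.functional.scaled_dot_product_attention,
# which does not require a dynamo-compiled kernel the way flex_attention does.
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```

That only sidesteps the bug, though; flex_attention is the documented setup for Llama4, so it would be good to know why dynamo rejects it here.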
Expected behavior
The official code snippet should run to completion and print the model's response.