[NeVa] [rank0]: TypeError: matmul(): argument 'input' (position 1) must be Tensor, not TensorProxy
While preparing the eager and dynamo benchmarks using the code from the fork https://github.com/tfogal/NeMo, I get errors in the dynamo case.
🐛 Bug
After fixing #1187, NeMo NeVa with the dynamo backend throws a new error. With
model.model = torch.compile(backend=thunder_backend, dynamic=False)(model.model)
it throws:
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 359, in __call__
[rank0]: raise e.with_traceback(None) # noqa: B904
[rank0]: TypeError: matmul(): argument 'input' (position 1) must be Tensor, not TensorProxy
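For reference, thunder_backend here is the Thunder dynamo backend; below is a minimal sketch of how it is presumably constructed (the exact setup in the fork may differ, and ThunderCompiler is my assumption rather than code quoted from the fork):

import torch
import torch.nn as nn
from thunder.dynamo import ThunderCompiler  # assumed entry point for the dynamo backend

thunder_backend = ThunderCompiler()
module = nn.Linear(8, 8)  # stand-in for model.model in neva_pretrain.py
module = torch.compile(backend=thunder_backend, dynamic=False)(module)
module(torch.randn(2, 8))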
To Reproduce
Steps to reproduce the behavior:
- Clone https://github.com/tfogal/NeMo
- Use the latest lightning-thunder container
- Install additionally:
python3 -m pip install --no-deps huggingface-hub==0.23.2
python3 -m pip install --no-deps transformers==4.40.2
python3 -m pip install -e .
python3 -m pip install git+https://github.com/NVIDIA/Megatron-LM.git@6dd3a1afa4e26d4d27e58d1e83aaa6ee6e36b477
- Execute:
rm -f /tmp/graph*.log.txt
export HYDRA_FULL_ERROR=1
export THUNDER_ANNOTATE_TRACES=1
export NEMO_THUNDER_NEVA=dynamo
python3 \
./examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
trainer.precision=bf16-mixed \
model.megatron_amp_O2=True \
model.mcore_gpt=False \
trainer.num_nodes=1 \
trainer.devices=1 \
trainer.val_check_interval=10 \
trainer.limit_val_batches=5 \
trainer.log_every_n_steps=1 \
++exp_manager.max_time_per_run=00:00:03:00 \
trainer.max_steps=20 \
model.micro_batch_size=2 \
model.global_batch_size=4 \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
exp_manager.create_checkpoint_callback=False \
model.data.data_path=./data/multimodal/tiny-neva/dummy.json \
model.data.image_folder=./data/multimodal/tiny-neva/images \
model.tokenizer.library=sentencepiece \
model.tokenizer.model=./data/multimodal/tiny-neva/tokenizer_add_special.model \
model.num_layers=2 \
model.hidden_size=5120 \
model.ffn_hidden_size=13824 \
model.num_attention_heads=40 \
model.normalization=rmsnorm \
model.data.num_workers=0 \
model.data.conv_template=llama_2 \
model.mm_cfg.vision_encoder.from_pretrained=openai/clip-vit-large-patch14 \
model.mm_cfg.llm.from_pretrained=null \
model.use_flash_attention=false \
exp_manager.exp_dir=./nemo_neva
Expected behavior
The pretraining should run smoothly.
Environment
As in the container
Additional context
Attaching the full log of the error: dynamo_error_2nd_Oct.txt
This comes from torch.ops.higher_order.autograd_function_apply, which wants #1134.
(As mentioned elsewhere, I think a more timely fix would be to follow the torch.autograd.Function lookaside pattern and acquire the fw and bw via _interpret_call at tracing time.)
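For illustration, here is a hedged, self-contained sketch of the pattern that appears to trigger the error (not the actual NeVa code; ThunderCompiler is an assumption about how thunder_backend is built). A custom torch.autograd.Function whose forward calls torch.matmul is compiled through the dynamo backend; dynamo lowers the .apply call into torch.ops.higher_order.autograd_function_apply, and if the captured fwd callable is then invoked directly with TensorProxy inputs instead of being interpreted, the plain torch.matmul call fails as reported above:

import torch
from thunder.dynamo import ThunderCompiler  # assumed backend, see the setup sketch above

class MatmulFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return torch.matmul(x, w)

    @staticmethod
    def backward(ctx, grad):
        x, w = ctx.saved_tensors
        return torch.matmul(grad, w.t()), torch.matmul(x.t(), grad)

def fn(x, w):
    # dynamo lowers this .apply into torch.ops.higher_order.autograd_function_apply
    return MatmulFn.apply(x, w)

cfn = torch.compile(fn, backend=ThunderCompiler(), dynamic=False)
x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 8, requires_grad=True)
cfn(x, w)
# expected on affected versions:
# TypeError: matmul(): argument 'input' (position 1) must be Tensor, not TensorProxy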
I noticed that if we change the test case here:
https://github.com/Lightning-AI/lightning-thunder/blob/d6455982d5ca6815efd3d7dc0341b4f945f99be5/thunder/tests/test_jit_general.py#L1206-L1217 to use torch.sin(x), it gives a similar error, so it could serve as a repro:
"TypeError: sin(): argument 'input' (position 1) must be Tensor, not TensorProxy"