TypeError: Missing a required argument with thunder.jit in NeMo SD ResBlock

Open athitten opened this issue 1 year ago • 5 comments

🐛 Bug

Adding thunder.jit to ResBlock in the UNet stage of NeMo SD raises an error. Judging from the ResBlock call in the NeMo code, the class is called correctly with the right arguments, so it is unclear why thunder raises this error.

Encountered exception TypeError: missing a required argument: 'emb' while tracing ResBlock

Stack trace of the error can be found here: resblock_error.log
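
For readers unfamiliar with this message: its wording matches what CPython's inspect.Signature.bind raises when a required parameter is not supplied. The sketch below only demonstrates that wording. The forward function is a hypothetical stand-in that borrows the parameter names x and emb, and whether thunder reaches this exact code path in the failing run is an assumption.

```python
import inspect


def forward(self, x, emb):
    """Hypothetical stand-in; only the parameter names mirror the failing call."""
    return x


sig = inspect.signature(forward)


def try_bind(*args, **kwargs):
    """Return the TypeError text from binding the arguments, or None on success."""
    try:
        sig.bind(*args, **kwargs)
    except TypeError as exc:
        return str(exc)
    return None


print(try_bind(object(), "x"))         # emb omitted
print(try_bind(object(), "x", "emb"))  # all arguments supplied
```

If the message does come from signature binding, the call that reached the traced function carried only x, regardless of what the outermost call site passed.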

To Reproduce

Steps to reproduce the behavior:

  1. Pull the appropriate NeMo docker image

  2. Apply the git patch: resblock.patch

  3. Run Stable Diffusion with the command:

python examples/multimodal/text_to_image/stable_diffusion/sd_train.py trainer.precision=16 trainer.num_nodes=1 trainer.devices=1 ++exp_manager.max_time_per_run=00:00:03:00 trainer.max_steps=20 model.micro_batch_size=1 model.global_batch_size=1 model.data.synthetic_data=True exp_manager.exp_dir=/workspace/TestData/multimodal/stable_diffusion_train model.inductor=False model.cond_stage_config._target_=nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenCLIPEmbedder ++model.cond_stage_config.version=openai/clip-vit-large-patch14 ++model.cond_stage_config.max_length=77 ~model.cond_stage_config.restore_from_path ~model.cond_stage_config.freeze ~model.cond_stage_config.layer model.unet_config.from_pretrained=null model.first_stage_config.from_pretrained=null model.unet_config.use_flash_attention=False model.unet_config.attention_resolutions=\[1\] model.unet_config.channel_mult=\[1\]

cc: @tfogal


athitten avatar Jun 06 '24 20:06 athitten

The same error comes from adding thunder.jit to the subsequent ResBlock here

athitten avatar Jun 06 '24 21:06 athitten

Yikes, this is deep in the interpreter:

  File "/workspace/software/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_train.py", line 80, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
...
  File "/workspace/software/lightning-thunder/thunder/__init__.py", line 473, in get_computation_and_inputs
    jit_results: TraceResults = interpreter(
  File "/workspace/software/lightning-thunder/thunder/__init__.py", line 190, in _general_frontend
    return thunder_general_jit(fn, args, kwargs, sharp_edges=sharp_edges, record_history=record_history)
  File "/workspace/software/lightning-thunder/thunder/core/jit_ext.py", line 1529, in thunder_general_jit
    result = jfn(*args, **kwargs)
  File "/workspace/software/lightning-thunder/thunder/core/interpreter.py", line 6692, in fn_
    raise InterpreterError(msg) from e
thunder.core.interpreter.InterpreterError: Encountered exception TypeError: missing a required argument: 'emb' while tracing [snip]

Since it complains about emb, my first thought is that it's related to the embedding layers ("emb_layers"):

... while tracing ResBlock(
  (in_layers): Sequential(
    (0): GroupNorm(32, 320, eps=1e-05, affine=True)
    (1): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (h_upd): Identity()
  (x_upd): Identity()
  (emb_layers): Sequential(
    (0): SiLU()
    (1): Linear(in_features=1280, out_features=320, bias=True)
  )
  (out_layers): Sequential(
    (0): GroupNorm(32, 320, eps=1e-05, affine=True)
    (1): Dropout(p=0, inplace=False)
    (2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
  (skip_connection): Identity()
):

but neither of those ops takes an emb parameter.

Anyway, this is deep in code that @t-vi worked on, so I'm going to need to tag him for help. Tom, can you help us identify what went awry here? We're happy to change the input to work around it, but it's not clear what change would help thunder here.

tfogal avatar Jun 06 '24 22:06 tfogal

triage review: @athitten, can we provide a minimal example for this issue that @t-vi, who works at Lightning AI, can use to reproduce this failure?

mruberry avatar Jun 10 '24 19:06 mruberry

Would it be possible to re-run this? We have had a lot of fixes around the unpacking of signatures and I would hope that this is fixed or we get a better error message now.

t-vi avatar Aug 06 '24 06:08 t-vi

Would it be possible to re-run this?

Thanks for bringing this up. I reran this using af5e9d6b62f8bafb845c8e257bef701547498b1f. The error is still the same, but the traceback has changed a bit:

Error executing job with overrides: ['trainer.precision=bf16', 'trainer.num_nodes=1', 'trainer.devices=1', '++exp_manager.max_time_per_run=00:00:03:00', 'trainer.max_steps=20', 'model.micro_batch_size=1', 'model.global_batch_size=1', 'model.optim.name=megatron_fused_adam', 'model.data.synthetic_data=True', 'exp_manager.exp_dir=./foo-sd-train', 'model.inductor=False', 'model.cond_stage_config._target_=nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenCLIPEmbedder', '++model.cond_stage_config.version=openai/clip-vit-large-patch14', '++model.cond_stage_config.max_length=77', '++model.thunder=False', 'model.unet_config.from_pretrained=null', 'model.first_stage_config.from_pretrained=null', 'model.unet_config.use_flash_attention=False', 'model.unet_config.attention_resolutions=[1]', 'model.unet_config.channel_mult=[1]', 'model.ddp_overlap=False']
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/tfogal/dev/nemo/examples/multimodal/text_to_image/stable_diffusion/sd_train.py", line 117, in <module>
[rank0]:     main()
[rank0]:   File "/home/tfogal/dev/nemo/nemo/core/config/hydra_runner.py", line 129, in wrapper
[rank0]:     _run_hydra(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank0]:     _run_app(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank0]:     run_and_report(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank0]:     raise ex
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank0]:     return func()
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank0]:     lambda: hydra.run(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
[rank0]:     _ = ret.return_value
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
[rank0]:     raise self._return_value
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
[rank0]:     ret.return_value = task_function(task_cfg)
[rank0]:   File "/home/tfogal/dev/nemo/examples/multimodal/text_to_image/stable_diffusion/sd_train.py", line 112, in main
[rank0]:     trainer.fit(model)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
[rank0]:     results = self._run_stage()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
[rank0]:     self.fit_loop.run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
[rank0]:     self.advance()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
[rank0]:     self.epoch_loop.run(self._data_fetcher)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
[rank0]:     self.advance(data_fetcher)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 250, in advance
[rank0]:     batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 190, in run
[rank0]:     self._optimizer_step(batch_idx, closure)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 268, in _optimizer_step
[rank0]:     call._call_lightning_module_hook(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 159, in _call_lightning_module_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 1263, in optimizer_step
[rank0]:     super().optimizer_step(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1308, in optimizer_step
[rank0]:     optimizer.step(closure=optimizer_closure)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 153, in step
[rank0]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 270, in optimizer_step
[rank0]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 238, in optimizer_step
[rank0]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/amp.py", line 74, in optimizer_step
[rank0]:     return super().optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 122, in optimizer_step
[rank0]:     return optimizer.step(closure=closure, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/common/callbacks/ema.py", line 250, in step
[rank0]:     loss = self.optimizer.step(closure)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 136, in wrapper
[rank0]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 478, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/core/optim/megatron_fused_adam.py", line 58, in step
[rank0]:     return super().step(closure=closure, grad_scaler=grad_scaler)
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/apex/optimizers/fused_adam.py", line 140, in step
[rank0]:     loss = closure()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 108, in _wrap_closure
[rank0]:     closure_result = closure()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 144, in __call__
[rank0]:     self._result = self.closure(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 129, in closure
[rank0]:     step_output = self._step_fn()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 317, in _training_step
[rank0]:     training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 311, in _call_strategy_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 389, in training_step
[rank0]:     return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 640, in __call__
[rank0]:     wrapper_output = wrapper_module(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1746, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1640, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1456, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1746, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 633, in wrapped_forward
[rank0]:     out = method(*_args, **_kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/utils/model_utils.py", line 434, in wrap_training_step
[rank0]:     output_dict = wrapped(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/ddpm.py", line 1812, in training_step
[rank0]:     loss_mean, loss_dict = self.fwd_bwd_step(dataloader_iter, False)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/ddpm.py", line 1745, in fwd_bwd_step
[rank0]:     losses_reduced_per_micro_batch = fwd_bwd_function(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 460, in forward_backward_no_pipelining
[rank0]:     output_tensor, num_tokens = forward_step(
[rank0]:   File "/home/tfogal/env/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 266, in forward_step
[rank0]:     output_tensor, loss_func = forward_step_func(data_iterator, model)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/ddpm.py", line 1939, in fwd_output_and_loss_func
[rank0]:     loss, loss_dict = model(x, c)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1746, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/ddpm.py", line 1015, in forward
[rank0]:     return self.p_losses(x, c, t, *args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/ddpm.py", line 1165, in p_losses
[rank0]:     model_output = self.apply_model(x_noisy, t, cond)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/ddpm.py", line 1136, in apply_model
[rank0]:     x_recon = self.model(x_noisy, t, **cond)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1746, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/models/text_to_image/stable_diffusion/ldm/ddpm.py", line 2340, in forward
[rank0]:     out = self.diffusion_model(x, t, context=cc)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1746, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/openaimodel.py", line 1323, in forward
[rank0]:     out = self._forward(x, timesteps, context, y, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/openaimodel.py", line 1308, in _forward
[rank0]:     h = module(h, emb, context)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1746, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/nemo/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/openaimodel.py", line 151, in forward
[rank0]:     x = layer(x)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1746, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/core/module.py", line 61, in forward
[rank0]:     res = self._forward_fn(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/__init__.py", line 743, in fn_
[rank0]:     cache_entry, inps, pro_to_epi = get_computation_and_inputs(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/__init__.py", line 223, in cache_info_wrapper
[rank0]:     res = fn(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/__init__.py", line 522, in get_computation_and_inputs
[rank0]:     jit_results: TraceResults = interpreter(
[rank0]:   File "/home/tfogal/dev/thunder/thunder/__init__.py", line 211, in _general_frontend
[rank0]:     return thunder_general_jit(fn, args, kwargs, sharp_edges=sharp_edges, record_history=record_history)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/core/jit_ext.py", line 1767, in thunder_general_jit
[rank0]:     result = jfn(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/core/interpreter.py", line 7088, in fn_
[rank0]:     raise e
[rank0]:   File "/home/tfogal/dev/thunder/thunder/core/interpreter.py", line 7056, in fn_2
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/core/interpreter.py", line 6379, in _impl
[rank0]:     return fn.__func__(fn.__self__, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/core/interpreter.py", line 6379, in _impl
[rank0]:     return fn.__func__(fn.__self__, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1746, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tfogal/dev/thunder/thunder/core/interpreter.py", line 6379, in _impl
[rank0]:     return fn.__func__(fn.__self__, *args, **kwargs)
[rank0]: TypeError: <function ResBlock.forward at 0x7ec40ef367a0>() is missing 1 required positional arguments: emb

tfogal avatar Aug 07 '24 17:08 tfogal
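
One possible reading of the traceback above, offered as a hypothesis rather than a confirmed diagnosis: the caller in openaimodel.py invokes module(h, emb, context), but the container frame at line 151 runs x = layer(x) immediately before entering thunder/core/module.py. That pattern would be consistent with a container that picks the call form by layer type, where the jitted wrapper no longer passes the type check. The sketch below is plain Python with no NeMo or thunder imports. TimestepBlock, JitWrapper, and run_layers are hypothetical names for illustration and are not taken from either code base.

```python
class TimestepBlock:
    """Hypothetical marker base class for layers called as layer(x, emb)."""


class ResBlock(TimestepBlock):
    """Toy block whose call requires both x and emb."""

    def __call__(self, x, emb):
        return x + emb


class JitWrapper:
    """Hypothetical stand-in for a compiled-module wrapper.

    It forwards calls to the wrapped layer but does not inherit from
    TimestepBlock, so isinstance checks against TimestepBlock fail.
    """

    def __init__(self, inner):
        self.inner = inner

    def __call__(self, *args, **kwargs):
        return self.inner(*args, **kwargs)


def run_layers(layers, x, emb):
    """Type-based dispatch: only TimestepBlock instances receive emb."""
    for layer in layers:
        if isinstance(layer, TimestepBlock):
            x = layer(x, emb)
        else:
            x = layer(x)
    return x


def describe(layers):
    """Run the layers and report either the result or the TypeError text."""
    try:
        return run_layers(layers, 1, 2)
    except TypeError as exc:
        return f"TypeError: {exc}"


print(describe([ResBlock()]))              # unwrapped block is called with (x, emb)
print(describe([JitWrapper(ResBlock())]))  # wrapper falls into the layer(x) branch
```

If the NeMo container does dispatch this way, it would explain why the outer call site looks correct while the wrapped ResBlock still receives only x. Confirming it would require checking the branch around line 151 of openaimodel.py.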