🐛[BUG]: RuntimeError: Input type (c10::Half) and bias type (float) should be the same
Version
Latest from main branch
On which installation method(s) does this occur?
Source
Describe the issue
Following this PR, the CorrDiff example has an error during generation (see traceback below).
The weights for both the regression and diffusion models are also new, trained after this PR.
[2025-05-06 17:09:31,605][generate][INFO] - Using dataset: hrrr_mini
[2025-05-06 17:09:48,205][generate][INFO] - Patch-based training disabled
[2025-05-06 17:09:48,205][generate][INFO] - Loading residual network from "/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_diffusion_checkpoint_path/EDMPrecondSuperResolution.0.8000000.mdlus"...
[2025-05-06 17:09:49,114][generate][INFO] - Loading network from "/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_regression_checkpoint_path/UNet.0.2000128.mdlus"...
[2025-05-06 17:09:49,426][generate][INFO] - Generating images, saving results to /mnt/azureml/cr/j/.../cap/data-capability/wd/output_filename/sample.nc...
[2025-05-06 17:09:50,195][generate][INFO] - starting index: 0
/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py:701: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with amp.autocast(enabled=self.amp_mode):
Error executing job with overrides: ['++dataset.data_path=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_data_path/hrrr_mini_train.nc', '++dataset.stats_path=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_stats_path/stats.json', '++generation.io.reg_ckpt_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_regression_checkpoint_path/UNet.0.2000128.mdlus', '++generation.io.res_ckpt_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_diffusion_checkpoint_path/EDMPrecondSuperResolution.0.8000000.mdlus', '++generation.io.output_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/output_filename/sample.nc']
Traceback (most recent call last):
File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 390, in <module>
main()
File "/usr/local/lib/python3.12/dist-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 344, in main
image_out = generate_fn()
^^^^^^^^^^^^^
File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 192, in generate_fn
image_reg = regression_step(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/utils/corrdiff/utils.py", line 84, in regression_step
x = net(x=x_hat[0:1], img_lr=img_lr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/unet.py", line 165, in forward
F_x = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/song_unet.py", line 703, in forward
return super().forward(x, noise_labels, class_labels, augment_labels)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/song_unet.py", line 450, in forward
x = block(x, emb)
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py", line 703, in forward
x = self.proj(attn.reshape(*x.shape)).add_(x)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py", line 285, in forward
x = torch.nn.functional.conv2d(x, w, padding=w_pad, bias=b)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
Minimum reproducible example
Default CorrDiff example
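For reference, the dtype mismatch itself can be reproduced outside CorrDiff with a few lines of PyTorch. This is a minimal sketch (not part of the original report; requires a CUDA device):

```python
import torch

# conv2d requires input, weight, and bias to share a dtype. A half-precision
# input (as produced inside an autocast region) combined with a float32 bias
# raises the exact error from the traceback above.
x = torch.randn(1, 3, 8, 8, device="cuda", dtype=torch.half)  # fp16 input
w = torch.randn(4, 3, 3, 3, device="cuda", dtype=torch.half)  # fp16 weight
b = torch.randn(4, device="cuda", dtype=torch.float32)        # fp32 bias

torch.nn.functional.conv2d(x, w, padding=1, bias=b)
# RuntimeError: Input type (c10::Half) and bias type (float) should be the same
```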
Removing with amp.autocast(enabled=self.amp_mode): in layers.py can work around this problem, but it is probably not the best solution.
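A less invasive workaround might be to switch the layers' AMP flag off before inference instead of deleting the autocast line. This is a hypothetical sketch based on the amp_mode attribute visible in the traceback, not a verified fix:

```python
import torch

def disable_amp(net: torch.nn.Module) -> None:
    """Hypothetical workaround: force the AMP flag off on every submodule
    before generation. `amp_mode` is the attribute used in layers.py above;
    `net` stands for the loaded regression/diffusion model in generate.py."""
    for module in net.modules():
        if hasattr(module, "amp_mode"):
            module.amp_mode = False
```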
@luke-conibear thank you for reporting. We are aware of issues with CorrDiff checkpoints, and those will be addressed by #871 once it is merged. For the time being, you can downgrade to the last release 1.0.1-rc until we have a fix.
@loliverhennigh @jialusui1102 for visibility
@luke-conibear after discussion with @jialusui1102 it seems your problem is not due to checkpoints (downgrading to 1.0.1-rc should still fix your issue until we resolve this).
Could you please detail how you generated the checkpoints that you want to use in generate.py? Were they trained with the latest train.py, and which config file did you use?
Could you also confirm that you are using this config for generate.py, or whether you modified anything there?
@CharlelieLrt this is not using old checkpoints. It is all new runs and checkpoints for all steps.
Yes, I used that exact config for generation. I used the default configs from the main branch without any changes.
- config_training_hrrr_mini_regression.yaml
- config_training_hrrr_mini_diffusion.yaml
- config_generate_hrrr_mini.yaml
The exact commands I submitted were:
- Regression
python train.py --config-name=config_training_hrrr_mini_regression.yaml ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++training.hp.total_batch_size=256 ++training.hp.batch_size_per_gpu=64 ++training.perf.dataloader_workers=1 ++training.io.checkpoint_dir=${{outputs.checkpoint_dir}} ++hydra.run.dir=${{outputs.output_dir}}
- Diffusion
python train.py --config-name=config_training_hrrr_mini_diffusion.yaml ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++training.hp.total_batch_size=256 ++training.hp.batch_size_per_gpu=64 ++training.perf.dataloader_workers=1 ++training.io.regression_checkpoint_path=${{inputs.regression_checkpoint_path}} ++training.io.checkpoint_dir=${{outputs.checkpoint_dir}} ++hydra.run.dir=${{outputs.output_dir}}
- Generation
python generate.py --config-name=config_generate_hrrr_mini.yaml ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++generation.io.reg_ckpt_filename=${{inputs.regression_checkpoint_path}} ++generation.io.res_ckpt_filename=${{inputs.diffusion_checkpoint_path}} ++generation.io.output_filename=${{outputs.output_filename}} ++hydra.run.dir=${{outputs.output_dir}}
Regression runs okay, taking the same time as before the PR. Diffusion runs, though the non-patched version now takes double the time to complete. Generation has the error above.
@luke-conibear thank you for the details.
> Generation has the error above.
This was due to keeping AMP enabled in inference, which shouldn't be the case. It should be fixed in #882. Let me know if you still encounter this issue.
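For context, the usual pattern (and presumably what the fix enforces; an illustrative sketch, not the actual #882 diff) is to confine AMP to training and run inference in full precision:

```python
import torch

net = torch.nn.Conv2d(3, 4, kernel_size=3, padding=1).cuda()  # stand-in model
x = torch.randn(1, 3, 8, 8, device="cuda")

# Training-style call: mixed precision via an explicit autocast region.
with torch.autocast("cuda", dtype=torch.float16):
    y_train = net(x)  # conv runs in fp16 under autocast

# Inference: no autocast region, so input, weights, and bias all stay fp32
# and no Half/float mismatch can occur.
net.eval()
with torch.no_grad():
    y_infer = net(x)  # plain fp32 forward pass
```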
> Diffusion runs, though the non-patched version now takes double the time to complete.
We were not able to reproduce this. At least, the runtime per forward pass that we measured during training is consistent both with the regression model (since the regression and diffusion models share the same architecture, their forward-pass runtimes should be comparable) and with the pre-PR diffusion model.
Could you please share these details:
- Are you referring to overall runtime, runtime per iteration, or only the forward pass?
- Which commit are you using as a reference in your "double the time" comparison?
- Are you using DDP, and if so how many GPUs are you using?
@CharlelieLrt Thanks for the quick response.
Unfortunately, yes, the generation RuntimeError is still there.
For the timing comment, I was confused in my comparisons. Sorry for wasting time there.
My mistake was that the previous run used 4 GPUs, while the new run used 2 GPUs. So double the time for half the GPUs makes sense.
@luke-conibear I am not able to reproduce the RuntimeError with the latest commit. To help me troubleshoot this, could you please:
- Give me the commit hash that you are using for the entire pipeline (i.e. regression training, diffusion training, and generation). Please make sure that you use the same commit for all of them.
- Give me the command that you use to run train.py for both regression and diffusion training, including the config and any hydra overrides.
- The number and type of GPUs that you are using for training regression and diffusion.
- The command that you use to run generate.py, including the config and any hydra overrides.
- The number and type of GPUs that you are using for generation.
Thanks for your help
- I used this recent commit for all steps
- Commands:

# Regression
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py --config-name=config_training_hrrr_mini_regression.yaml model=regression ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++training.hp.total_batch_size=2560 ++training.hp.batch_size_per_gpu=640 ++training.perf.dataloader_workers=1 ++training.io.checkpoint_dir=${{outputs.checkpoint_dir}} ++hydra.run.dir=${{outputs.output_dir}}

# Diffusion
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py --config-name=config_training_hrrr_mini_diffusion.yaml model=diffusion ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++training.hp.total_batch_size=2560 ++training.hp.batch_size_per_gpu=640 ++training.perf.dataloader_workers=1 ++training.io.regression_checkpoint_path=${{inputs.regression_checkpoint_path}} ++training.io.checkpoint_dir=${{outputs.checkpoint_dir}} ++hydra.run.dir=${{outputs.output_dir}}

# Generation
python generate.py --config-name=config_generate_hrrr_mini.yaml generation=non_patched ++dataset.data_path=${{inputs.data_path}} ++dataset.stats_path=${{inputs.stats_path}} ++generation.io.reg_ckpt_filename=${{inputs.regression_checkpoint_path}} ++generation.io.res_ckpt_filename=${{inputs.diffusion_checkpoint_path}} ++generation.io.output_filename=${{outputs.output_filename}} ++hydra.run.dir=${{outputs.output_dir}} ++generation.has_lead_time=False ++generation.num_ensembles=2 ++generation.times=['2020-02-02T00:00:00']

- Configs are the default ones, without any changes
- All steps use Standard_NC80adis_H100_v5 on Azure ML. 2x GPUs for regression and diffusion. 1x GPU for generation.
The above information is for non-patched diffusion, as I cannot get the patched version to work.
I've tried many config/hydra variants, e.g., appending to the command:
f"model=patched_diffusion ++training.hp.patch_shape_x={patch_shape_x} ++training.hp.patch_shape_y={patch_shape_y} ++training.hp.patch_num={patch_num} "
though I always get in the logs:
Patch-based training disabled
@luke-conibear thank you for the details; we will try to reproduce your error with generate.py.
> The above information is for non-patched diffusion, as I cannot get the patched version to work.
I've tried the exact command that you provided, with the commit that you linked, and patch-based diffusion training works without problems for me. What values did you use for patch_shape_x and patch_shape_y? I suspect that you used values >= 64? FYI, the HRRR-mini dataset has images that are 64x64, so if you request patches that are greater than or equal to 64, patched training will be automatically disabled.
Note 1: currently patch_shape_x and patch_shape_y also need to be multiples of 32, so the only option for patch-based diffusion training on HRRR-mini is to set patch_shape_x = 32 and patch_shape_y = 32.
Note 2: patch-based training is designed for much larger images. It should still work on the HRRR-mini dataset, but it is not the most relevant application of patch-based diffusion.
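In pseudocode, the conditions under which patching stays enabled look roughly like this (an illustrative sketch of the two constraints described above, not the actual physicsnemo implementation):

```python
# Patches must be strictly smaller than the image and, currently,
# multiples of 32; otherwise patching is silently disabled.
IMG_SHAPE = (64, 64)  # HRRR-mini images are 64x64

def patching_enabled(patch_shape_x: int, patch_shape_y: int) -> bool:
    smaller = patch_shape_x < IMG_SHAPE[1] and patch_shape_y < IMG_SHAPE[0]
    aligned = patch_shape_x % 32 == 0 and patch_shape_y % 32 == 0
    return smaller and aligned

print(patching_enabled(64, 64))  # False -> "Patch-based training disabled"
print(patching_enabled(32, 32))  # True  -> the only valid choice on HRRR-mini
```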
@CharlelieLrt Okay, great, thanks a lot for the help.
Yes, you're right about the patch shape. I used 32 and patched diffusion works.
Then generation for patched diffusion has the same error as for non-patched. Traceback below:
/usr/local/lib/python3.12/dist-packages/physicsnemo/utils/filesystem.py:75: SyntaxWarning: invalid escape sequence '\w'
pattern = re.compile(f"{suffix}[\w-]+(/[\w-]+)?/[\w-]+@[A-Za-z0-9.]+/[\w/](.*)")
/usr/local/lib/python3.12/dist-packages/physicsnemo/launch/logging/launch.py:321: SyntaxWarning: invalid escape sequence '\.'
key = re.sub("[^a-zA-Z0-9\.\-\s\/\_]+", "", key)
/usr/local/lib/python3.12/dist-packages/physicsnemo/utils/generative/deterministic_sampler.py:53: SyntaxWarning: invalid escape sequence '\s'
"""
/usr/local/lib/python3.12/dist-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'config_generate_hrrr_mini.yaml': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
warnings.warn(msg, UserWarning)
/usr/local/lib/python3.12/dist-packages/physicsnemo/distributed/manager.py:415: UserWarning: Could not initialize using ENV, SLURM or OPENMPI methods. Assuming this is a single process job
warn(
[2025-05-14 14:05:47,457][generate][INFO] - Using dataset: hrrr_mini
[2025-05-14 14:06:04,172][generate][INFO] - Patch-based training enabled
[2025-05-14 14:06:04,172][generate][INFO] - Loading residual network from "/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_diffusion_checkpoint_path/EDMPrecondSuperResolution.0.8000000.mdlus"...
[2025-05-14 14:06:04,955][generate][INFO] - Loading network from "/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_regression_checkpoint_path/UNet.0.2001920.mdlus"...
[2025-05-14 14:06:05,240][generate][INFO] - Generating images, saving results to /mnt/azureml/cr/j/.../cap/data-capability/wd/output_filename/sample.nc...
[2025-05-14 14:06:06,021][generate][INFO] - starting index: 0
/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py:701: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with amp.autocast(enabled=self.amp_mode):
Error executing job with overrides: ['generation=patched', '++dataset.data_path=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_data_path/hrrr_mini_train.nc', '++dataset.stats_path=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_stats_path/stats.json', '++generation.io.reg_ckpt_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_regression_checkpoint_path/UNet.0.2001920.mdlus', '++generation.io.res_ckpt_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/INPUT_diffusion_checkpoint_path/EDMPrecondSuperResolution.0.8000000.mdlus', '++generation.io.output_filename=/mnt/azureml/cr/j/.../cap/data-capability/wd/output_filename/sample.nc', '++generation.has_lead_time=False', '++generation.num_ensembles=2', '++generation.times=[2020-02-02T00:00:00]', '++generation.patch_shape_x=32', '++generation.patch_shape_y=32']
Traceback (most recent call last):
File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 396, in <module>
main()
File "/usr/local/lib/python3.12/dist-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 350, in main
image_out = generate_fn()
^^^^^^^^^^^^^
File "/mnt/azureml/cr/j/.../exe/wd/generate.py", line 198, in generate_fn
image_reg = regression_step(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/utils/corrdiff/utils.py", line 84, in regression_step
x = net(x=x_hat[0:1], img_lr=img_lr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/unet.py", line 165, in forward
F_x = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/song_unet.py", line 703, in forward
return super().forward(x, noise_labels, class_labels, augment_labels)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/song_unet.py", line 450, in forward
x = block(x, emb)
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py", line 703, in forward
x = self.proj(attn.reshape(*x.shape)).add_(x)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion/layers.py", line 285, in forward
x = torch.nn.functional.conv2d(x, w, padding=w_pad, bias=b)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
> Yes, you're right about the patch shape. I used 32 and patched diffusion works.
Great to know! We will update the log messages to more clearly explain why patching is disabled in this case.
Regarding your runtime error in generate.py, @jialusui1102 identified the source of the problem (we were not properly disabling AMP in the models).
Both will be fixed once #885 is merged.
@CharlelieLrt Thanks a lot for the great help here. I confirm this is fixed.