Flux.2 with LoRA error: "mul_cuda" not implemented for 'Float8_e4m3fn'
Custom Node Testing
- [ ] I have tried disabling custom nodes and the issue persists (see how to disable custom nodes if you need help)
Expected Behavior
The image should generate using the LoRA.
Actual Behavior
When adding a "Lora loader only" after the Load Diffusion model, i get this error: "mul_cuda" not implemented for 'Float8_e4m3fn'
Steps to Reproduce
Using this workflow works: https://comfyanonymous.github.io/ComfyUI_examples/flux2/
Adding the "LoraLoaderModelOnly" node after the "Load Diffusion Model" node in that workflow gives the error.
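For reference, the underlying limitation is reproducible outside ComfyUI: PyTorch's float8_e4m3fn dtype has no elementwise mul kernel, so anything that multiplies a raw fp8 tensor by a scale fails. A minimal sketch in plain PyTorch (not ComfyUI code), assuming a CUDA build with float8 support:

```python
import torch

# Float8 tensors must be upcast before arithmetic; a direct mul has no kernel.
w = torch.randn(4, device="cuda").to(torch.float8_e4m3fn)
scale = torch.tensor(2.0, device="cuda")

try:
    _ = w * scale
except RuntimeError as e:
    print(e)  # "mul_cuda" not implemented for 'Float8_e4m3fn'

# The working pattern: upcast to a supported dtype first, then scale.
print(w.to(torch.bfloat16) * scale)
```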
Debug Logs
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
Using MixedPrecisionOps for text encoder: 210 quantized layers
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded completely; 30385.05 MB usable, 17180.59 MB loaded, full load: True
Found quantization metadata (version 1.0)
Detected mixed precision quantization: 128 layers quantized
Using mixed precision operations: 128 quantized layers
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Warning: TAESD previews enabled, but could not find models/vae_approx/None
Requested to load Flux2
QuantizedTensor: Unhandled operation aten.add_.Tensor, falling back to dequantization. kwargs={}
ERROR lora diffusion_model.single_blocks.9.linear1.weight Promotion for Float8 Types is not supported, attempted to promote BFloat16 and Float8_e4m3fn
QuantizedTensor: Unhandled operation aten.slice.Tensor, falling back to dequantization. kwargs={}
!!! Exception during processing !!! "mul_cuda" not implemented for 'Float8_e4m3fn'
Traceback (most recent call last):
File "/home/ubuntuai/ComfyUI/execution.py", line 510, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
File "/home/ubuntuai/ComfyUI/execution.py", line 324, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
File "/home/ubuntuai/ComfyUI/execution.py", line 298, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "/home/ubuntuai/ComfyUI/execution.py", line 286, in process_inputs
result = f(**inputs)
File "/home/ubuntuai/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 835, in sample
samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
File "/home/ubuntuai/ComfyUI/comfy/samplers.py", line 1035, in sample
output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
File "/home/ubuntuai/ComfyUI/comfy/patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
File "/home/ubuntuai/ComfyUI/comfy/samplers.py", line 984, in outer_sample
self.inner_model, self.conds, self.loaded_models = comfy.sampler_helpers.prepare_sampling(self.model_patcher, noise.shape, self.conds, self.model_options)
File "/home/ubuntuai/ComfyUI/comfy/sampler_helpers.py", line 130, in prepare_sampling
return executor.execute(model, noise_shape, conds, model_options=model_options)
File "/home/ubuntuai/ComfyUI/comfy/patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
File "/home/ubuntuai/ComfyUI/comfy/sampler_helpers.py", line 138, in _prepare_sampling
comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required + inference_memory, minimum_memory_required=minimum_memory_required + inference_memory)
File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 701, in load_models_gpu
loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 506, in model_load
self.model_use_more_vram(use_more_vram, force_patch_weights=force_patch_weights)
File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 536, in model_use_more_vram
return self.model.partially_load(self.device, extra_memory, force_patch_weights=force_patch_weights)
File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 944, in partially_load
raise e
File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 941, in partially_load
self.load(device_to, lowvram_model_memory=current_used + extra_memory, force_patch_weights=force_patch_weights, full_load=full_load)
File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 754, in load
self.patch_weight_to_device(key, device_to=device_to)
File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 630, in patch_weight_to_device
out_weight = comfy.float.stochastic_rounding(out_weight, weight.dtype, seed=string_to_seed(key))
File "/home/ubuntuai/ComfyUI/comfy/float.py", line 64, in stochastic_rounding
output[i:i+slice_size].copy_(manual_stochastic_round_to_float8(value[i:i+slice_size], dtype, generator=generator))
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 216, in __torch_dispatch__
return cls._dequant_and_fallback(func, args, kwargs)
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 227, in _dequant_and_fallback
new_args = dequant_arg(args)
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 224, in dequant_arg
return type(arg)(dequant_arg(a) for a in arg)
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 224, in <genexpr>
return type(arg)(dequant_arg(a) for a in arg)
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 222, in dequant_arg
return arg.dequantize()
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 196, in dequantize
return LAYOUTS[self._layout_type].dequantize(self._qdata, **self._layout_params)
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 421, in dequantize
return plain_tensor * scale
RuntimeError: "mul_cuda" not implemented for 'Float8_e4m3fn'
Other
LoRA made with ai-toolkit. The samples generated with ai-toolkit work fine.
Tested with an RTX 5090 and an RTX PRO 6000, driver version 575.57.08, CUDA version 12.9. Torch version in the Python venv: torch 2.8.0.dev20250415+cu128.
I have the same problem and would like to ask for help.
Update: it seems something has been fixed by https://github.com/comfyanonymous/ComfyUI/pull/10899, but now I get another error:
Expected size for first two dimensions of batch2 tensor to be: [64, 128] but got: [64, 32].
Disabling the preview fixes this error, though. Tested with both the fp8 and the full model.
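(For context: that message is torch.bmm's generic batch-shape check, so it is consistent with a preview decoder expecting a different latent channel count than the model produces; that last part is my assumption. A minimal sketch of the same error in isolation:)

```python
import torch

# batch1 expects batch2's first two dims to be [64, 128]; handing it
# [64, 32] reproduces the exact message seen in the preview error.
a = torch.randn(64, 16, 128)
b = torch.randn(64, 32, 8)
try:
    torch.bmm(a, b)
except RuntimeError as e:
    print(e)  # Expected size for first two dimensions of batch2 tensor
              # to be: [64, 128] but got: [64, 32].
```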
@RodriMora I just found out that this error is caused by the VHS animated preview during sampling. If you disable it, it works fine.
Thank you for posting this solution! Worked for me after getting that same error.
@Kosinkadink @AustinMroz
I pushed a fix for this last Wednesday. Are you still seeing the issue with the latest version of VHS?