Flux.2 with LoRA error: "mul_cuda" not implemented for 'Float8_e4m3fn'
Custom Node Testing
- [ ] I have tried disabling custom nodes and the issue persists (see how to disable custom nodes if you need help)
Expected Behavior
The image should generate using the LoRA.
Actual Behavior
When adding a "Lora loader only" after the Load Diffusion model, i get this error: "mul_cuda" not implemented for 'Float8_e4m3fn'
Steps to Reproduce
Using this workflow works: https://comfyanonymous.github.io/ComfyUI_examples/flux2/
Adding the "LoraLoaderModelOnly" node after the "Load Diffusion Model" node in that workflow gives the error.
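For reference, the underlying limitation is reproducible outside ComfyUI: PyTorch's float8_e4m3fn dtype has no elementwise mul kernel, so anything that multiplies a raw fp8 tensor by a scale fails. A minimal sketch in plain PyTorch (not ComfyUI code), assuming a CUDA build with float8 support:

```python
import torch

# Float8 tensors must be upcast before arithmetic; a direct mul has no kernel.
w = torch.randn(4, device="cuda").to(torch.float8_e4m3fn)
scale = torch.tensor(2.0, device="cuda")

try:
    _ = w * scale
except RuntimeError as e:
    print(e)  # "mul_cuda" not implemented for 'Float8_e4m3fn'

# The working pattern: upcast to a supported dtype first, then scale.
print(w.to(torch.bfloat16) * scale)
```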
Debug Logs
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
Using MixedPrecisionOps for text encoder: 210 quantized layers
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded completely; 30385.05 MB usable, 17180.59 MB loaded, full load: True
Found quantization metadata (version 1.0)
Detected mixed precision quantization: 128 layers quantized
Using mixed precision operations: 128 quantized layers
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Warning: TAESD previews enabled, but could not find models/vae_approx/None
Requested to load Flux2
QuantizedTensor: Unhandled operation aten.add_.Tensor, falling back to dequantization. kwargs={}
ERROR lora diffusion_model.single_blocks.9.linear1.weight Promotion for Float8 Types is not supported, attempted to promote BFloat16 and Float8_e4m3fn
QuantizedTensor: Unhandled operation aten.slice.Tensor, falling back to dequantization. kwargs={}
!!! Exception during processing !!! "mul_cuda" not implemented for 'Float8_e4m3fn'
Traceback (most recent call last):
File "/home/ubuntuai/ComfyUI/execution.py", line 510, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
File "/home/ubuntuai/ComfyUI/execution.py", line 324, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
File "/home/ubuntuai/ComfyUI/execution.py", line 298, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "/home/ubuntuai/ComfyUI/execution.py", line 286, in process_inputs
result = f(**inputs)
File "/home/ubuntuai/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 835, in sample
samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
File "/home/ubuntuai/ComfyUI/comfy/samplers.py", line 1035, in sample
output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
File "/home/ubuntuai/ComfyUI/comfy/patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
File "/home/ubuntuai/ComfyUI/comfy/samplers.py", line 984, in outer_sample
self.inner_model, self.conds, self.loaded_models = comfy.sampler_helpers.prepare_sampling(self.model_patcher, noise.shape, self.conds, self.model_options)
File "/home/ubuntuai/ComfyUI/comfy/sampler_helpers.py", line 130, in prepare_sampling
return executor.execute(model, noise_shape, conds, model_options=model_options)
File "/home/ubuntuai/ComfyUI/comfy/patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
File "/home/ubuntuai/ComfyUI/comfy/sampler_helpers.py", line 138, in _prepare_sampling
comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required + inference_memory, minimum_memory_required=minimum_memory_required + inference_memory)
File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 701, in load_models_gpu
loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 506, in model_load
self.model_use_more_vram(use_more_vram, force_patch_weights=force_patch_weights)
File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 536, in model_use_more_vram
return self.model.partially_load(self.device, extra_memory, force_patch_weights=force_patch_weights)
File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 944, in partially_load
raise e
File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 941, in partially_load
self.load(device_to, lowvram_model_memory=current_used + extra_memory, force_patch_weights=force_patch_weights, full_load=full_load)
File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 754, in load
self.patch_weight_to_device(key, device_to=device_to)
File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 630, in patch_weight_to_device
out_weight = comfy.float.stochastic_rounding(out_weight, weight.dtype, seed=string_to_seed(key))
File "/home/ubuntuai/ComfyUI/comfy/float.py", line 64, in stochastic_rounding
output[i:i+slice_size].copy_(manual_stochastic_round_to_float8(value[i:i+slice_size], dtype, generator=generator))
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 216, in __torch_dispatch__
return cls._dequant_and_fallback(func, args, kwargs)
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 227, in _dequant_and_fallback
new_args = dequant_arg(args)
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 224, in dequant_arg
return type(arg)(dequant_arg(a) for a in arg)
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 224, in <genexpr>
return type(arg)(dequant_arg(a) for a in arg)
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 222, in dequant_arg
return arg.dequantize()
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 196, in dequantize
return LAYOUTS[self._layout_type].dequantize(self._qdata, **self._layout_params)
File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 421, in dequantize
return plain_tensor * scale
RuntimeError: "mul_cuda" not implemented for 'Float8_e4m3fn'
Other
LoRA made with ai-toolkit. The samples generated with ai-toolkit work fine.
Tested with an RTX 5090 and an RTX PRO 6000, driver version 575.57.08, CUDA version 12.9. Torch version in the Python venv: torch 2.8.0.dev20250415+cu128.
I have the same problem and would like to ask for help.
Update: it seems something has been fixed by https://github.com/comfyanonymous/ComfyUI/pull/10899, but now I get another error:
Expected size for first two dimensions of batch2 tensor to be: [64, 128] but got: [64, 32].
Disabling the preview fixes this error, though. Tested with both the fp8 and the full model.
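(For context: that message is torch.bmm's generic batch-shape check, so it is consistent with a preview decoder expecting a different latent channel count than the model produces; that last part is my assumption. A minimal sketch of the same error in isolation:)

```python
import torch

# batch1 expects batch2's first two dims to be [64, 128]; handing it
# [64, 32] reproduces the exact message seen in the preview error.
a = torch.randn(64, 16, 128)
b = torch.randn(64, 32, 8)
try:
    torch.bmm(a, b)
except RuntimeError as e:
    print(e)  # Expected size for first two dimensions of batch2 tensor
              # to be: [64, 128] but got: [64, 32].
```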
@RodriMora I just found out that this error is caused by the VHS animated preview during sampling. If you disable it, it works fine.
Thank you for posting this solution! Worked for me after getting that same error.
@Kosinkadink @AustinMroz
I pushed a fix for this last Wednesday. Are you still seeing the issue with the latest version of VHS?