CUDA error: invalid argument
Custom Node Testing
- [ ] I have tried disabling custom nodes and the issue persists (see how to disable custom nodes if you need help)
Your question
This error occurred while using PuLID (PuLID_Flux_II).
However, it is not reliably reproducible, which makes it difficult to pin down:
- Sometimes image generation proceeds normally without the error occurring
- Sometimes, despite the error appearing, repeated attempts eventually succeed in generating an image
- Sometimes disabling the node allows normal image generation
- Sometimes disabling the node does not resolve the issue
If this is a clear bug, I would appreciate it being fixed promptly. However, if there is a way I can resolve this myself without waiting for a fix, please do let me know.
The operating environment is as follows:
- Python: 3.10.11 (within Stability Matrix)
- PyTorch: 2.9.1+cu128 (even when I set it to cu130, it reverts to cu128 for some reason upon update)
- OS: Windows 10
- VRAM: 12GB
- DRAM: 64GB
Logs
!!! Exception during processing !!! CUDA error: invalid argument
Search for `cudaErrorInvalidValue' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 515, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 329, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 303, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 291, in process_inputs
result = f(**inputs)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\nodes.py", line 1538, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\nodes.py", line 1505, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\sample.py", line 60, in sample
samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\samplers.py", line 1163, in sample
return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\samplers.py", line 1053, in sample
return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\samplers.py", line 1035, in sample
output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\patcher_extension.py", line 113, in execute
return self.wrappers[self.idx](self, *args, **kwargs)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\custom_nodes\ComfyUI_PuLID_Flux_ll\pulidflux.py", line 625, in pulid_outer_sample_wrappers_with_override
out = wrapper_executor(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, **kwargs)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\patcher_extension.py", line 105, in __call__
return new_executor.execute(*args, **kwargs)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\samplers.py", line 984, in outer_sample
self.inner_model, self.conds, self.loaded_models = comfy.sampler_helpers.prepare_sampling(self.model_patcher, noise.shape, self.conds, self.model_options)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\sampler_helpers.py", line 130, in prepare_sampling
return executor.execute(model, noise_shape, conds, model_options=model_options)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\sampler_helpers.py", line 138, in _prepare_sampling
comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required + inference_memory, minimum_memory_required=minimum_memory_required + inference_memory)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\model_management.py", line 701, in load_models_gpu
loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\model_management.py", line 506, in model_load
self.model_use_more_vram(use_more_vram, force_patch_weights=force_patch_weights)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\model_management.py", line 536, in model_use_more_vram
return self.model.partially_load(self.device, extra_memory, force_patch_weights=force_patch_weights)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\model_patcher.py", line 952, in partially_load
self.partially_unload(self.offload_device, -extra_memory, force_patch_weights=force_patch_weights)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\model_patcher.py", line 901, in partially_unload
m.to(device_to)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1371, in to
return self._apply(convert)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\ops.py", line 639, in _apply
self.register_parameter(key, torch.nn.Parameter(fn(param), requires_grad=False))
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1357, in convert
return t.to(
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\quant_ops.py", line 205, in __torch_dispatch__
return _GENERIC_UTILS[func](func, args, kwargs)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\quant_ops.py", line 321, in generic_to_dtype_layout
return _handle_device_transfer(
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\quant_ops.py", line 272, in _handle_device_transfer
new_q_data = qt._qdata.to(device=target_device)
torch.AcceleratorError: CUDA error: invalid argument
Search for `cudaErrorInvalidValue' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Other
No response
A CUDA EINVAL (invalid argument) during partial unload can be a product of using an old, incompatible version of a GGUF loader. `--disable-pinned-memory` works around it but can cost you a lot of performance. Make sure you are on the latest version of ComfyUI-GGUF if you are doing GGUF anywhere in the flow.
If it's not GGUF, we need full information on the bug, as these are hard to pinpoint: the full workflow, and the entire log pasted at least from "got prompt" onward.
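For reference, a minimal sketch of applying both the workaround flag and the debugging hint from the traceback, assuming ComfyUI is launched directly via main.py from a Windows command prompt (Stability Matrix users would add the flag under the package's launch options instead, and the exact entry point may differ):

```
REM Optional: make CUDA errors synchronous so the stack trace points at the real call
set CUDA_LAUNCH_BLOCKING=1

REM Workaround: run without pinned host memory (avoids the crash, may cost
REM model load/offload performance)
python main.py --disable-pinned-memory
```

Once the GGUF loader is up to date, the flag can be dropped again to regain the pinned-memory performance.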
Not the OP, but I had the same error. I have been unable to reproduce it using a different GGUF loader; in my case the Unet Loader (GGUF) custom node works just fine.
The problem loader is:
- name: loadergguf
- version: 2.6.5
- cnr_id: gguf
- ue properties: {"widget_ue_connectable":{},"input_ue_unconnectable":{},"version":"7.4.1"}
Is there a more detailed id I can report?
Update: the default loader that appears highlighted at the top of the list when searching for GGUF loaders immediately returned an error too. Version 2.7.5 looks like the default?
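If that 2.6.5 loader is the culprit, the advice above is to move to the latest ComfyUI-GGUF. A minimal sketch of updating it, assuming the node was installed via git clone into the default custom_nodes folder (ComfyUI-Manager users can update from the Manager UI instead):

```
REM Run from the ComfyUI folder, then restart ComfyUI so the updated loader is picked up
cd custom_nodes\ComfyUI-GGUF
git pull
```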
Thank you, and I apologise for the lack of explanation. The error occurs when processing moves from sampling to VAE decoding; that is, the sampling itself completes normally even when the error occurs. Furthermore, in the current situation, once an error occurs it persists unless ComfyUI is restarted; disabling the node or changing the model does not resolve it. Also, GGUF is not being used.
In this test, after restarting ComfyUI following an error, none of the runs triggered the error again, and time constraints prevented more thorough testing. However, memory usage may be what determines whether a run errors out or completes normally.
In this test, the error occurred with the combination CFG=1, Guidance=3.5, euler + simple, with a negative prompt. Yet even the same combination failed to reproduce it after a restart and completed normally. Furthermore, whether using CFG>1.0 (with a negative prompt) or CFG=1 (without one), errors sometimes occur and sometimes do not in either case. The model used in this test was the FP8 Full Model (*), but during normal usage there is no particular difference in error occurrence between FP16 and FP8, or between the Full Model and the Pruned Model.
- https://civitai.com/models/1032613?modelVersionId=1158144
[On Error]
got prompt 10:39:48
Requested to load Flux
loaded partially; 8480.79 MB usable, 8372.35 MB loaded, 2980.68 MB offloaded, 108.02 MB buffer reserved, lowvram patches: 0
Unloaded partially: 765.10 MB freed, 324.06 MB remains loaded, 36.00 MB buffer reserved, lowvram patches: 0
100%|██████████| 30/30 [02:12<00:00, 4.42s/it]
Requested to load AutoencodingEngine
!!! Exception during processing !!! CUDA error: invalid argument
Search for `cudaErrorInvalidValue' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 515, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 329, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 303, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 291, in process_inputs
result = f(**inputs)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\nodes.py", line 298, in decode
images = vae.decode(samples["samples"])
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\sd.py", line 774, in decode
model_management.load_models_gpu([self.patcher], memory_required=memory_used, force_full_load=self.disable_offload)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\model_management.py", line 671, in load_models_gpu
free_memory(total_memory_required[device] * 1.1 + extra_mem, device)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\model_management.py", line 603, in free_memory
if current_loaded_models[i].model_unload(memory_to_free):
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\model_management.py", line 526, in model_unload
freed = self.model.partially_unload(self.model.offload_device, memory_to_free)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\model_patcher.py", line 904, in partially_unload
m.to(device_to)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1371, in to
return self._apply(convert)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\ops.py", line 639, in _apply
self.register_parameter(key, torch.nn.Parameter(fn(param), requires_grad=False))
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1357, in convert
return t.to(
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\quant_ops.py", line 205, in torch_dispatch
return _GENERIC_UTILS[func](func, args, kwargs)
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\quant_ops.py", line 321, in generic_to_dtype_layout
return _handle_device_transfer(
File "Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\comfy\quant_ops.py", line 272, in _handle_device_transfer
new_q_data = qt._qdata.to(device=target_device)
torch.AcceleratorError: CUDA error: invalid argument
Search for `cudaErrorInvalidValue' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[On success-the first time]
got prompt 10:57:52
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[MultiGPU Core Patching] text_encoder_device_patched returning device: cpu (current_text_encoder_device=cpu)
Requested to load FluxClipModel_
loaded completely; 95367431640625005117571072.00 MB usable, 5013.38 MB loaded, full load: True
CLIP/text encoder model load device: cpu, offload device: cpu, current: cpu, dtype: torch.float32
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\models\insightface\models\antelopev2\1k3d68.onnx landmark_3d_68 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\models\insightface\models\antelopev2\2d106det.onnx landmark_2d_106 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\models\insightface\models\antelopev2\genderage.onnx genderage ['None', 3, 96, 96] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\models\insightface\models\antelopev2\glintr100.onnx recognition ['None', 3, 112, 112] 127.5 127.5
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: Z:\WorkSpace\StabilityMatrix\Data\Packages\ComfyUI\models\insightface\models\antelopev2\scrfd_10g_bnkps.onnx detection [1, 3, '?', '?'] 127.5 128.0
set det-size: (640, 640)
Loaded EVA02-CLIP-L-14-336 model config.
Shape of rope freq: torch.Size([576, 64])
Loading pretrained EVA02-CLIP-L-14-336 weights (eva_clip).
incompatible_keys.missing_keys: ['visual.rope.freqs_cos', 'visual.rope.freqs_sin', 'visual.blocks.0.attn.rope.freqs_cos', 'visual.blocks.0.attn.rope.freqs_sin', 'visual.blocks.1.attn.rope.freqs_cos', 'visual.blocks.1.attn.rope.freqs_sin', 'visual.blocks.2.attn.rope.freqs_cos', 'visual.blocks.2.attn.rope.freqs_sin', 'visual.blocks.3.attn.rope.freqs_cos', 'visual.blocks.3.attn.rope.freqs_sin', 'visual.blocks.4.attn.rope.freqs_cos', 'visual.blocks.4.attn.rope.freqs_sin', 'visual.blocks.5.attn.rope.freqs_cos', 'visual.blocks.5.attn.rope.freqs_sin', 'visual.blocks.6.attn.rope.freqs_cos', 'visual.blocks.6.attn.rope.freqs_sin', 'visual.blocks.7.attn.rope.freqs_cos', 'visual.blocks.7.attn.rope.freqs_sin', 'visual.blocks.8.attn.rope.freqs_cos', 'visual.blocks.8.attn.rope.freqs_sin', 'visual.blocks.9.attn.rope.freqs_cos', 'visual.blocks.9.attn.rope.freqs_sin', 'visual.blocks.10.attn.rope.freqs_cos', 'visual.blocks.10.attn.rope.freqs_sin', 'visual.blocks.11.attn.rope.freqs_cos', 'visual.blocks.11.attn.rope.freqs_sin', 'visual.blocks.12.attn.rope.freqs_cos', 'visual.blocks.12.attn.rope.freqs_sin', 'visual.blocks.13.attn.rope.freqs_cos', 'visual.blocks.13.attn.rope.freqs_sin', 'visual.blocks.14.attn.rope.freqs_cos', 'visual.blocks.14.attn.rope.freqs_sin', 'visual.blocks.15.attn.rope.freqs_cos', 'visual.blocks.15.attn.rope.freqs_sin', 'visual.blocks.16.attn.rope.freqs_cos', 'visual.blocks.16.attn.rope.freqs_sin', 'visual.blocks.17.attn.rope.freqs_cos', 'visual.blocks.17.attn.rope.freqs_sin', 'visual.blocks.18.attn.rope.freqs_cos', 'visual.blocks.18.attn.rope.freqs_sin', 'visual.blocks.19.attn.rope.freqs_cos', 'visual.blocks.19.attn.rope.freqs_sin', 'visual.blocks.20.attn.rope.freqs_cos', 'visual.blocks.20.attn.rope.freqs_sin', 'visual.blocks.21.attn.rope.freqs_cos', 'visual.blocks.21.attn.rope.freqs_sin', 'visual.blocks.22.attn.rope.freqs_cos', 'visual.blocks.22.attn.rope.freqs_sin', 'visual.blocks.23.attn.rope.freqs_cos', 'visual.blocks.23.attn.rope.freqs_sin']
Loading PuLID-Flux model.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.bfloat16, manual cast: torch.bfloat16
model_type FLUX
unet unexpected: ['scaled_fp8']
Requested to load PulidFluxModel
loaded completely; 95367431640625005117571072.00 MB usable, 1085.10 MB loaded, full load: True
Requested to load Flux
loaded partially; 8484.78 MB usable, 8376.65 MB loaded, 2976.38 MB offloaded, 108.02 MB buffer reserved, lowvram patches: 0
100%|██████████| 30/30 [02:01<00:00, 4.04s/it]
Requested to load AutoencodingEngine
Unloaded partially: 4089.86 MB freed, 4286.79 MB remains loaded, 162.11 MB buffer reserved, lowvram patches: 0
loaded completely; 855.80 MB usable, 159.87 MB loaded, full load: True
Prompt executed in 256.18 seconds
[On success-the second time]
got prompt 11:03:27
loaded partially; 8480.79 MB usable, 8372.35 MB loaded, 2980.68 MB offloaded, 108.02 MB buffer reserved, lowvram patches: 0
100%|██████████| 30/30 [02:00<00:00, 4.00s/it]
Requested to load AutoencodingEngine
Unloaded partially: 4085.55 MB freed, 4286.79 MB remains loaded, 162.11 MB buffer reserved, lowvram patches: 0
loaded completely; 853.80 MB usable, 159.87 MB loaded, full load: True
Prompt executed in 167.23 seconds