UnboundLocalError: local variable 'self_attn_func' referenced before assignment
File "/lib/python3.10/site-packages/q8_kernels/integration/utils.py", line 104, in get_attention_func
(self_attn_func, self_attn_memory_layout),
UnboundLocalError: local variable 'self_attn_func' referenced before assignment
The kernel file neither raises a descriptive error nor falls back to a default when the GPU is not one of the architectures listed in the code: no branch ever assigns self_attn_func, so the function crashes with an UnboundLocalError instead of a clear message.
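For illustration, the failure mode is consistent with a dispatch of roughly this shape (a minimal sketch, not the actual q8_kernels source; only get_attention_func and the two variable names come from the traceback, everything else is assumed):

import torch

def fp8_attention(*args, **kwargs): ...   # placeholder for the real kernel
def int8_attention(*args, **kwargs): ...  # placeholder for the real kernel

def get_attention_func(use_fp8_attention):
    cap = torch.cuda.get_device_capability()  # e.g. (8, 6) on an RTX 3090/3050
    if cap >= (8, 9):  # assumed check: kernels only built for Ada/Hopper
        self_attn_func = fp8_attention if use_fp8_attention else int8_attention
        self_attn_memory_layout = "NHD"
        cross_attn_func = fp8_attention
        cross_attn_memory_layout = "NHD"
    # No else branch: on Ampere and older, nothing is ever assigned, so the
    # return below raises UnboundLocalError.
    return (
        (self_attn_func, self_attn_memory_layout),
        (cross_attn_func, cross_attn_memory_layout),
    )

An explicit else that raises (or falls back to the default attention) would turn this into an actionable error:

    else:
        raise NotImplementedError(
            f"Q8/FP8 attention requires compute capability >= (8, 9), got {cap}"
        )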
I also hit the same problem on an RTX 3090. Have you solved it?
Unfortunately, no. I have been using Wan2GP (https://github.com/deepbeepmeep/Wan2GP) for inference instead, and it works well on 24 GB cards.
Reproducing on an RTX 3050 laptop GPU (4 GB VRAM, Windows 11), PyTorch 2.8.0+cu128, ComfyUI v0.3.61-1-g6e079abc (2025-09-29), LTXVideo custom node (nightly), trying to use Q8-Kernels to accelerate FP8 inference.

Steps to reproduce:
1. Install LTXVideo via ComfyUI Manager.
2. Download ltxv-2b-0.9.8-distilled-fp8.safetensors to models/checkpoints.
3. Load the ltxvideo-i2v-distilled.json workflow.
4. Select ltxv-2b-0.9.8-distilled-fp8.safetensors in the LTXV Loader.
5. Add the LTXVideo Q8 Patcher node after the loader.
6. Provide a 512x512 input image and a simple prompt (e.g., "animate a waving hand").
7. Queue the prompt (4-8 steps, 14 frames).
Full Error Log
got prompt
model weight dtype torch.float16, manual cast: None
model_type FLUX
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
no CLIP/text encoder weights in checkpoint, the text encoder model will not be loaded.
Requested to load MochiTEModel_
loaded completely 9.5367431640625e+25 9083.38671875 True
CLIP/text encoder model load device: cpu, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load VideoVAE
loaded partially 1882.2000005722045 1882.1682147979736 0
!!! Exception during processing !!! local variable 'self_attn_func' referenced before assignment
Traceback (most recent call last):
  File "ComfyUI/execution.py", line 496, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  File "ComfyUI/execution.py", line 315, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  File "ComfyUI/execution.py", line 289, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "ComfyUI/execution.py", line 277, in process_inputs
    result = f(**inputs)
  File "ComfyUI/custom_nodes/ComfyUI-LTXVideo/q8_nodes.py", line 62, in patch
    patcher(transformer, use_fp8_attention, True)
  File "venv/lib/site-packages/q8_kernels/integration/patch_transformer.py", line 26, in patch_comfyui_native_transformer
    attn_forward = cross_attn_forward(use_fp8_attention)
  File "venv/lib/site-packages/q8_kernels/integration/comfyui_native.py", line 66, in cross_attn_forward
    self_attn, cross_attn, is_out_tuple = get_attention_func(use_fp8_attention)
  File "venv/lib/site-packages/q8_kernels/integration/utils.py", line 104, in get_attention_func
    (self_attn_func, self_attn_memory_layout),
UnboundLocalError: local variable 'self_attn_func' referenced before assignment

Workaround: Removing or bypassing the Q8 Patcher node allows the base FP8 model to run (slower, but functional). Happy to test fixes or provide more details!
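Until this is fixed upstream, a capability check before the patch call would let workflows degrade gracefully on unsupported GPUs. A sketch (supports_q8_kernels and the sm_89 threshold are assumptions on my part; the patcher(transformer, use_fp8_attention, True) call is taken from the traceback above):

import torch

def supports_q8_kernels():
    # Assumption: the Q8/FP8 attention kernels need Ada (sm_89) or newer.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# e.g. inside the Q8 Patcher node's patch() before calling into q8_kernels:
if supports_q8_kernels():
    patcher(transformer, use_fp8_attention, True)
else:
    print("Q8 kernels unsupported on this GPU; running the unpatched FP8 model.")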