Is it possible to use SageAttention in Windows?
I can't seem to get it working. If someone did, could you help me out?
It requires Triton, which isn't available for Windows, so currently no.
Can we use this method to install Triton in Windows? https://blog.csdn.net/qyhua/article/details/136470715
You can always try. Last time I tried, I could get it installed, but the CUDA functions didn't work, so it wasn't useful.
Thanks a lot. I tried it before for lama-cleaner, but it didn't work. I thought lama-cleaner was causing the problem. Now I know.
Thanks for the responses. I've been trying for a while and could get Triton and SageAttention installed, but there are countless errors and, in the end, it doesn't work. Let's hope the Triton team eventually releases a Windows version, but they said they wouldn't.
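For anyone wiring SageAttention into a workflow, one way to avoid these hard failures is to probe for it up front and fall back to a stock attention implementation when it (or Triton) isn't importable, e.g. on Windows. This is a minimal stdlib-only sketch; the function name is my own, not part of either library:

```python
import importlib.util
import platform


def sageattention_available() -> bool:
    """Best-effort check: SageAttention needs Triton, which has no
    official Windows wheels, so both modules must be importable."""
    has_triton = importlib.util.find_spec("triton") is not None
    has_sage = importlib.util.find_spec("sageattention") is not None
    if platform.system() == "Windows" and not has_triton:
        return False
    return has_triton and has_sage


# Callers can fall back to PyTorch's built-in
# torch.nn.functional.scaled_dot_product_attention when this is False.
```

This only checks importability, not whether the Triton kernels actually run on your GPU, so a try/except around the first real call is still a good idea.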
I tried it on Linux and found that it only works in fp8 fast_mode, which is only available for RTX40XX. Sad with my 3090
Really? It does run in bf16 and fp16 for me on 4090, haven't been able to try on 3090.
Well, I don't know what was actually wrong. No matter whether fp8_transformer was enabled or disabled, running it always gave the following errors:
!!! Exception during processing !!! at 14:12:
    off_blk = tl.program_id(0)
    x_offset = off_b * L * C
    offs_m = off_blk * BLK + tl.arange(0, BLK)
    offs_k = tl.arange(0, C)

    x_ptrs = X + x_offset + offs_m[:, None] * C + offs_k[None, :]
    x_int8_ptrs = X_int8 + x_offset + offs_m[:, None] * C + offs_k[None, :]
    scale_ptrs = Scale + off_b * scale_stride + off_blk

    x = tl.load(x_ptrs, mask=offs_m[:, None] < L)
    x *= (C**-0.5 * 1.44269504)
    scale = tl.max(tl.abs(x)) / 127.
            ^
triton.compiler.errors.CompilationError: at 2:11:
def _elementwise_max(a, b):
    return core.maximum(a, b)
           ^
RecursionError('maximum recursion depth exceeded in comparison')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/core.py", line 35, in wrapper
    return fn(*args, **kwargs)
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/core.py", line 1976, in reduce
    return reduce((input, ), axis, combine_fn, keep_dims=keep_dims, _builder=_builder, _generator=_generator)[0]
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/core.py", line 35, in wrapper
    return fn(*args, **kwargs)
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/core.py", line 2003, in reduce
    ret = semantic.reduction(input, axis, make_combine_region, _builder)
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1455, in reduction
    region_builder_fn(reduce_op)
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/core.py", line 1987, in make_combine_region
    results = _generator.call_JitFunction(combine_fn, args, kwargs={})
triton.compiler.errors.CompilationError: at 15:15:
    if return_indices_tie_break_left:
        return core._reduce_with_indices(input, axis, _argmax_combine_tie_break_left, keep_dims=keep_dims)
    else:
        return core._reduce_with_indices(input, axis, _argmax_combine_tie_break_fast, keep_dims=keep_dims)
else:
    if core.constexpr(input.dtype.primitive_bitwidth) < core.constexpr(32):
        if core.constexpr(input.dtype.is_floating()):
            input = input.to(core.float32)
        else:
            assert input.dtype.is_int(), "Expecting input to be integer type"
            input = input.to(core.int32)
    return core.reduce(input, axis, _elementwise_max, keep_dims=keep_dims)
           ^
The above exception was the direct cause of the following exception:
triton.compiler.errors.CompilationError: at 15:15:
    if return_indices_tie_break_left:
        return core._reduce_with_indices(input, axis, _argmax_combine_tie_break_left, keep_dims=keep_dims)
    else:
        return core._reduce_with_indices(input, axis, _argmax_combine_tie_break_fast, keep_dims=keep_dims)
else:
    if core.constexpr(input.dtype.primitive_bitwidth) < core.constexpr(32):
        if core.constexpr(input.dtype.is_floating()):
            input = input.to(core.float32)
        else:
            assert input.dtype.is_int(), "Expecting input to be integer type"
            input = input.to(core.int32)
    return core.reduce(input, axis, _elementwise_max, keep_dims=keep_dims)
           ^
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/cheezecrisp/AI/ComfyUI/execution.py", line 323, in execute
output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/execution.py", line 198, in get_output_data
return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/execution.py", line 169, in _map_node_over_list
process_inputs(input_dict, i)
File "/home/cheezecrisp/AI/ComfyUI/execution.py", line 158, in process_inputs
results.append(getattr(obj, func)(**inputs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/nodes.py", line 1075, in process
latents = pipe(
^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/cogvideox_fun/pipeline_cogvideox_inpaint.py", line 1079, in call
noise_pred = self.transformer(
^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/cogvideox_fun/transformer_3d.py", line 672, in forward
hidden_states, encoder_hidden_states = block(
^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/cogvideox_fun/transformer_3d.py", line 327, in forward
attn_hidden_states, attn_encoder_hidden_states = self.attn1(
^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/diffusers/models/attention_processor.py", line 490, in forward
return self.processor(
^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/cogvideox_fun/transformer_3d.py", line 100, in call
hidden_states = sageattn(query, key, value, is_causal=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/sageattention/core.py", line 41, in sageattn
q_int8, q_scale, k_int8, k_scale = per_block_int8(q, k)
^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/sageattention/quant_per_block.py", line 63, in per_block_int8
q_kernel_per_block_int8[grid](
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in
x_ptrs = X + x_offset + offs_m[:, None] * C + offs_k[None, :]
x_int8_ptrs = X_int8 + x_offset + offs_m[:, None] * C + offs_k[None, :]
scale_ptrs = Scale + off_b * scale_stride + off_blk
x = tl.load(x_ptrs, mask=offs_m[:, None] < L)
x *= (C**-0.5 * 1.44269504)
scale = tl.max(tl.abs(x)) / 127.
^
Prompt executed in 71.51 seconds
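For context, the crash happens inside SageAttention's per_block_int8 quantizer: the tl.max reduction in the kernel above fails to compile with a RecursionError before any attention math runs. Based only on the kernel source lines in the traceback, here is a rough pure-Python sketch of what that quantizer computes per block; names and the block size default are illustrative, not SageAttention's actual API:

```python
def per_block_int8(x, blk=64):
    """Sketch of per-block int8 quantization as in the kernel above:
    values are pre-scaled by C**-0.5 * 1.44269504 (1/sqrt(C) * log2(e)),
    then each block of `blk` rows shares one scale = max(|x|) / 127."""
    C = len(x[0])
    pre = C ** -0.5 * 1.44269504
    x = [[v * pre for v in row] for row in x]

    blocks, scales = [], []
    for start in range(0, len(x), blk):
        block = x[start:start + blk]
        # One scale per block, chosen so the largest magnitude maps to 127.
        scale = max(abs(v) for row in block for v in row) / 127.0
        scales.append(scale)
        q = [[int(round(v / scale)) if scale else 0 for v in row]
             for row in block]
        blocks.append(q)
    return blocks, scales
```

The real kernel does this on the GPU with one Triton program per block; the point of the sketch is just to show that the failing tl.max call is computing the per-block scale, so the error occurs during quantization of Q/K, not in the attention itself.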