Is it possible to use SageAttention in Windows?
I can't seem to get it working. If someone did, could you help me out?
It requires Triton, which isn't available for Windows, so currently no.
Can we use this method to install Triton in Windows? https://blog.csdn.net/qyhua/article/details/136470715
You can always try. Last time I tried, I could get it installed, but the CUDA functions didn't work, so it wasn't useful.
Thanks a lot. I tried it before for lama-cleaner, but it didn't work. I thought lama-cleaner was causing the problem. Now I know.
Thanks for the responses. I've been trying for a while and could get Triton and SageAttention installed, but there are countless errors and, in the end, it doesn't work. Let's hope the Triton team eventually releases a Windows version, but they said they wouldn't.
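For anyone wiring SageAttention into a workflow, one way to avoid these hard failures is to probe for it up front and fall back to a stock attention implementation when it (or Triton) isn't importable, e.g. on Windows. This is a minimal stdlib-only sketch; the function name is my own, not part of either library:

```python
import importlib.util
import platform


def sageattention_available() -> bool:
    """Best-effort check: SageAttention needs Triton, which has no
    official Windows wheels, so both modules must be importable."""
    has_triton = importlib.util.find_spec("triton") is not None
    has_sage = importlib.util.find_spec("sageattention") is not None
    if platform.system() == "Windows" and not has_triton:
        return False
    return has_triton and has_sage


# Callers can fall back to PyTorch's built-in
# torch.nn.functional.scaled_dot_product_attention when this is False.
```

This only checks importability, not whether the Triton kernels actually run on your GPU, so a try/except around the first real call is still a good idea.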
I tried it on Linux and found that it only works in fp8 fast_mode, which is only available for RTX40XX. Sad with my 3090
Really? It does run in bf16 and fp16 for me on 4090, haven't been able to try on 3090.
Well, I don't know what was actually wrong. No matter whether fp8_transformer was enabled or disabled, running it always gave the following errors:
!!! Exception during processing !!! at 14:12:
    off_blk = tl.program_id(0)
    x_offset = off_b * L * C
    offs_m = off_blk * BLK + tl.arange(0, BLK)
    offs_k = tl.arange(0, C)

    x_ptrs = X + x_offset + offs_m[:, None] * C + offs_k[None, :]
    x_int8_ptrs = X_int8 + x_offset + offs_m[:, None] * C + offs_k[None, :]
    scale_ptrs = Scale + off_b * scale_stride + off_blk

    x = tl.load(x_ptrs, mask=offs_m[:, None] < L)
    x *= (C**-0.5 * 1.44269504)
    scale = tl.max(tl.abs(x)) / 127.
            ^
triton.compiler.errors.CompilationError: at 2:11:
def _elementwise_max(a, b):
    return core.maximum(a, b)
           ^
RecursionError('maximum recursion depth exceeded in comparison')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/core.py", line 35, in wrapper
    return fn(*args, **kwargs)
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/core.py", line 1976, in reduce
    return reduce((input, ), axis, combine_fn, keep_dims=keep_dims, _builder=_builder, _generator=_generator)[0]
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/core.py", line 35, in wrapper
    return fn(*args, **kwargs)
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/core.py", line 2003, in reduce
    ret = semantic.reduction(input, axis, make_combine_region, _builder)
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1455, in reduction
    region_builder_fn(reduce_op)
  File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/language/core.py", line 1987, in make_combine_region
    results = _generator.call_JitFunction(combine_fn, args, kwargs={})
triton.compiler.errors.CompilationError: at 15:15:
    if return_indices_tie_break_left:
        return core._reduce_with_indices(input, axis, _argmax_combine_tie_break_left, keep_dims=keep_dims)
    else:
        return core._reduce_with_indices(input, axis, _argmax_combine_tie_break_fast, keep_dims=keep_dims)
else:
    if core.constexpr(input.dtype.primitive_bitwidth) < core.constexpr(32):
        if core.constexpr(input.dtype.is_floating()):
            input = input.to(core.float32)
        else:
            assert input.dtype.is_int(), "Expecting input to be integer type"
            input = input.to(core.int32)
    return core.reduce(input, axis, _elementwise_max, keep_dims=keep_dims)
           ^
The above exception was the direct cause of the following exception:
triton.compiler.errors.CompilationError: at 15:15:
    if return_indices_tie_break_left:
        return core._reduce_with_indices(input, axis, _argmax_combine_tie_break_left, keep_dims=keep_dims)
    else:
        return core._reduce_with_indices(input, axis, _argmax_combine_tie_break_fast, keep_dims=keep_dims)
else:
    if core.constexpr(input.dtype.primitive_bitwidth) < core.constexpr(32):
        if core.constexpr(input.dtype.is_floating()):
            input = input.to(core.float32)
        else:
            assert input.dtype.is_int(), "Expecting input to be integer type"
            input = input.to(core.int32)
    return core.reduce(input, axis, _elementwise_max, keep_dims=keep_dims)
           ^
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/cheezecrisp/AI/ComfyUI/execution.py", line 323, in execute
output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/execution.py", line 198, in get_output_data
return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/execution.py", line 169, in _map_node_over_list
process_inputs(input_dict, i)
File "/home/cheezecrisp/AI/ComfyUI/execution.py", line 158, in process_inputs
results.append(getattr(obj, func)(**inputs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/nodes.py", line 1075, in process
latents = pipe(
^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/cogvideox_fun/pipeline_cogvideox_inpaint.py", line 1079, in call
noise_pred = self.transformer(
^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/cogvideox_fun/transformer_3d.py", line 672, in forward
hidden_states, encoder_hidden_states = block(
^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/cogvideox_fun/transformer_3d.py", line 327, in forward
attn_hidden_states, attn_encoder_hidden_states = self.attn1(
^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/diffusers/models/attention_processor.py", line 490, in forward
return self.processor(
^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/cogvideox_fun/transformer_3d.py", line 100, in call
hidden_states = sageattn(query, key, value, is_causal=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/sageattention/core.py", line 41, in sageattn
q_int8, q_scale, k_int8, k_scale = per_block_int8(q, k)
^^^^^^^^^^^^^^^^^^^^
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/sageattention/quant_per_block.py", line 63, in per_block_int8
q_kernel_per_block_int8[grid](
File "/home/cheezecrisp/AI/ComfyUI/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in
x_ptrs = X + x_offset + offs_m[:, None] * C + offs_k[None, :]
x_int8_ptrs = X_int8 + x_offset + offs_m[:, None] * C + offs_k[None, :]
scale_ptrs = Scale + off_b * scale_stride + off_blk
x = tl.load(x_ptrs, mask=offs_m[:, None] < L)
x *= (C**-0.5 * 1.44269504)
scale = tl.max(tl.abs(x)) / 127.
^
Prompt executed in 71.51 seconds
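For context, the crash happens inside SageAttention's per_block_int8 quantizer: the tl.max reduction in the kernel above fails to compile with a RecursionError before any attention math runs. Based only on the kernel source lines in the traceback, here is a rough pure-Python sketch of what that quantizer computes per block; names and the block size default are illustrative, not SageAttention's actual API:

```python
def per_block_int8(x, blk=64):
    """Sketch of per-block int8 quantization as in the kernel above:
    values are pre-scaled by C**-0.5 * 1.44269504 (1/sqrt(C) * log2(e)),
    then each block of `blk` rows shares one scale = max(|x|) / 127."""
    C = len(x[0])
    pre = C ** -0.5 * 1.44269504
    x = [[v * pre for v in row] for row in x]

    blocks, scales = [], []
    for start in range(0, len(x), blk):
        block = x[start:start + blk]
        # One scale per block, chosen so the largest magnitude maps to 127.
        scale = max(abs(v) for row in block for v in row) / 127.0
        scales.append(scale)
        q = [[int(round(v / scale)) if scale else 0 for v in row]
             for row in block]
        blocks.append(q)
    return blocks, scales
```

The real kernel does this on the GPU with one Triton program per block; the point of the sketch is just to show that the failing tl.max call is computing the per-block scale, so the error occurs during quantization of Q/K, not in the attention itself.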