triton-windows
Fork of the Triton language and compiler for Windows support and easy installation
Triton fork for Windows support
See release/3.3.x-windows branch for the code, forked from release/3.3.x branch of the official repo. The main-windows branch is unstable and may be force pushed.
Based on the work of andreigh, wkpark, mantaionut, eaplatanios, anmyachev, and more developers in the community. Thank you all!
Why?
- Free software should run on non-free platforms, as per Richard Stallman
- This is required by torch.compile, torchao, SageAttention, Unsloth, and more packages
- Memory/disk swap on WSL is hard
- Catgirl matters
Progress
- triton.jit and torch.compile just work
- All unit tests passed (Thanks Comfy Org for generously providing the CI runners!)
- When I run Flux or HunyuanVideo in ComfyUI on Windows, it's almost as fast as on WSL on the same machine
- Windows 10 and 11 are supported
- Only Nvidia GPUs are supported, help wanted to support other backends
- For AMD GPU, you may try https://github.com/Repeerc/triton-amdgpu-windows
Installation
Triton accelerates your AI model by compiling GPU kernels on your computer. You need to install it in the correct environment.
1. GPU
Check your GPU model. Technically, GPUs are categorized by 'compute capability' (also known as 'CUDA arch' or 'sm'), and here I use RTX models as examples:
RTX 50xx (Blackwell)
This only works with Triton >= 3.3, PyTorch >= 2.7, and CUDA 12.8.
RTX 40xx (Ada)
This is officially supported by Triton.
RTX 30xx (Ampere)
This is officially supported by Triton, but fp8 (also known as float8) will not work, see the known issue. I recommend using GGUF instead of fp8 models in this case.
RTX 20xx (Turing) or older
This is not officially supported by Triton. It can run some simple AI models, but not always. fp8 (also known as float8) and bf16 (also known as bfloat16) will not work. I recommend using GGUF instead of fp8 or bf16 models in this case.
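If you're not sure which compute capability your GPU has, you can ask PyTorch (a quick check, assuming PyTorch with CUDA is already installed):

import torch

# (8, 6) is sm_86 (Ampere), (8, 9) is sm_89 (Ada), (12, 0) is sm_120 (Blackwell)
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))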
2. Python environment
Check how your Python is installed. Any of the following environments is supported:
- Embedded: You use an all-in-one package of ComfyUI (or some other AI software)
  - There should be a folder python_embeded in the ComfyUI installation folder
    - For FramePack, it's system\python in the FramePack installation folder
    - Other AI software may put this folder at a different path
  - In this case, don't directly run python, but use the full path C:\path\to\python_embeded\python.exe
  - Also, don't directly run pip, but instead run C:\path\to\python_embeded\python.exe -m pip
  - By default there is no pip.exe in the folder python_embeded. If you directly run pip, you're actually running a pip.exe installed somewhere else on your computer
  - It's ok to first cd to python_embeded, then run .\python.exe, but remember to add .\ to run an executable in the current folder. In PowerShell, without .\, you're still running a python.exe installed somewhere else on your computer
- System-wide: You install Python at a location like C:\Python312\ or C:\Program Files\Python312\ and directly use it
- User-wide: You install Python at a location like C:\Users\<your username>\AppData\Local\Programs\Python\Python312\ and directly use it
- conda: You create a virtual environment using conda
- Python venv: You create a virtual environment using venv or virtualenv
I don't recommend installing Python from Windows Store, because it's complicated to interact with a 'packaged' Windows app.
For other environment managers like poetry or uv, if you find problems, please open an issue.
Make sure you know which environment you're using. You can run Get-Command -All python in PowerShell (or where python in cmd) to see the installation path of Python, and python --version to see its version. If you see multiple Python installations, make sure that you install and run everything from the first one.
- For example, if you think you're using Python 3.12, but pip downloads a wheel with cp311 in its name, then you're not using the Python environment you think you are
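If in doubt, you can also ask the running Python itself where it lives (this works in any of the environments above):

import sys

print(sys.executable)  # the full path of the Python that is actually running
print(sys.version)     # should match the version you expect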
Don't mix two environments, unless you know them very well.
- If you're using ComfyUI with embedded Python, then don't use conda or venv
- If you're already using conda, then always create a new env using conda, and don't use Python venv
3. PyTorch
Although technically Triton can be used alone, in the following let's assume you use it with PyTorch. Check your PyTorch version:
Triton 3.3 works with PyTorch >= 2.7.
Triton 3.2 works with PyTorch >= 2.6.
Triton 3.1 works with PyTorch >= 2.4. PyTorch 2.3 and older are not supported.
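You can check the installed PyTorch version and its CUDA build like this (assuming PyTorch is already installed):

import torch

print(torch.__version__)         # e.g. 2.7.0+cu128
print(torch.version.cuda)        # the CUDA version PyTorch was built with, e.g. 12.8
print(torch.cuda.is_available())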
4. CUDA
Since the release triton-windows==3.2.0.post11, a minimal CUDA toolchain is bundled in the Triton wheels, so you don't need to manually install it.
Triton 3.1 and 3.2 bundle CUDA 12.4, and Triton 3.3 bundles CUDA 12.8, see nvidia-toolchain-version.json
If you need to override the CUDA toolchain, you can set the environment variable CUDA_PATH.
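For example, you can set CUDA_PATH at the start of your script, before Triton compiles anything (a minimal sketch; the path below is the default system-wide install location and is only an example):

import os

# Hypothetical path, change the version number according to your installation
os.environ["CUDA_PATH"] = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8"

import triton  # Triton picks up CUDA_PATH when it compiles kernels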
Instructions for older or custom wheels without bundled CUDA
CUDA 12 is required. CUDA 11 and older are not supported. Choose either of the following ways to install CUDA:
a) System-wide: Recommended for most people
- Install PyTorch with CUDA using pip
- Install CUDA toolkit from CUDA toolkit archive
  - When installing, you need to choose both 'CUDA Development' and 'CUDA Runtime'. Make sure these folders exist on your computer (change the version number according to your installation):
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\lib\x64
- Then you need to add the path of CUDA to the Windows PATH:
  - The path is like C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin
  - Make sure this folder exists
- If you open a new PowerShell, type ptxas --version, and it shows your CUDA version like Cuda compilation tools, release 12.8, V12.8.61, then you're doing it right
b) conda: Do this only if you're already using conda
- Install the following packages:
  conda install -c conda-forge cuda-nvcc pytorch-gpu
- Starting from PyTorch 2.6, PyTorch is no longer released in the pytorch channel, and it should be installed from the conda-forge channel
c) pip: Do this if you don't want to install too much boilerplate, and you want to contain everything in a venv, with minimal impact to the system
- Install PyTorch with CUDA using pip
- Install the following packages:
  pip install nvidia-cuda-nvcc-cu12 nvidia-cuda-runtime-cu12
- There should be a folder Lib\site-packages\nvidia\cuda_runtime\ in your Python installation path (or venv), and you need to add a library in it
  - Download it from https://github.com/woct0rdho/triton-windows/releases/download/v3.2.0-windows.post9/cuda_12.8_lib.zip
    - Choose 12.4, 12.6, or 12.8 according to your CUDA version
  - Put the folder lib into cuda_runtime
For details about version compatibility of various pip packages and CUDA, see https://github.com/woct0rdho/triton-windows/issues/43
5. C compiler
Since the release triton-windows==3.2.0.post13, TinyCC is bundled in the Triton wheels, so you don't need to manually install a C compiler to use Triton. Packages that directly call triton.jit, such as SageAttention, will just work.
You still need to install a C++ compiler if you use torch.compile targeting CPU. This may happen when you use nodes like 'CompileModel' in ComfyUI. Triton does not affect how PyTorch configures the C++ compiler in this case.
If you need to override the C compiler, you can set the environment variable CC. MSVC, GCC, and Clang are supported for the JIT compilation in Triton.
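For example, you can set CC at the start of your script, before Triton compiles anything (a sketch, assuming CC takes the full path of the compiler executable; the MSVC path below is only an example and depends on your installation):

import os

# Hypothetical path, change the version numbers according to your installation
os.environ["CC"] = r"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.43.34808\bin\Hostx64\x64\cl.exe"

import triton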
Instructions for older or custom wheels without bundled TinyCC
If you don't have a C compiler, I recommend installing MSVC and Windows SDK.
- You can install them in Visual Studio
- If you don't want to install the whole Visual Studio, you can just install Visual Studio Build Tools
- Visual Studio >= 2017 is supported
- Choose the latest version of MSVC and Windows SDK from the list
Then you need to add the path containing cl.exe to the Windows PATH:
- The path is like C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.43.34808\bin\Hostx64\x64
  - Change the version numbers according to your installation, and make sure this folder actually exists on your computer
- If you open a new PowerShell, type cl, and it shows Microsoft (R) C/C++ Optimizing Compiler ..., then you're doing it right
Note on automatically adding the path
(Do this if you don't want to permanently modify the Windows PATH)
Before running Python, if you use PowerShell, run the following: (Find the ps1 file according to your installation)
&"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\Tools\Launch-VsDevShell.ps1" -Arch amd64
Or if you use cmd, run the following: (This is equivalent to 'x64 Native Tools Command Prompt' from the Start menu)
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\Tools\VsDevCmd.bat" -arch=amd64
It automatically adds the paths containing cl.exe and other relevant VS components, see https://github.com/woct0rdho/triton-windows/issues/79 . Although it does not set the environment variable CC, it sets VCINSTALLDIR, VCToolsVersion, WindowsSdkDir, WindowsSDKVersion, and Triton will recognize them.
6. vcredist
vcredist is required (also known as 'Visual C++ Redistributable for Visual Studio 2015-2022', msvcp140.dll, vcruntime140.dll), because libtriton.pyd is compiled by MSVC. Install it from https://aka.ms/vs/17/release/vc_redist.x64.exe
7. Triton
Since the release triton-windows==3.2.0.post11, the wheels are published to https://pypi.org/project/triton-windows/
If you've installed an old version of triton, first uninstall it:
pip uninstall triton
Now you can install triton-windows 3.3, or upgrade the already installed version. To prevent breaking your installed PyTorch when a new version of Triton is released in the future, you can pin the version to < 3.4:
pip install -U "triton-windows<3.4"
Note again that if you're using the embedded Python, then instead of directly running pip, you need:
C:\path\to\python_embeded\python.exe -m pip install -U "triton-windows<3.4"
For Triton 3.2, you need:
pip install -U "triton-windows<3.3"
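After installation, you can check which Triton version is picked up, in the same Python environment:

import triton

print(triton.__version__)  # e.g. 3.3.0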
8. Special notes for ComfyUI with embedded Python
- There should be a Python folder
  - For ComfyUI, it's python_embeded in the ComfyUI installation folder
  - For FramePack, it's system\python in the FramePack installation folder
  - Other AI software may put the Python folder at a different path
  - If you created a venv, depending on how you created it, the Python folder may be the venv folder or the venv\Scripts\ folder
  - If you're not sure, you can run os.path.dirname(sysconfig.get_paths()["include"]) to find the Python folder (see the snippet after this list), see py_include_dir
- You need to put two folders include and libs into the Python folder to make Triton work
  - Be careful: It is 'libs', not 'lib'. There may already be a folder Lib in the Python folder, containing things like site-packages or __future__.py. You should not modify the Lib folder
  - If you're using ComfyUI_windows_portable >= 0.2.4 with Python 3.12, then download the two folders here: python_3.12.7_include_libs.zip
  - If you're using FramePack with Python 3.10, then download the two folders here: python_3.10.11_include_libs.zip
  - The minor version (3.9/3.10 ...) must be correct, but the patch version (3.10.6/3.10.7 ...) can be different
  - If you're using another Python version, you can find the two folders at https://github.com/woct0rdho/triton-windows/releases/v3.0.0-windows.post1/
- (For developers: This is equivalent to python-dev on Linux, and you can obtain the two folders from nuget when bundling Python in your app, see https://github.com/comfyanonymous/ComfyUI/pull/7200 )
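If you're not sure where the Python folder is, this snippet prints it (it's the same py_include_dir logic mentioned in the list above):

import os
import sysconfig

# The parent of the 'include' path is the Python folder where 'include' and 'libs' go
print(os.path.dirname(sysconfig.get_paths()["include"]))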
Test if it works
Before using Triton in larger projects like ComfyUI, please run the following script to test if Triton itself works.
- You need to save the code in a file, such as test_triton.py, then run python test_triton.py
- When you open an issue, please show the command you use to run this test, and the full error log
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block of BLOCK_SIZE elements
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds access
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    # Launch enough program instances to cover all elements
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

a = torch.rand(3, device="cuda")
b = a + a
b_compiled = add(a, a)
print(b_compiled - b)
print("If you see tensor([0., 0., 0.], device='cuda:0'), then it works")
Troubleshoot the test above
ModuleNotFoundError: No module named 'triton.language'; 'triton' is not a package
Don't name the test script triton.py. Also, check if there is a folder named triton in your current directory. If so, Python will think it's the 'triton' package and fail to import.
AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'
This is because your setuptools is outdated. Run the following and try again:
python -m ensurepip -U
python -m pip install -U pip
python -m pip install -U setuptools
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\<your username>\\.triton'
This is because of the permission settings of your user folder, see https://github.com/lllyasviel/FramePack/issues/221
ImportError: DLL load failed while importing libtriton
This is usually because your vcredist DLLs are too old.
If you're using conda, then you may try:
conda install -c conda-forge vc14_runtime
If you're not using conda, then you need to find the vcredist DLLs (vcruntime140.dll, vcruntime140_1.dll) in your Python installation folder:
Embedded Python (You use an all-in-one package of ComfyUI or some other AI software)
- For ComfyUI, the DLLs should be in the folder python_embeded
- For FramePack, it's system\python in the FramePack installation folder
- Other AI software may put this folder at a different path
Other Python installation (system-wide/user-wide/venv)
If you're not sure, you can run the following in the same Python environment:
import sysconfig
print(sysconfig.get_paths())
For example, it may show {'stdlib': 'C:\\Python312\\Lib', 'platstdlib': 'C:\\tmp\\.venv\\Lib', ...}, where stdlib shows that the 'base' Python installation folder (not the venv folder) is C:\Python312\ (without the trailing Lib). The DLLs should be in this folder.
After finding the DLLs in the Python installation folder, you can install the latest vcredist, then copy the DLLs msvcp140.dll, vcruntime140.dll, vcruntime140_1.dll from C:\Windows\System32\ to the Python installation folder, and replace the existing ones.
You can right-click the DLL -> Properties -> Details to see its version. A new enough version, such as 14.42, is required by my Triton wheels.
ImportError: DLL load failed while importing cuda_utils
- Delete the cache folders (a small script to do this is sketched after this list):
  C:\Users\<your username>\.triton\cache\
  C:\Users\<your username>\AppData\Local\Temp\torchinductor_<your username>\
  - You may also need to delete these cache folders when you change the Python version, install another version of Triton, or change the C compiler or CUDA
  - It's ok if these folders do not exist on your computer. The first folder exists only if you have used triton.jit (which is used by packages like SageAttention), and the second folder exists only if you have used torch.compile
- Double check your Python version: You can run Get-Command -All python in PowerShell (or where python in cmd) to see the installation path of Python, and python --version to see its version. If you see multiple Python installations, make sure that you install and run everything from the first one
- If you're using ComfyUI with embedded Python, make sure that you copy-pasted the folders include and libs from the correct version of Python
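If you prefer a script, here is a minimal sketch that deletes both cache folders, assuming the default locations listed above:

import os
import shutil

cache_dirs = [
    os.path.expanduser(r"~\.triton\cache"),
    os.path.join(os.environ["TEMP"], "torchinductor_" + os.environ["USERNAME"]),
]
for path in cache_dirs:
    if os.path.isdir(path):
        shutil.rmtree(path)
        print("Deleted:", path)
    else:
        print("Not found (that's ok):", path)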
SystemError: PY_SSIZE_T_CLEAN macro must be defined for '#' formats
You also need to delete the cache folders above.
dlltracer
If the above still doesn't work, you may try:
- Install dlltracer in the same Python environment
- In an administrator PowerShell, run the following script:
import sys

import dlltracer

print("import torch")
with dlltracer.Trace(out=sys.stdout):
    import torch

print("import triton")
with dlltracer.Trace(out=sys.stdout):
    import triton

print("begin definition")
with dlltracer.Trace(out=sys.stdout):
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        block_start = pid * BLOCK_SIZE
        offsets = block_start + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        output = x + y
        tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

print("begin torch add")
with dlltracer.Trace(out=sys.stdout):
    a = torch.rand(3, device="cuda")
    b = a + a

print("begin jit add")
with dlltracer.Trace(out=sys.stdout):
    b_compiled = add(a, a)

print(b_compiled - b)
print("If you see tensor([0., 0., 0.], device='cuda:0'), then it works")
- Open an issue. Please show the command you use to run this test, and the full error log
If it shows PermissionError: [WinError 5] failed to start trace (0x00000005), then you need to make sure to run it as administrator.
(Security reminder: You don't need the administrator privilege to run Triton and other usual Python code. It's only dlltracer that needs it.)
If it shows Failed \Device\...\cuda_utils.pyd, please also:
- Find cuda_utils.pyd at this location
- Use DependenciesGui (or similar tools) to check what DLLs this cuda_utils.pyd depends on, and send a screenshot (or other related information) in the issue
Known issues
Windows file path length limit (260) causes compilation failure
torch.compile may create temp files with very long filenames, causing errors like:
File "C:\...\Lib\site-packages\torch\_inductor\runtime\triton_heuristics.py", line 537, in _precompile_config
binary = triton.compile(*compile_args, **compile_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\...\Lib\site-packages\triton\compiler\compiler.py", line 288, in compile
metadata_group[ir_filename] = fn_cache_manager.put(next_module, ir_filename)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\...\Lib\site-packages\triton\runtime\cache.py", line 122, in put
with open(temp_path, mode) as f:
^^^^^^^^^^^^^^^^^^^^^
torch._inductor.exc.InductorError: FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\<your username>\\AppData\\Local\\Temp\\torchinductor_<your username>\\triton\\0\\...LONG...FILE...NAME...'
Or errors like:
[WinError 206] The filename or extension is too long
The solution is to enable Windows' long path support. A reboot is required after the modification.
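You can check whether long path support is already enabled by reading the registry value that controls it (a read-only check; enabling it still requires administrator rights and a reboot):

import winreg

key = winreg.OpenKey(
    winreg.HKEY_LOCAL_MACHINE,
    r"SYSTEM\CurrentControlSet\Control\FileSystem",
)
value, _ = winreg.QueryValueEx(key, "LongPathsEnabled")
print("Long path support enabled:", bool(value))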
fp8 is not supported on RTX 30xx and older GPUs
If you see errors like
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
CompilationError: at 8:11:
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 196608
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = tl.full([XBLOCK], True, tl.int1)
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), None)
tmp1 = tmp0.to(tl.float32)
^
and in the full error log you find
AssertionError: fp8e4nv data type is not supported on CUDA arch < 89
then it's because in Triton, fp8 only works on Nvidia GPUs with sm >= 89, such as RTX 40xx and newer. You may disable fp8 in the node or the code.
This is not Windows-specific. It should be possible to emulate fp8 on older hardware like XLA does (see https://github.com/openxla/xla/discussions/23124 ), even if there's no time or memory improvement compared to fp16. Help wanted if anyone has time for this.
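If you control the code, you can guard against this by checking the compute capability before choosing fp8 (a sketch; (8, 9) corresponds to sm 89):

import torch

# fp8 in Triton requires sm >= 89, i.e. RTX 40xx and newer
if torch.cuda.get_device_capability(0) >= (8, 9):
    dtype = torch.float8_e4m3fn
else:
    dtype = torch.float16  # fall back on older GPUs
print("Using dtype:", dtype)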
Error with os.rename
If you see errors like
FileExistsError: [WinError 183] Cannot create a file when that file already exists: ...
then you need the fix in https://github.com/pytorch/pytorch/issues/138211
This has been fixed since PyTorch 2.6 .
Error with model offloading
If you're using ComfyUI, the model is compiled, and you see error messages like
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
then you may use --gpu-only when launching ComfyUI to disable model offloading, see https://github.com/woct0rdho/triton-windows/issues/61
No module named 'triton.ops'
triton.ops was removed in Triton 3.1, and this error means some of your Python packages are outdated (most likely bitsandbytes), see https://github.com/woct0rdho/triton-windows/issues/65
Build from source
See BUILD.md. This is for developers.