
Conditionalize sp24.py for Windows support of cusparseLt

Open NeedsMoar opened this issue 4 months ago • 6 comments

🚀 Feature

I was sad to see how many things in python -m xformers.info weren't enabled on Windows, so I set out to do something about it.

Literally all that needs to be done is an expansion of the function returning the library name string. I've filed a request that torch include the DLL if they weren't planning on it, since NVIDIA offers it but it isn't a default item in the CUDA toolkit for some reason. I'm including the code here rather than opening a pull request because I don't know whether they're adding the DLL (although it shouldn't actually error if it's missing, just as the original version, which tries to find a .so file on Windows (lol), doesn't).

Motivation

Windows needs love too!

Pitch

Import platform, then turn _get_cusparselt_lib() in ops/sp24.py into the following code or similar (the other imports shown below are already at the top of sp24.py). There's no point in a dylib branch unless Apple massively changes its strategy of not letting any sort of serious accelerator hardware near its computers.

import glob
import platform
from pathlib import Path
from typing import Optional

import torch

def _get_cusparselt_lib() -> Optional[str]:
    # Look for the cusparseLt library that sits alongside torch's own binaries.
    if platform.system() == "Windows":
        libs = glob.glob(str(Path(torch._C.__file__).parent / "lib" / "cusparseLt.dll"))
    else:
        libs = glob.glob(
            str(Path(torch._C.__file__).parent / "lib" / "libcusparseLt*.so.0")
        )
    # Exactly one match is unambiguous; anything else is treated as "not found".
    if len(libs) != 1:
        return None
    return libs[0]
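
A quick way to see what that lookup will actually find on a given install (a minimal sketch, reusing the same torch lib path as above):

from pathlib import Path

import torch

# Standard pip layout assumed (site-packages/torch/lib); adjust if torch is installed elsewhere.
lib_dir = Path(torch._C.__file__).parent / "lib"
print(sorted(p.name for p in lib_dir.glob("*cusparseLt*")))  # expect ['cusparseLt.dll'] on Windows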

After that I dropped a cusparseLt.dll downloaded from NVIDIA into my site-packages/torch/lib directory, then decided to keep going, found somebody building wheels of Triton for Win64 / Python 3.11, and installed those too. Now my xformers.info shows:

C:\Programs\ComfyUI>python -m xformers.info
Unable to find python bindings at /usr/local/dcgm/bindings/python3. No data will be captured.
xFormers 0.0.24
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.decoderF:               available
memory_efficient_attention.flshattF@<version>:     available
memory_efficient_attention.flshattB@<version>:     available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
memory_efficient_attention.triton_splitKF:         available
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
sequence_parallel_fused.write_values:              unavailable
sequence_parallel_fused.wait_values:               unavailable
sequence_parallel_fused.cuda_memset_32b_async:     unavailable
sp24.sparse24_sparsify_both_ways:                  available
sp24.sparse24_apply:                               available
sp24.sparse24_apply_dense_output:                  available
sp24._sparse24_gemm:                               available
sp24._cslt_sparse_mm@<version>:                    available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               True
pytorch.version:                                   2.2.0+cu121
pytorch.cuda:                                      available
gpu.compute_capability:                            8.9
gpu.name:                                          NVIDIA GeForce RTX 4090
dcgm_profiler:                                     unavailable
build.info:                                        available
build.cuda_version:                                1201
build.python_version:                              3.11.7
build.torch_version:                               2.2.0+cu121
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0+PTX 9.0
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.24
build.nvcc_version:                                12.1.66
source.privacy:                                    open source

I think the NCCL library sequence_parallel seems to use actually isn't available for Windows from NVIDIA, so that one has to stay as it is. The last item would be installing DCGM, if I felt like going full "gotta catch 'em all" mode on feature enablement. I'll just assume Triton's flash attention isn't used when flash_attention-2 is, or that it's the same implementation and the person building the Windows wheels didn't bother with it since xformers already builds it.

Alternatives

Don't not do this?

Additional context

Feature request @ pytorch

Edit: Fixed the title and a cut-off sentence.

NeedsMoar avatar Feb 21 '24 17:02 NeedsMoar

Hi, thanks for opening this post! At this point, Windows support is best effort (it's not something we need internally). If you can make things compatible with Windows, we would gladly accept a PR :)

Regarding 2:4 sparsity, the check for the library version is there because some old PyTorch nightlies used to ship an old version of cusparseLt. But in any case we need PyTorch to be built with cusparseLt support - I'm not sure whether that's the case on Windows (cc @jcaip). That being said, I believe the CUTLASS backend should work for 2:4 sparsity on Windows in principle, as it does not require anything from the PyTorch side - it's a bit slower though.

I think the nccl library sequence_parallel

I don't see a reason why fused sequence parallel wouldn't be available on Windows. @lw any idea? Do we disable the build on Windows by mistake? The fastest implementation uses Triton, which does not exist on Windows, but we should have fallbacks.

I'll just assume triton's flash attention isn't used when flash_attention-2 is

That's correct. Also, enabling Triton to work on Windows seems like quite a big project to me...

danthe3rd avatar Feb 22 '24 09:02 danthe3rd

Fused sequence parallel by itself might work on Windows (I'm not aware of any blockers?) but:

  • its fused Triton kernels won't be available, as @danthe3rd mentioned
  • its fallback path uses NCCL, hence that won't be available either
  • for bootstrapping we need a ProcessGroup, but it should be possible to use Gloo here, which should work on Windows?
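
For what it's worth, bootstrapping a Gloo ProcessGroup is just the standard torch.distributed calls; a minimal single-process sketch (the rendezvous address/port are placeholders, nothing xformers-specific):

import os

import torch.distributed as dist

# env:// rendezvous needs these two variables; values here are placeholders for a single node.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Gloo is the backend that actually ships on Windows; NCCL is Linux-only.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print(dist.get_backend())  # -> gloo
dist.destroy_process_group()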

This said, I'm not sure why the C++ CUDA kernels for fused seqpar appear unavailable in your build. I don't believe we explicitly disable them, but maybe they need to be explicitly enabled in some way?

lw avatar Feb 22 '24 12:02 lw

I believe Windows support for cusparselt was added here. However, I have not tried this personally. CUTLASS should be supported.

Also, FYI, I plan to add better cusparselt detection / versioning support to pytorch itself as part of 2.3, so we could land that and remove this function entirely.
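
If that lands, the glob could eventually be replaced by asking torch directly; a defensive sketch, since the exact attribute names torch 2.3+ will expose are an assumption here:

import torch

# Probe defensively: torch.backends.cusparselt and its version() helper are hypothetical names;
# whatever detection actually lands in torch 2.3+ may look different.
cusparselt = getattr(torch.backends, "cusparselt", None)
if cusparselt is not None and hasattr(cusparselt, "version"):
    print("torch-side cusparseLt version:", cusparselt.version())
else:
    print("this torch build exposes no cusparseLt introspection")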

jcaip avatar Feb 22 '24 14:02 jcaip

@lw

Fused sequence parallel by itself might work on Windows (I'm not aware of any blockers?) but:

* its fused Triton kernels won't be available, as @danthe3rd mentioned

Somebody got Triton building, impressively. See https://github.com/wkpark/triton/ for artifacts. I think they have a working CI for it, since the downloads are build artifacts and a process was visible. That's how I'm able to show memory_efficient_attention.triton_splitKF as available above. I was also doing local builds of flash_attn-2 to get that working while the xformers wheels weren't including it, since everything is more or less drop-in / lazy-load for Torch and xformers. Things that interop need to build against torch, but it just tries to load whatever it can find, so it can pull in a lot without any real intervention - the same goes for xformers loading flash_attn-2... The NVidia libs are all just standard though.

I hadn't messed with trying to build Triton because flash attention 2 is the main benefit for stable diffusion (which isn't even the point of having this GPU hardware, just a distraction). You might be able to tell from the linked repo whether or not extensive changes were needed. In my experience porting early versions of LLVM around, minimal real changes are needed for C++ projects that don't use a mess of GCC extensions or the non-extant "AT&T syntax" for Intel assembly, and that already have a modern build system.

I was attempting to get NVidia's TransformerEngine building for Windows instead, to do autocast to / from fp8 with the hardware that was (apparently) included in the Ada Lovelace cards as well as Hopper. On a 4090, sparse inference at up to ~1.2 PFLOPS (which should be usable as soon as the inference starts converging) is kind of attractive. Unfortunately that build is a nightmare of directly called CMake configs mixed with setup.py and Ninja, not using the torch .cmake files made for that purpose, etc. I used to fix those kinds of messes for a living; I think I'll let NVidia do it this time. ;-)

I haven't figured out how to fully test whether torch is recognizing Triton yet. It lazy-loads everything else, and that was the whole point of somebody building it, so I assume it does.
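
One way to at least smoke-test the Triton wheel itself (a sketch; this exercises the Triton compiler and runtime directly, not torch's integration) is to compile and run a trivial @triton.jit kernel:

# Minimal Triton smoke test: if this trivial JIT'd kernel compiles and runs on the GPU,
# the wheel is functional at the compiler/runtime level.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

n = 4096
x, y = torch.randn(n, device="cuda"), torch.randn(n, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
print(torch.allclose(out, x + y))  # True if the kernel ran correctly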

* its fallback path uses NCCL, hence that won't be available either

I think multi-GPU on Windows has always gone through either their software CrossFire-style mechanism (since they originally made all their money from gamers) or NVLink bridges, back when those shipped free with motherboards instead of being $300 add-ons.

* for bootstrapping we need a ProcessGroup, but it should be possible to use Gloo here, which should work on Windows?

Yes, torch.distributed appears to use it. I'm guessing the native CUDA backend needs NCCL regardless of whether other methods are available. If I track down a DLL anywhere, I can drop it in and see if it shows up.

It took some tracking down, but I managed to dump the availability of various things in torch. Note that I'm not in need of the distributed functions; I was only pointing out that they were the only non-trivial things still not enabled.

>>> torch.distributed.is_available()
True
>>> torch.distributed.is_gloo_available()
True
>>> torch.distributed.is_backend_available("Gloo")
True
>>> torch.distributed.is_nccl_available()
False
>>> torch.distributed.is_backend_available("cuda")
False
>>> torch.distributed.is_ucc_available()
False
>>> torch.distributed.is_mpi_available()
False

This said, I'm not sure why the C++ CUDA kernels for fused seqpar appear unavailable in your build. I don't believe we explicitly disable them, but maybe they need to be explicitly enabled in some way?

Might be the lack of torch support? Otherwise I didn't spot anything but I wasn't looking extensively at that part.

NeedsMoar avatar Feb 29 '24 16:02 NeedsMoar

Hi, thanks for opening this post! At this point, Windows support is best effort [...] If you can make things compatible with Windows, we would gladly accept a PR :) [...] But in any case we need PyTorch to be built with cusparseLt support - I'm not sure whether that's the case on Windows (cc @jcaip).

I'll see how it pans out; it's a minor change once torch 2.3 lands support, if you still need it. I just tested in-place; since it's all lazy-loaded it didn't need a build.

I'll just assume triton's flash attention isn't used when flash_attention-2 is

That's correct.

Good to know.

Also, enabling Triton to work on Windows seems like quite a big project to me...

See the previous post. I'd thought the same thing (at least LLVM builds really fast on Windows), but somebody did it. :-)

NeedsMoar avatar Feb 29 '24 16:02 NeedsMoar

xFormers 0.0.25
memory_efficient_attention.ckF:                    unavailable
memory_efficient_attention.ckB:                    unavailable
memory_efficient_attention.ck_decoderF:            unavailable
memory_efficient_attention.ck_splitKF:             unavailable
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.decoderF:               available
memory_efficient_attention.flshattF@<version>:     available
memory_efficient_attention.flshattB@<version>:     available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.triton_splitKF:         available
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
sequence_parallel_fused.write_values:              available
sequence_parallel_fused.wait_values:               available
sequence_parallel_fused.cuda_memset_32b_async:     available
sp24.sparse24_sparsify_both_ways:                  available
sp24.sparse24_apply:                               available
sp24.sparse24_apply_dense_output:                  available
sp24._sparse24_gemm:                               available
sp24._cslt_sparse_mm@<version>:                    available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               True
pytorch.version:                                   2.2.1+cu121
pytorch.cuda:                                      available
gpu.compute_capability:                            8.9
gpu.name:                                          NVIDIA GeForce RTX 4090
dcgm_profiler:                                     unavailable
build.info:                                        available
build.cuda_version:                                1201
build.hip_version:                                 None
build.python_version:                              3.11.8
build.torch_version:                               2.2.1+cu121
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0+PTX 9.0
build.env.PYTORCH_ROCM_ARCH:                       None
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.25
build.nvcc_version:                                12.1.66
source.privacy:                                    open source

Nice job to whoever got this working, and to the Windows Triton builder. Cheers to AMD as well for providing flash attention implementations for HIP (the only "unavailables" now), even if Windows doesn't have a build of torch that can use them yet - it's practically a requirement for most of the video models. I had pretty much given up on anything progressing on their side last year, but I've been too lazy to sell the 7900XTX while all the 4090s on the market are being scalped at 20-30% markups, and Houdini can use it for OpenCL to offload from the RTX 4090, so Karma XPU can keep going in the viewport while simulations are baking, even if I can't render with it. Now they need to update ProRender for Houdini 20 ;-)

NeedsMoar avatar Mar 23 '24 14:03 NeedsMoar