
Sequence Parallel Fused Kernel Not Getting Built

Open rajagond opened this issue 4 months ago • 4 comments

Hi, I followed the instructions given here to build and install the latest xformers version. More specifically, I ran the command below, but it seems that the sequence_parallel_fused kernels are not being built.

pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
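
For reference, the missing operators can also be probed directly from Python. The snippet below is just a minimal sketch, assuming that importing xformers.ops is what loads the compiled extension and registers the operators with torch; the operator names are the ones reported by xformers.info further down.

import torch
import xformers.ops  # noqa: F401 -- should trigger loading of the compiled extension

# Probe the three fused sequence-parallel operators by name. If the extension
# was built with them, they show up under the torch.ops.xformers namespace.
for name in ("write_values", "wait_values", "cuda_memset_32b_async"):
    try:
        getattr(torch.ops.xformers, name)
        print(f"xformers::{name}: registered")
    except (AttributeError, RuntimeError):
        print(f"xformers::{name}: NOT registered")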

The following is the list of kernels that were built / are available:

root@d4868e6910da:/xformers/xformers# python -m xformers.info
Unable to find python bindings at /usr/local/dcgm/bindings/python3. No data will be captured.
xFormers 0.0.25+075a472.d20240208
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.decoderF:               available
memory_efficient_attention.flshattF@v2.x.x:        available
memory_efficient_attention.flshattB@v2.x.x:        available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
memory_efficient_attention.triton_splitKF:         available
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
sequence_parallel_fused.write_values:              unavailable
sequence_parallel_fused.wait_values:               unavailable
sequence_parallel_fused.cuda_memset_32b_async:     unavailable
sp24.sparse24_sparsify_both_ways:                  available
sp24.sparse24_apply:                               available
sp24.sparse24_apply_dense_output:                  available
sp24._sparse24_gemm:                               available
sp24._cslt_sparse_mm@0.x.x:                        available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               True
pytorch.version:                                   2.2.0a0+81ea7a4
pytorch.cuda:                                      available
gpu.compute_capability:                            8.0
gpu.name:                                          NVIDIA A100 80GB PCIe
dcgm_profiler:                                     unavailable
build.info:                                        available
build.cuda_version:                                1203
build.python_version:                              3.10.12
build.torch_version:                               2.2.0a0+81ea7a4
build.env.TORCH_CUDA_ARCH_LIST:                    5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
build.env.XFORMERS_BUILD_TYPE:                     None
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   None
build.nvcc_version:                                12.3.107
source.privacy:                                    open source

When I try to run python3 xformers/benchmarks/benchmark_sequence_parallel_fused.py --world-size 2 llama_07B_FFN ag, I encounter the following errors.

LAUNCHED
RANK 0 started
RANK 1 started
Sizes: (2x16384)x(2x5504)x4096
Sizes: (2x16384)x(2x5504)x4096
Process SpawnProcess-2:
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/xformers/xformers/xformers/benchmarks/benchmark_sequence_parallel_fused.py", line 300, in run_one_rank
    run_fused_ag()
  File "/xformers/xformers/xformers/benchmarks/benchmark_sequence_parallel_fused.py", line 222, in run_fused_ag
    gathered_outputs_fused = fused_allgather_and_linear(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 870, in fused_allgather_and_linear
    fused_allgather_and_anything(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 942, in fused_allgather_and_anything
    obj.allgather_and_linear(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 413, in allgather_and_linear
    WaitValues.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/common.py", line 20, in no_such_operator
    raise RuntimeError(
RuntimeError: No such operator xformers::wait_values - did you forget to build xformers with `python setup.py develop`?
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/xformers/xformers/xformers/benchmarks/benchmark_sequence_parallel_fused.py", line 300, in run_one_rank
    run_fused_ag()
  File "/xformers/xformers/xformers/benchmarks/benchmark_sequence_parallel_fused.py", line 222, in run_fused_ag
    gathered_outputs_fused = fused_allgather_and_linear(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 870, in fused_allgather_and_linear
    fused_allgather_and_anything(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 942, in fused_allgather_and_anything
    obj.allgather_and_linear(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 413, in allgather_and_linear
    WaitValues.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/common.py", line 20, in no_such_operator
    raise RuntimeError(
RuntimeError: No such operator xformers::wait_values - did you forget to build xformers with `python setup.py develop`?
[rank1]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Rank 0 exited with 1
Rank 1 exited with 1
JOINED
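
For context, fused_allgather_and_linear (the call that fails above) all-gathers each rank's input shard and applies a linear layer, and, as far as I understand, the write_values / wait_values operators reported unavailable are the signaling kernels that the fused path depends on. The snippet below is only a rough sketch of the unfused equivalent it replaces, with names and shapes of my own choosing rather than the xformers API, assuming torch.distributed is already initialized with a NCCL process group:

import torch
import torch.distributed as dist
import torch.nn.functional as F

def allgather_then_linear(scattered_input: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Unfused reference: gather every rank's shard into one tensor, then matmul.
    # scattered_input is the local shard; weight is (out_features, in_features).
    world_size = dist.get_world_size()
    gathered = scattered_input.new_empty(
        (world_size * scattered_input.shape[0], *scattered_input.shape[1:])
    )
    dist.all_gather_into_tensor(gathered, scattered_input)
    return F.linear(gathered, weight)

So the benchmark invocation itself looks fine; the failure only confirms that the sequence_parallel_fused extension was not compiled into this install.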

rajagond, Feb 08 '24 05:02