Sequence Parallel Fused Kernel Not Getting Built
Hi, I followed the instructions given here to build and install the latest xformers version. Specifically, I ran the command below, but the sequence_parallel_fused kernels do not appear to have been built:
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
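Since the runtime error further down explicitly suggests `python setup.py develop`, one thing worth trying is an editable build from a full source checkout instead of the `pip install git+…` one-liner, so the C++/CUDA extensions are compiled against the local toolchain. A minimal sketch, assuming a CUDA toolkit matching the installed PyTorch is present (the `TORCH_CUDA_ARCH_LIST` value is just an example for an A100):

```shell
# Build xformers from source so its C++/CUDA extensions are compiled locally.
# Assumes a CUDA toolkit compatible with the installed PyTorch.
git clone --recursive https://github.com/facebookresearch/xformers.git
cd xformers
# Optionally restrict compilation to the local GPU architecture
# (8.0 = A100) to shorten build time; read by PyTorch's extension builder.
TORCH_CUDA_ARCH_LIST="8.0" pip install -v -e .
```

Whether this actually enables the sequence_parallel_fused ops may also depend on the build environment detected by setup.py, so the verbose (`-v`) build log is worth checking for messages about skipped extensions.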
Here is the list of kernels reported as built/available:
root@d4868e6910da:/xformers/xformers# python -m xformers.info
Unable to find python bindings at /usr/local/dcgm/bindings/python3. No data will be captured.
xFormers 0.0.25+075a472.d20240208
memory_efficient_attention.cutlassF: available
memory_efficient_attention.cutlassB: available
memory_efficient_attention.decoderF: available
[email protected]: available
[email protected]: available
memory_efficient_attention.smallkF: available
memory_efficient_attention.smallkB: available
memory_efficient_attention.tritonflashattF: unavailable
memory_efficient_attention.tritonflashattB: unavailable
memory_efficient_attention.triton_splitKF: available
indexing.scaled_index_addF: available
indexing.scaled_index_addB: available
indexing.index_select: available
sequence_parallel_fused.write_values: unavailable
sequence_parallel_fused.wait_values: unavailable
sequence_parallel_fused.cuda_memset_32b_async: unavailable
sp24.sparse24_sparsify_both_ways: available
sp24.sparse24_apply: available
sp24.sparse24_apply_dense_output: available
sp24._sparse24_gemm: available
[email protected]: available
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
is_triton_available: True
pytorch.version: 2.2.0a0+81ea7a4
pytorch.cuda: available
gpu.compute_capability: 8.0
gpu.name: NVIDIA A100 80GB PCIe
dcgm_profiler: unavailable
build.info: available
build.cuda_version: 1203
build.python_version: 3.10.12
build.torch_version: 2.2.0a0+81ea7a4
build.env.TORCH_CUDA_ARCH_LIST: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
build.env.XFORMERS_BUILD_TYPE: None
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: None
build.env.XFORMERS_PACKAGE_FROM: None
build.nvcc_version: 12.3.107
source.privacy: open source
When I try to run `python3 xformers/benchmarks/benchmark_sequence_parallel_fused.py --world-size 2 llama_07B_FFN ag`, I encounter the following errors:
LAUNCHED
RANK 0 started
RANK 1 started
Sizes: (2x16384)x(2x5504)x4096
Sizes: (2x16384)x(2x5504)x4096
Process SpawnProcess-2:
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/xformers/xformers/xformers/benchmarks/benchmark_sequence_parallel_fused.py", line 300, in run_one_rank
run_fused_ag()
File "/xformers/xformers/xformers/benchmarks/benchmark_sequence_parallel_fused.py", line 222, in run_fused_ag
gathered_outputs_fused = fused_allgather_and_linear(
File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 870, in fused_allgather_and_linear
fused_allgather_and_anything(
File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 942, in fused_allgather_and_anything
obj.allgather_and_linear(
File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 413, in allgather_and_linear
WaitValues.OPERATOR(
File "/usr/local/lib/python3.10/dist-packages/xformers/ops/common.py", line 20, in no_such_operator
raise RuntimeError(
RuntimeError: No such operator xformers::wait_values - did you forget to build xformers with `python setup.py develop`?
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/xformers/xformers/xformers/benchmarks/benchmark_sequence_parallel_fused.py", line 300, in run_one_rank
run_fused_ag()
File "/xformers/xformers/xformers/benchmarks/benchmark_sequence_parallel_fused.py", line 222, in run_fused_ag
gathered_outputs_fused = fused_allgather_and_linear(
File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 870, in fused_allgather_and_linear
fused_allgather_and_anything(
File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 942, in fused_allgather_and_anything
obj.allgather_and_linear(
File "/usr/local/lib/python3.10/dist-packages/xformers/ops/sequence_parallel_fused_ops.py", line 413, in allgather_and_linear
WaitValues.OPERATOR(
File "/usr/local/lib/python3.10/dist-packages/xformers/ops/common.py", line 20, in no_such_operator
raise RuntimeError(
RuntimeError: No such operator xformers::wait_values - did you forget to build xformers with `python setup.py develop`?
[rank1]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Rank 0 exited with 1
Rank 1 exited with 1
JOINED
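For context on the failure mode: xformers resolves its custom ops lazily, and when an op was not compiled into the extension, the lookup returns a stub that raises only at call time (the real code is in xformers/ops/common.py). A minimal sketch of that pattern, with a hypothetical `get_operator` helper that is illustrative rather than the actual xformers API:

```python
def get_operator(library: str, name: str):
    """Return torch.ops.<library>.<name>, or a stub that raises at call time.

    Illustrative sketch of the fallback behind the
    "No such operator xformers::wait_values" error; not the actual
    xformers API.
    """

    def no_such_operator(*args, **kwargs):
        raise RuntimeError(
            f"No such operator {library}::{name} - did you forget to "
            f"build xformers with `python setup.py develop`?"
        )

    try:
        import torch

        # Raises AttributeError if the compiled extension never
        # registered this op (e.g. the kernel was skipped at build time).
        return getattr(getattr(torch.ops, library), name)
    except (ImportError, AttributeError):
        return no_such_operator


op = get_operator("xformers", "wait_values")
```

This is why `python -m xformers.info` can report the ops as "unavailable" while import still succeeds, and the benchmark only fails once `WaitValues.OPERATOR` is actually invoked.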