Problems installing xFormers in Kaggle notebook
❓ Questions and Help
I've been trying to install xFormers in an auxiliary Kaggle notebook (in the /kaggle/working folder), which will be used as a utility script by another notebook that makes competition submissions. This is necessary because Kaggle does not yet have xFormers in its list of available Python packages. The version I wanted to install was the CUDA 11.8 build.
After the install and after setting up the auxiliary notebook, I tried importing some components (e.g. MLP, fmha) and got the following error message (truncated, as it was quite long, but this is the essential bit):
File /kaggle/usr/lib/ubc_ocean_packages/triton/language/math.py:4
1 import functools
2 import os
----> 4 from . import core
7 @functools.lru_cache()
8 def libdevice_path():
9 import torch
File /kaggle/usr/lib/ubc_ocean_packages/triton/language/core.py:1376
1370 rvalue, rindices = reduce((input, index), axis, combine_fn,
1371 _builder=_builder, _generator=_generator)
1372 return rvalue, rindices
1375 @jit
-> 1376 def minimum(x, y):
1377 """
1378 Computes the element-wise minimum of :code:`x` and :code:`y`.
1379
(...)
1383 :type other: Block
1384 """
1385 return where(x < y, x, y)
File /kaggle/usr/lib/ubc_ocean_packages/triton/runtime/jit.py:542, in jit(fn, version, do_not_specialize, debug, noinline, interpret)
534 return JITFunction(
535 fn,
536 version=version,
(...)
539 noinline=noinline,
540 )
541 if fn is not None:
--> 542 return decorator(fn)
544 else:
545 return decorator
File /kaggle/usr/lib/ubc_ocean_packages/triton/runtime/jit.py:534, in jit.<locals>.decorator(fn)
532 return GridSelector(fn)
533 else:
--> 534 return JITFunction(
535 fn,
536 version=version,
537 do_not_specialize=do_not_specialize,
538 debug=debug,
539 noinline=noinline,
540 )
File /kaggle/usr/lib/ubc_ocean_packages/triton/runtime/jit.py:433, in JITFunction.__init__(self, fn, version, do_not_specialize, debug, noinline)
431 self.constexprs = [self.arg_names.index(name) for name, ty in self.__annotations__.items() if 'constexpr' in ty]
432 # launcher
--> 433 self.run = self._make_launcher()
434 # re-use docs of wrapped function
435 self.__doc__ = fn.__doc__
File /kaggle/usr/lib/ubc_ocean_packages/triton/runtime/jit.py:388, in JITFunction._make_launcher(self)
317 args_signature = ', '.join(name if dflt == inspect._empty else f'{name} = {dflt}' for name, dflt in zip(self.arg_names, self.arg_defaults))
319 src = f"""
320 def {self.fn.__name__}({args_signature}, grid=None, num_warps=4, num_stages=3, extern_libs=None, stream=None, warmup=False, device=None, device_type=None):
321 from ..compiler import compile, CompiledKernel
(...)
386 return None
387 """
--> 388 scope = {"version_key": version_key(),
389 "get_cuda_stream": get_cuda_stream,
390 "self": self,
391 "_spec_of": self._spec_of,
392 "_key_of": self._key_of,
393 "_device_of": self._device_of,
394 "_pinned_memory_of": self._pinned_memory_of,
395 "cache": self.cache,
396 "__spec__": __spec__,
397 "get_backend": get_backend,
398 "get_current_device": get_current_device,
399 "set_current_device": set_current_device}
400 exec(src, scope)
401 return scope[self.fn.__name__]
File /kaggle/usr/lib/ubc_ocean_packages/triton/runtime/jit.py:120, in version_key()
118 contents += [hashlib.md5(f.read()).hexdigest()]
119 # ptxas version
--> 120 ptxas = path_to_ptxas()[0]
121 ptxas_version = hashlib.md5(subprocess.check_output([ptxas, "--version"])).hexdigest()
122 return '-'.join(TRITON_VERSION) + '-' + ptxas_version + '-' + '-'.join(contents)
File /kaggle/usr/lib/ubc_ocean_packages/triton/common/backend.py:114, in path_to_ptxas()
112 ptxas_bin = ptxas.split(" ")[0]
113 if os.path.exists(ptxas_bin) and os.path.isfile(ptxas_bin):
--> 114 result = subprocess.check_output([ptxas_bin, "--version"], stderr=subprocess.STDOUT)
115 if result is not None:
116 version = re.search(r".*release (\d+\.\d+).*", result.decode("utf-8"), flags=re.MULTILINE)
File /opt/conda/lib/python3.10/subprocess.py:421, in check_output(timeout, *popenargs, **kwargs)
418 empty = b''
419 kwargs['input'] = empty
--> 421 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
422 **kwargs).stdout
File /opt/conda/lib/python3.10/subprocess.py:503, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
500 kwargs['stdout'] = PIPE
501 kwargs['stderr'] = PIPE
--> 503 with Popen(*popenargs, **kwargs) as process:
504 try:
505 stdout, stderr = process.communicate(input, timeout=timeout)
File /opt/conda/lib/python3.10/subprocess.py:971, in Popen.__init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask, pipesize)
967 if self.text_mode:
968 self.stderr = io.TextIOWrapper(self.stderr,
969 encoding=encoding, errors=errors)
--> 971 self._execute_child(args, executable, preexec_fn, close_fds,
972 pass_fds, cwd, env,
973 startupinfo, creationflags, shell,
974 p2cread, p2cwrite,
975 c2pread, c2pwrite,
976 errread, errwrite,
977 restore_signals,
978 gid, gids, uid, umask,
979 start_new_session)
980 except:
981 # Cleanup if the child failed starting.
982 for f in filter(None, (self.stdin, self.stdout, self.stderr)):
File /opt/conda/lib/python3.10/subprocess.py:1863, in Popen._execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, gid, gids, uid, umask, start_new_session)
1861 if errno_num != 0:
1862 err_msg = os.strerror(errno_num)
-> 1863 raise child_exception_type(errno_num, err_msg, err_filename)
1864 raise child_exception_type(err_msg)
PermissionError: [Errno 13] Permission denied: '/kaggle/usr/lib/ubc_ocean_packages/triton/common/../third_party/cuda/bin/ptxas'
Here are my questions:
1- Is there a simple way to change the permissions on 'ptxas' to make this work? (I tried chmod 600, for example, and it didn't work; a sketch of the kind of change I attempted is after these questions.)
2- As Triton seems to be the culprit, I would have no problem doing without it. In fact, I worked on my model locally on my Windows PC and have been "forcing" xFormers to use CUTLASS whenever I needed memory_efficient_attention (with op=(fmha.cutlass.FwOp, fmha.cutlass.BwOp) for 8 heads; 2, 4 and 16 heads didn't require specifying op, but with 8 heads I got an error about the size of my features dimension without it). Is there a specific instruction I can use in a Linux environment (such as Kaggle's) to force xFormers to use CUTLASS?
3- I also realise that there could be a compatibility issue between the Kaggle GPU environment (a P100 with CUDA 11.4) and the currently available xFormers builds (CUDA 11.8/12.1). If that were the case, would there still be older versions of xFormers compatible with this type of environment (even without Triton capability)?
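For concreteness, the kind of change question 1 refers to would look like the sketch below (the path is the one from the PermissionError above, with the '..' segment resolved). Note that chmod 600 is rw-------, i.e. it still leaves the execute bit unset; 0o755 would grant it, assuming the utility-script mount allows chmod at all.

import os
import subprocess

# Path reported in the PermissionError above (normalized).
ptxas = "/kaggle/usr/lib/ubc_ocean_packages/triton/third_party/cuda/bin/ptxas"

# 0o600 (rw-------) has no execute bit, so Popen keeps raising EACCES;
# 0o755 (rwxr-xr-x) adds the execute permission subprocess needs.
# This may still fail if the directory is mounted read-only or noexec.
os.chmod(ptxas, 0o755)

# Quick check that the binary can now be executed.
print(subprocess.check_output([ptxas, "--version"]).decode())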
Thanks in advance for your answers and guidance!
Hi, can you post the entire stack trace? In the one you show, there is nothing from xFormers. Can you also post the output of the following command?
python -m xformers.info
As a workaround, you can disable Triton in xFormers entirely by setting the environment variable XFORMERS_FORCE_DISABLE_TRITON=1.
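In a notebook, the safest place to set it is before the first xformers import; a minimal sketch of such a cell (the variable name is the one above, the rest is ordinary Python):

import os

# Set before importing xformers so the Triton code paths are never taken.
os.environ["XFORMERS_FORCE_DISABLE_TRITON"] = "1"

import xformers.ops as xops  # Triton-backed ops are now skipped entirely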
I have rebuilt the notebook, installing PyTorch and xFormers with the following:
!pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --target=/kaggle/working/ --index-url https://download.pytorch.org/whl/cu118
!pip3 install xformers --target=/kaggle/working/ --index-url https://download.pytorch.org/whl/cu118
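A quick way to sanity-check the install from the consuming notebook (the expected versions are simply the ones pinned in the pip commands above):

import torch
import xformers

print(torch.__version__)          # expected: 2.0.1+cu118
print(torch.cuda.is_available())  # expected: True on the Kaggle GPU instance
print(xformers.__version__)       # expected: an xFormers +cu118 wheel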
It seemed to work, as I didn't get the error message mentioned above. However, when I tried to run my model, I got this puzzling error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[13], line 37
35 coords = coords.squeeze(0)
36 X = tiles.float().to(device=device, non_blocking=True)
---> 37 y_prob, pred, features = model(X, coords)
38 query_preds.append((image_id.item(), labels[pred.to(device='cpu').item()]))
39 query_features.append(features.view(-1).to(device='cpu'))
File /kaggle/usr/lib/ubc_ocean_packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
Cell In[5], line 227, in WSINet.forward(self, x, coords)
225 def forward(self, x, coords):
226 features = self.encoder(x).unsqueeze(0)
--> 227 features, mask = self.roformer(features, coords)
228 y_prob, y_hat, attention = self.attention(features)
230 return y_prob, y_hat, attention
File /kaggle/usr/lib/ubc_ocean_packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
Cell In[5], line 97, in RoFormerLayer.forward(self, x, coords)
95 q, k = apply_rotary_position_embeddings(self.rope(h, grid_h, grid_w), q, k)
96 q, k, v = q.reshape(bs, n, self.heads, self.head_dim), k.reshape(bs, n, self.heads, self.head_dim), v.reshape(bs, n, self.heads, self.head_dim)
---> 97 att = fmha.memory_efficient_attention(q, k, v, attn_bias=mask, p = self.dropout, op=(fmha.cutlass.FwOp, fmha.cutlass.BwOp))
98 o = self.norm2(h + att.reshape(bs, n, h.size(-1)))
99 ff = self.mlp(o)
File /kaggle/usr/lib/ubc_ocean_packages/xformers/ops/fmha/__init__.py:223, in memory_efficient_attention(query, key, value, attn_bias, p, scale, op)
116 def memory_efficient_attention(
117 query: torch.Tensor,
118 key: torch.Tensor,
(...)
124 op: Optional[AttentionOp] = None,
125 ) -> torch.Tensor:
126 """Implements the memory-efficient attention mechanism following
127 `"Self-Attention Does Not Need O(n^2) Memory" <[http://arxiv.org/abs/2112.05682>`_.](http://arxiv.org/abs/2112.05682%3E%60_.%3C/span%3E)
128
(...)
221 :return: multi-head attention Tensor with shape ``[B, Mq, H, Kv]``
222 """
--> 223 return _memory_efficient_attention(
224 Inputs(
225 query=query, key=key, value=value, p=p, attn_bias=attn_bias, scale=scale
226 ),
227 op=op,
228 )
File /kaggle/usr/lib/ubc_ocean_packages/xformers/ops/fmha/__init__.py:321, in _memory_efficient_attention(inp, op)
316 def _memory_efficient_attention(
317 inp: Inputs, op: Optional[AttentionOp] = None
318 ) -> torch.Tensor:
319 # fast-path that doesn't require computing the logsumexp for backward computation
320 if all(x.requires_grad is False for x in [inp.query, inp.key, inp.value]):
--> 321 return _memory_efficient_attention_forward(
322 inp, op=op[0] if op is not None else None
323 )
325 output_shape = inp.normalize_bmhk()
326 return _fMHA.apply(
327 op, inp.query, inp.key, inp.value, inp.attn_bias, inp.p, inp.scale
328 ).reshape(output_shape)
File /kaggle/usr/lib/ubc_ocean_packages/xformers/ops/fmha/__init__.py:339, in _memory_efficient_attention_forward(inp, op)
337 op = _dispatch_fw(inp, False)
338 else:
--> 339 _ensure_op_supports_or_raise(ValueError, "memory_efficient_attention", op, inp)
341 out, *_ = op.apply(inp, needs_gradient=False)
342 return out.reshape(output_shape)
File /kaggle/usr/lib/ubc_ocean_packages/xformers/ops/fmha/dispatch.py:39, in _ensure_op_supports_or_raise(exc_type, name, op, inp)
37 if not reasons:
38 return
---> 39 raise exc_type(
40 f"""Operator `{name}` does not support inputs:
41 {textwrap.indent(_format_inputs_description(inp), ' ')}
42 {_format_not_supported_reasons(op, reasons)}"""
43 )
ValueError: Operator `memory_efficient_attention` does not support inputs:
query : shape=(1, 7040, 8, 96) (torch.float32)
key : shape=(1, 7040, 8, 96) (torch.float32)
value : shape=(1, 7040, 8, 96) (torch.float32)
attn_bias : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalMask'>
p : 0.25
`cutlassF` is not supported because:
xFormers wasn't build with CUDA support
operator wasn't built - see `python -m xformers.info` for more info
Here is the output from xformers.info:
!python -m xformers.info
xFormers 0.0.22.post7+cu118
memory_efficient_attention.cutlassF: unavailable
memory_efficient_attention.cutlassB: unavailable
memory_efficient_attention.decoderF: unavailable
memory_efficient_attention.flshattF@…: unavailable
memory_efficient_attention.flshattB@…: unavailable
memory_efficient_attention.smallkF: unavailable
memory_efficient_attention.smallkB: unavailable
memory_efficient_attention.tritonflashattF: unavailable
memory_efficient_attention.tritonflashattB: unavailable
memory_efficient_attention.triton_splitKF: unavailable
indexing.scaled_index_addF: available
indexing.scaled_index_addB: available
indexing.index_select: available
swiglu.dual_gemm_silu: unavailable
swiglu.gemm_fused_operand_sum: unavailable
swiglu.fused.p.cpp: not built
is_triton_available: True
pytorch.version: 2.0.1+cu118
pytorch.cuda: available
gpu.compute_capability: 6.0
gpu.name: Tesla P100-PCIE-16GB
build.info: available
build.cuda_version: 1108
build.python_version: 3.10.13
build.torch_version: 2.1.0+cu118
build.env.TORCH_CUDA_ARCH_LIST: 5.0+PTX 6.0 6.1 7.0 7.5 8.0+PTX 9.0
build.env.XFORMERS_BUILD_TYPE: Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: None
build.env.XFORMERS_PACKAGE_FROM: wheel-v0.0.22.post7
source.privacy: open source
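One detail in that output that can be double-checked directly from Python is the runtime-versus-build version pair (only standard version attributes are used here):

import torch
import xformers

print("runtime torch :", torch.__version__)     # 2.0.1+cu118 in this notebook
print("runtime CUDA  :", torch.version.cuda)    # 11.8 for the +cu118 wheels
print("xformers wheel:", xformers.__version__)  # 0.0.22.post7+cu118
# xformers.info above also reports build.torch_version: 2.1.0+cu118, i.e. the
# wheel was built against a newer torch than the 2.0.1 installed at runtime,
# which could be why the compiled cutlass ops show up as "unavailable".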
I think I found out why xFormers does not install with CUDA support in that environment, despite running on a GPU. I checked the install log from the auxiliary notebook and saw the following warnings:
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
I even tried to pip install cupy-cuda11x before installing xFormers, but I still got the same warnings and the same result. Could it be that the xFormers package does not check for a transitive dependency (i.e. cupy-cuda11x)?
In any case, I have also raised an issue on Kaggle's docker-python repository to see whether they can fix the cupy-cuda11x issue. I'll keep you posted.