
Problems installing xFormers in Kaggle notebook

Open probit2011 opened this issue 7 months ago • 3 comments

❓ Questions and Help

I've been trying to install xFormers in an auxiliary Kaggle notebook (in the /kaggle/working folder) that will serve as a utility script for another notebook used to make competition submissions. This is necessary because Kaggle does not yet have xFormers in its list of available Python packages. The version I wanted to install was the CUDA 11.8 build.

After installing and setting up the auxiliary notebook, I tried importing some components (e.g. MLP, fmha) and got the following error message (truncated, as it was quite long, but this is the essential part):

File /kaggle/usr/lib/ubc_ocean_packages/triton/language/math.py:4
      1 import functools
      2 import os
----> 4 from . import core
      7 @functools.lru_cache()
      8 def libdevice_path():
      9     import torch

File /kaggle/usr/lib/ubc_ocean_packages/triton/language/core.py:1376
   1370     rvalue, rindices = reduce((input, index), axis, combine_fn,
   1371                               _builder=_builder, _generator=_generator)
   1372     return rvalue, rindices
   1375 @jit
-> 1376 def minimum(x, y):
   1377     """
   1378     Computes the element-wise minimum of :code:`x` and :code:`y`.
   1379 
   (...)
   1383     :type other: Block
   1384     """
   1385     return where(x < y, x, y)

File /kaggle/usr/lib/ubc_ocean_packages/triton/runtime/jit.py:542, in jit(fn, version, do_not_specialize, debug, noinline, interpret)
    534         return JITFunction(
    535             fn,
    536             version=version,
   (...)
    539             noinline=noinline,
    540         )
    541 if fn is not None:
--> 542     return decorator(fn)
    544 else:
    545     return decorator

File /kaggle/usr/lib/ubc_ocean_packages/triton/runtime/jit.py:534, in jit.<locals>.decorator(fn)
    532     return GridSelector(fn)
    533 else:
--> 534     return JITFunction(
    535         fn,
    536         version=version,
    537         do_not_specialize=do_not_specialize,
    538         debug=debug,
    539         noinline=noinline,
    540     )

File /kaggle/usr/lib/ubc_ocean_packages/triton/runtime/jit.py:433, in JITFunction.__init__(self, fn, version, do_not_specialize, debug, noinline)
    431 self.constexprs = [self.arg_names.index(name) for name, ty in self.__annotations__.items() if 'constexpr' in ty]
    432 # launcher
--> 433 self.run = self._make_launcher()
    434 # re-use docs of wrapped function
    435 self.__doc__ = fn.__doc__

File /kaggle/usr/lib/ubc_ocean_packages/triton/runtime/jit.py:388, in JITFunction._make_launcher(self)
    317         args_signature = ', '.join(name if dflt == inspect._empty else f'{name} = {dflt}' for name, dflt in zip(self.arg_names, self.arg_defaults))
    319         src = f"""
    320 def {self.fn.__name__}({args_signature}, grid=None, num_warps=4, num_stages=3, extern_libs=None, stream=None, warmup=False, device=None, device_type=None):
    321     from ..compiler import compile, CompiledKernel
   (...)
    386       return None
    387 """
--> 388         scope = {"version_key": version_key(),
    389                  "get_cuda_stream": get_cuda_stream,
    390                  "self": self,
    391                  "_spec_of": self._spec_of,
    392                  "_key_of": self._key_of,
    393                  "_device_of": self._device_of,
    394                  "_pinned_memory_of": self._pinned_memory_of,
    395                  "cache": self.cache,
    396                  "__spec__": __spec__,
    397                  "get_backend": get_backend,
    398                  "get_current_device": get_current_device,
    399                  "set_current_device": set_current_device}
    400         exec(src, scope)
    401         return scope[self.fn.__name__]

File /kaggle/usr/lib/ubc_ocean_packages/triton/runtime/jit.py:120, in version_key()
    118         contents += [hashlib.md5(f.read()).hexdigest()]
    119 # ptxas version
--> 120 ptxas = path_to_ptxas()[0]
    121 ptxas_version = hashlib.md5(subprocess.check_output([ptxas, "--version"])).hexdigest()
    122 return '-'.join(TRITON_VERSION) + '-' + ptxas_version + '-' + '-'.join(contents)

File /kaggle/usr/lib/ubc_ocean_packages/triton/common/backend.py:114, in path_to_ptxas()
    112 ptxas_bin = ptxas.split(" ")[0]
    113 if os.path.exists(ptxas_bin) and os.path.isfile(ptxas_bin):
--> 114     result = subprocess.check_output([ptxas_bin, "--version"], stderr=subprocess.STDOUT)
    115     if result is not None:
    116         version = re.search(r".*release (\d+\.\d+).*", result.decode("utf-8"), flags=re.MULTILINE)

File /opt/conda/lib/python3.10/subprocess.py:421, in check_output(timeout, *popenargs, **kwargs)
    418         empty = b''
    419     kwargs['input'] = empty
--> 421 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    422            **kwargs).stdout

File /opt/conda/lib/python3.10/subprocess.py:503, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    500     kwargs['stdout'] = PIPE
    501     kwargs['stderr'] = PIPE
--> 503 with Popen(*popenargs, **kwargs) as process:
    504     try:
    505         stdout, stderr = process.communicate(input, timeout=timeout)

File /opt/conda/lib/python3.10/subprocess.py:971, in Popen.__init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask, pipesize)
    967         if self.text_mode:
    968             self.stderr = io.TextIOWrapper(self.stderr,
    969                     encoding=encoding, errors=errors)
--> 971     self._execute_child(args, executable, preexec_fn, close_fds,
    972                         pass_fds, cwd, env,
    973                         startupinfo, creationflags, shell,
    974                         p2cread, p2cwrite,
    975                         c2pread, c2pwrite,
    976                         errread, errwrite,
    977                         restore_signals,
    978                         gid, gids, uid, umask,
    979                         start_new_session)
    980 except:
    981     # Cleanup if the child failed starting.
    982     for f in filter(None, (self.stdin, self.stdout, self.stderr)):

File /opt/conda/lib/python3.10/subprocess.py:1863, in Popen._execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, gid, gids, uid, umask, start_new_session)
   1861     if errno_num != 0:
   1862         err_msg = os.strerror(errno_num)
-> 1863     raise child_exception_type(errno_num, err_msg, err_filename)
   1864 raise child_exception_type(err_msg)

PermissionError: [Errno 13] Permission denied: '/kaggle/usr/lib/ubc_ocean_packages/triton/common/../third_party/cuda/bin/ptxas'

Here are my questions:

1- Is there a simple way to change the permissions on 'ptxas' to make this work (I tried chmod 600, for example, but it didn't work)? A sketch of what I'm trying to do follows question 3.

2- As Triton seems to be the culprit, I would have no problem doing without it. In fact, I worked on my model locally on my Windows PC and have been "forcing" xFormers to use CUTLASS when I needed memory_efficient_attention (with op=(fmha.cutlass.FwOp, fmha.cutlass.BwOp)) on 8 heads; 2, 4 and 16 heads didn't require specifying op, but with 8 heads I got an error about the size of my features dimension without it. Is there a specific instruction I can use in a Linux environment (such as Kaggle's) to force xFormers to use CUTLASS? (A sketch of the call I use is also included after question 3.)

3- I also realise that there could be an incompatibility issue between the Kaggle GPU environment (P100 with CUDA 11.4) and the currently available versions of xFormers (built for CUDA 11.8/12.1). If that were the case, would there still be older versions of xFormers compatible with this type of environment (even without Triton capability)?
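
To make question 1 concrete, this is roughly what I'm trying to do (a minimal sketch; the path is the one from the traceback above, and I'm assuming the file's mode can even be changed from inside the notebook, which may not be the case if /kaggle/usr/lib is mounted read-only):

import os
import stat

# Path reported in the PermissionError above
ptxas = "/kaggle/usr/lib/ubc_ocean_packages/triton/common/../third_party/cuda/bin/ptxas"

# chmod 600 drops the execute bit, so subprocess cannot launch the binary.
# Add read + execute on top of the current mode instead (roughly chmod 755).
mode = os.stat(ptxas).st_mode
os.chmod(ptxas, mode | stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP | stat.S_IROTH | stat.S_IXOTH)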
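
And to make question 2 concrete, this is the call I've been using on Windows to force CUTLASS (a minimal sketch with illustrative shapes; I'd simply like to confirm the same thing works under Linux/Kaggle):

import torch
from xformers.ops import fmha

# Illustrative shapes only: (batch, sequence, heads, head_dim)
q = torch.randn(1, 1024, 8, 96, device="cuda")
k = torch.randn(1, 1024, 8, 96, device="cuda")
v = torch.randn(1, 1024, 8, 96, device="cuda")

# Pin both the forward and backward kernels to the CUTLASS implementation
# instead of letting the dispatcher choose (which may pick Triton/Flash ops).
out = fmha.memory_efficient_attention(q, k, v, op=(fmha.cutlass.FwOp, fmha.cutlass.BwOp))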

Thanks in advance for your answers and guidance!

probit2011 avatar Nov 30 '23 05:11 probit2011

Hi, can you post the entire stack trace? There is nothing from xFormers in the one you posted. Can you also post the output of the following command?

python -m xformers.info

As a workaround, you can disable Triton in xFormers entirely by setting the environment variable XFORMERS_FORCE_DISABLE_TRITON=1.
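
In a notebook, that could look like the following (a minimal sketch; setting the variable before the first xformers import is the safe option):

import os

# Set before importing xformers so the Triton availability checks see it.
os.environ["XFORMERS_FORCE_DISABLE_TRITON"] = "1"

import xformers.ops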

danthe3rd avatar Nov 30 '23 12:11 danthe3rd

I have rebuilt the notebook by installing PyTorch and xFormers with the following:

!pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --target=/kaggle/working/ --index-url https://download.pytorch.org/whl/cu118
!pip3 install xformers --target=/kaggle/working/ --index-url https://download.pytorch.org/whl/cu118

It seemed to work, as I didn't get the error message mentioned above. However, when I tried to run my model, I got this puzzling error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 37
     35 coords = coords.squeeze(0)
     36 X = tiles.float().to(device=device, non_blocking=True)
---> 37 y_prob, pred, features = model(X, coords)
     38 query_preds.append((image_id.item(), labels[pred.to(device='cpu').item()]))
     39 query_features.append(features.view(-1).to(device='cpu'))

File /kaggle/usr/lib/ubc_ocean_packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

Cell In[5], line 227, in WSINet.forward(self, x, coords)
    225 def forward(self, x, coords):
    226     features = self.encoder(x).unsqueeze(0)
--> 227     features, mask = self.roformer(features, coords)
    228     y_prob, y_hat, attention = self.attention(features)
    230     return y_prob, y_hat, attention

File /kaggle/usr/lib/ubc_ocean_packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

Cell In[5], line 97, in RoFormerLayer.forward(self, x, coords)
     95 q, k = apply_rotary_position_embeddings(self.rope(h, grid_h, grid_w), q, k)
     96 q, k, v = q.reshape(bs, n, self.heads, self.head_dim), k.reshape(bs, n, self.heads, self.head_dim), v.reshape(bs, n, self.heads, self.head_dim)
---> 97 att = fmha.memory_efficient_attention(q, k, v, attn_bias=mask, p = self.dropout, op=(fmha.cutlass.FwOp, fmha.cutlass.BwOp))
     98 o = self.norm2(h + att.reshape(bs, n, h.size(-1)))
     99 ff = self.mlp(o)

File /kaggle/usr/lib/ubc_ocean_packages/xformers/ops/fmha/__init__.py:223, in memory_efficient_attention(query, key, value, attn_bias, p, scale, op)
    116 def memory_efficient_attention(
    117     query: torch.Tensor,
    118     key: torch.Tensor,
   (...)
    124     op: Optional[AttentionOp] = None,
    125 ) -> torch.Tensor:
    126     """Implements the memory-efficient attention mechanism following
    127     `"Self-Attention Does Not Need O(n^2) Memory" <http://arxiv.org/abs/2112.05682>`_.
    128 
   (...)
    221     :return: multi-head attention Tensor with shape ``[B, Mq, H, Kv]``
    222     """
--> 223     return _memory_efficient_attention(
    224         Inputs(
    225             query=query, key=key, value=value, p=p, attn_bias=attn_bias, scale=scale
    226         ),
    227         op=op,
    228     )

File /kaggle/usr/lib/ubc_ocean_packages/xformers/ops/fmha/__init__.py:321, in _memory_efficient_attention(inp, op)
    316 def _memory_efficient_attention(
    317     inp: Inputs, op: Optional[AttentionOp] = None
    318 ) -> torch.Tensor:
    319     # fast-path that doesn't require computing the logsumexp for backward computation
    320     if all(x.requires_grad is False for x in [inp.query, inp.key, inp.value]):
--> 321         return _memory_efficient_attention_forward(
    322             inp, op=op[0] if op is not None else None
    323         )
    325     output_shape = inp.normalize_bmhk()
    326     return _fMHA.apply(
    327         op, inp.query, inp.key, inp.value, inp.attn_bias, inp.p, inp.scale
    328     ).reshape(output_shape)

File /kaggle/usr/lib/ubc_ocean_packages/xformers/ops/fmha/__init__.py:339, in _memory_efficient_attention_forward(inp, op)
    337     op = _dispatch_fw(inp, False)
    338 else:
--> 339     _ensure_op_supports_or_raise(ValueError, "memory_efficient_attention", op, inp)
    341 out, *_ = op.apply(inp, needs_gradient=False)
    342 return out.reshape(output_shape)

File /kaggle/usr/lib/ubc_ocean_packages/xformers/ops/fmha/dispatch.py:39, in _ensure_op_supports_or_raise(exc_type, name, op, inp)
     37     if not reasons:
     38         return
---> 39     raise exc_type(
     40         f"""Operator `{name}` does not support inputs:
     41 {textwrap.indent(_format_inputs_description(inp), '     ')}
     42 {_format_not_supported_reasons(op, reasons)}"""
     43     )

ValueError: Operator `memory_efficient_attention` does not support inputs:
     query       : shape=(1, 7040, 8, 96) (torch.float32)
     key         : shape=(1, 7040, 8, 96) (torch.float32)
     value       : shape=(1, 7040, 8, 96) (torch.float32)
     attn_bias   : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalMask'>
     p           : 0.25
`cutlassF` is not supported because:
    xFormers wasn't build with CUDA support
    operator wasn't built - see `python -m xformers.info` for more info

Here is the output from xformers.info:

> !python -m xformers.info
xFormers 0.0.22.post7+cu118
memory_efficient_attention.cutlassF:               unavailable
memory_efficient_attention.cutlassB:               unavailable
memory_efficient_attention.decoderF:               unavailable
[email protected]:         unavailable
[email protected]:         unavailable
memory_efficient_attention.smallkF:                unavailable
memory_efficient_attention.smallkB:                unavailable
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
memory_efficient_attention.triton_splitKF:         unavailable
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             unavailable
swiglu.gemm_fused_operand_sum:                     unavailable
swiglu.fused.p.cpp:                                not built
is_triton_available:                               True
pytorch.version:                                   2.0.1+cu118
pytorch.cuda:                                      available
gpu.compute_capability:                            6.0
gpu.name:                                          Tesla P100-PCIE-16GB
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.10.13
build.torch_version:                               2.1.0+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0+PTX 9.0
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.22.post7
source.privacy:                                    open source

probit2011 avatar Nov 30 '23 13:11 probit2011

I think I found out why xFormers does not install with CUDA support in that environment, despite being on a GPU. I checked the install log from the auxiliary notebook and saw the following warnings:

cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.

I even tried to pip install cupy-cuda11x before installing xFormers, but I still got the same warnings and the same result. Could it be that the xFormers package does not check for a dependency of some of its own dependencies (i.e. cupy-cuda11x)?

In any case, I have also raised an issue with Kaggle Docker-python to see whether they can fix the cupy-cuda11x issue. I'll keep you posted.

probit2011 avatar Dec 01 '23 06:12 probit2011