triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 254208, Hardware limit: 101376.
First of all, thank you very much for your outstanding work. In my task I successfully replaced the Mamba1 module, but I ran into the following problem during the backward pass of the Mamba2 module. How can I adjust how much CUDA shared memory the kernels use? My hardware is an RTX 4090, and I would like to know whether this problem is caused by Mamba2's chunked (block-partitioned) matrix computation. The error message is as follows:
Traceback (most recent call last):
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
cli.main()
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
runpy.run_path(target, run_name="__main__")
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "main.py", line 82, in <module>
main()
File "main.py", line 76, in main
finetune(args, config, train_writer, val_writer)
File "/home/jiang/xuyi/PointMamba/tools/runner_finetune.py", line 175, in run_net
_loss.backward()
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 140, in decorate_bwd
return bwd(*args, **kwargs)
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/mamba_ssm/ops/triton/ssd_combined.py", line 893, in backward
dx, ddt, dA, dB, dC, dD, _, ddt_bias, dinitial_states = _mamba_chunk_scan_combined_bwd(
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/mamba_ssm/ops/triton/ssd_combined.py", line 416, in _mamba_chunk_scan_combined_bwd
dB, ddA_next = _chunk_state_bwd_db(x, dt, dA_cumsum, dstates, seq_idx=seq_idx, B=B, ngroups=ngroups)
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/mamba_ssm/ops/triton/ssd_chunk_state.py", line 823, in _chunk_state_bwd_db
_chunk_state_bwd_db_kernel[grid_db](
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 114, in run
ret = self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **kwargs, **config.kwargs)
File "<string>", line 65, in _chunk_state_bwd_db_kernel
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/triton/compiler/compiler.py", line 579, in __getattribute__
self._init_handles()
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/triton/compiler/compiler.py", line 568, in _init_handles
raise OutOfResources(self.shared, max_shared, "shared memory")
triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 254208, Hardware limit: 101376. Reducing block sizes or num_stages may help.
Can you try reducing d_state (e.g. <= 128) and chunk_size (e.g. try 128)?
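For reference, a minimal sketch of where these two parameters are passed when constructing a Mamba2 block from mamba_ssm (the d_model, headdim, and input shapes here are illustrative only and may differ from your configuration):

# Minimal sketch: d_state and chunk_size are constructor arguments of the
# Mamba2 block in mamba_ssm. Values below are hypothetical examples.
import torch
from mamba_ssm import Mamba2

block = Mamba2(
    d_model=256,     # model dimension (example value)
    d_state=128,     # reduced SSM state size, as suggested above
    d_conv=4,
    expand=2,
    headdim=64,
    chunk_size=128,  # smaller chunk size for the scan kernels
).cuda()

x = torch.randn(2, 1024, 256, device="cuda")
y = block(x)
y.sum().backward()   # the shared-memory error in this thread is raised during backward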
Thank you very much for your answer. These are my Mamba2 block parameters. I reduced d_state and chunk_size, but the required shared memory did not change: it still asks for 254208. Also, the problem does not occur in the forward pass, only in the backward pass, and if I use the Mamba1 network there is no such issue.
I have met the same problem as you. The bug occurred when I tried to use the Mamba2Simple module. May I ask if you have found a solution to this problem?
This is Triton's problem. Please uninstall triton and install triton-nightly. Reference: issues/438 @xypjq @zzzendurance
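After reinstalling, a quick way to confirm which Triton build the environment is actually using before re-running training (a minimal sketch; the version string you see depends on the nightly build you install):

# Sanity check after swapping triton for triton-nightly: print the active
# Triton version in the current environment, then re-run the backward pass
# that previously triggered the shared-memory error.
import triton

print("triton version:", triton.__version__)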
Thank you very much, hahaha. I found your reply after I left a message here. I have solved the problem, thank you again!
Thank you!
