triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 254208, Hardware limit: 101376.
First of all, thank you very much for your outstanding work. In my task I successfully replaced the Mamba1 module, but I ran into the following problem during the backward pass of the Mamba2 module. How can I adjust how much CUDA shared memory the kernels use? My hardware is an RTX 4090, and I would like to know whether this problem is caused by Mamba2's chunked (block-partitioned) matrix computation. The error message is as follows:
Traceback (most recent call last):
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
cli.main()
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
runpy.run_path(target, run_name="__main__")
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/jiang/.vscode-server/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "main.py", line 82, in <module>
main()
File "main.py", line 76, in main
finetune(args, config, train_writer, val_writer)
File "/home/jiang/xuyi/PointMamba/tools/runner_finetune.py", line 175, in run_net
_loss.backward()
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 140, in decorate_bwd
return bwd(*args, **kwargs)
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/mamba_ssm/ops/triton/ssd_combined.py", line 893, in backward
dx, ddt, dA, dB, dC, dD, _, ddt_bias, dinitial_states = _mamba_chunk_scan_combined_bwd(
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/mamba_ssm/ops/triton/ssd_combined.py", line 416, in _mamba_chunk_scan_combined_bwd
dB, ddA_next = _chunk_state_bwd_db(x, dt, dA_cumsum, dstates, seq_idx=seq_idx, B=B, ngroups=ngroups)
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/mamba_ssm/ops/triton/ssd_chunk_state.py", line 823, in _chunk_state_bwd_db
_chunk_state_bwd_db_kernel[grid_db](
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 114, in run
ret = self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **kwargs, **config.kwargs)
File "<string>", line 65, in _chunk_state_bwd_db_kernel
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/triton/compiler/compiler.py", line 579, in __getattribute__
self._init_handles()
File "/home/jiang/anaconda3/envs/pointmamba/lib/python3.9/site-packages/triton/compiler/compiler.py", line 568, in _init_handles
raise OutOfResources(self.shared, max_shared, "shared memory")
triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 254208, Hardware limit: 101376. Reducing block sizes or num_stages may help.
Can you try reducing d_state (e.g. <= 128) and chunk_size (e.g. try 128)?
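For reference, a minimal sketch of where these two parameters are passed when constructing a Mamba2 block from mamba_ssm (the d_model, headdim, and input shapes here are illustrative only and may differ from your configuration):

# Minimal sketch: d_state and chunk_size are constructor arguments of the
# Mamba2 block in mamba_ssm. Values below are hypothetical examples.
import torch
from mamba_ssm import Mamba2

block = Mamba2(
    d_model=256,     # model dimension (example value)
    d_state=128,     # reduced SSM state size, as suggested above
    d_conv=4,
    expand=2,
    headdim=64,
    chunk_size=128,  # smaller chunk size for the scan kernels
).cuda()

x = torch.randn(2, 1024, 256, device="cuda")
y = block(x)
y.sum().backward()   # the shared-memory error in this thread is raised during backward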
Thank you very much for your answer. These are my Mamba2 block parameters. I reduced d_state and chunk_size, but the required shared memory did not change: it still asks for 254208. Also, the problem does not occur in the forward pass, only in the backward pass, and if I use the Mamba1 network there is no such issue.
I have met the same problem as you. The bug occurred when I tried to use the Mamba2Simple module. May I ask if you have found a solution to this problem?
This is Triton's problem. Please uninstall triton and install triton-nightly. Reference: issues/438 @xypjq @zzzendurance
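After reinstalling, a quick way to confirm which Triton build the environment is actually using before re-running training (a minimal sketch; the version string you see depends on the nightly build you install):

# Sanity check after swapping triton for triton-nightly: print the active
# Triton version in the current environment, then re-run the backward pass
# that previously triggered the shared-memory error.
import triton

print("triton version:", triton.__version__)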
Thank you very much, hahaha. I found your reply after I left a message here. I have solved the problem, thank you again!
Thank you!
