
SpargeAttention error

Open · opened by feifeibear on Apr 7, 2025 · 0 comments

torchrun --nproc_per_node=4 ./test/test_hybrid_attn.py --sp_ulysses_degree 4 --attn_impl sparse_sage --tune_mode
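For context on the error in the log below: nn.Module instances do not carry a .device attribute (devices live on their parameters and buffers), so reading it on an empty module raises exactly the AttributeError shown. A minimal standalone illustration, with a hypothetical Dummy class standing in for SparseAttentionMeansim (which is also an nn.Module subclass):

import torch.nn as nn

class Dummy(nn.Module):
    # stands in for SparseAttentionMeansim; no parameters, no buffers
    pass

try:
    Dummy().device
except AttributeError as e:
    # prints: 'Dummy' object has no attribute 'device'
    print(e)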

attn_processor is an instance of SparseAttentionMeansim, but it is empty now.
attn_processor.is_sparse is a substate_dict of attn_processor, we will load it.
(the two messages above are printed once per rank, so they appear four times, interleaved, in the original output)

[rank2]: Traceback (most recent call last):
[rank2]:   File "/file_system/fjr/code/long-context-attention/./test/test_hybrid_attn.py", line 172, in <module>
[rank2]:     load_sparse_attention_state_dict(usp_attn, saved_state_dict, multigpu=True, verbose=True)
[rank2]:   File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/spas_sage_attn-0.1.0-py3.10-linux-x86_64.egg/spas_sage_attn/autotune.py", line 36, in load_sparse_attention_state_dict
[rank2]:     sv= sv.to(device=v.device)
[rank2]:   File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
[rank2]:     raise AttributeError(
[rank2]: AttributeError: 'SparseAttentionMeansim' object has no attribute 'device'

(ranks 0, 1, and 3 fail with the identical traceback)

[rank0]:[W407 11:20:12.908119039 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(ranks 1, 2, and 3 print the same NCCL shutdown warning)

W0407 11:20:13.578000 457792 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 457918 closing signal SIGTERM
E0407 11:20:13.742000 457792 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 457917) of binary: /file_system/fjr/miniconda3/envs/xdit/bin/python

Traceback (most recent call last):
  File "/file_system/fjr/miniconda3/envs/xdit/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/file_system/fjr/miniconda3/envs/xdit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./test/test_hybrid_attn.py FAILED

Failures:
[1]:
  time       : 2025-04-07_11:20:13
  host       : localhost
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 457919)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2025-04-07_11:20:13
  host       : localhost
  rank       : 3 (local_rank: 3)
  exitcode   : 1 (pid: 457920)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2025-04-07_11:20:13
  host       : localhost
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 457917)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
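The immediate cause looks like spas_sage_attn/autotune.py line 36 reading v.device, while SparseAttentionMeansim is an nn.Module subclass and nn.Module does not define a .device attribute, so torch's __getattr__ raises. A minimal sketch of one possible workaround on the spas_sage_attn side, using a hypothetical helper that infers the device from the module's parameters or buffers instead:

import torch
import torch.nn as nn

def module_device(module: nn.Module, default: str = "cuda") -> torch.device:
    # Hypothetical helper: nn.Module has no .device attribute, so take the
    # device of the first parameter or buffer, and fall back to a default
    # for an empty (not-yet-tuned) SparseAttentionMeansim instance.
    for t in module.parameters():
        return t.device
    for t in module.buffers():
        return t.device
    return torch.device(default)

# The failing line in load_sparse_attention_state_dict,
#     sv= sv.to(device=v.device)
# could then become
#     sv = sv.to(device=module_device(v))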
