
FLOPs counter doesn't seem to work

Open · mathmax12 opened this issue 6 months ago · 8 comments

🚀 The feature, motivation and pitch

I am able to run the training with FSDP, but after adding the "--flop_counter" flag it fails with the error below. Could someone take a look at this issue? Also, would it be possible to report the FLOP count by default? Thanks

```
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
0: :0:rocdevice.cpp :2875: 1456647898545 us: [pid:3202 tid:0x7f2309bff700] Callback: Queue 0x7ee2fba00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 234 MB
0: :0:rocdevice.cpp :2875: 1456647904587 us: [pid:3198 tid:0x7f020bbff700] Callback: Queue 0x7f0208200000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 226 MB
1: :0:rocdevice.cpp :2875: 2157204836001 us: [pid:208 tid:0x7fb03b1ff700] Callback: Queue 0x7f7031a00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 242 MB
1: :0:rocdevice.cpp :2875: 2157204836358 us: [pid:207 tid:0x7f0c92fff700] Callback: Queue 0x7ecc89800000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 246 MB
1: :0:rocdevice.cpp :2875: 2157204838420 us: [pid:203 tid:0x7f59a81ff700] Callback: Queue 0x7f199ea00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 242 MB
0: :0:rocdevice.cpp :2875: 1456647929027 us: [pid:3201 tid:0x7f2e33bff700] Callback: Queue 0x7f2e30200000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 226 MB
0: :0:rocdevice.cpp :2875: 1456648084561 us: [pid:3203 tid:0x7fac6c1ff700] Callback: Queue 0x7f6c62a00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 246 MB
1: W0823 17:42:51.540936 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 202 closing signal SIGTERM
1: W0823 17:42:51.543727 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 203 closing signal SIGTERM
1: W0823 17:42:51.544753 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 204 closing signal SIGTERM
1: W0823 17:42:51.547995 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 205 closing signal SIGTERM
1: W0823 17:42:51.549960 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 206 closing signal SIGTERM
1: W0823 17:42:51.552839 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 208 closing signal SIGTERM
1: W0823 17:42:51.553608 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 209 closing signal SIGTERM
0: :0:rocdevice.cpp :2875: 1456648361779 us: [pid:3204 tid:0x7f04efdff700] Callback: Queue 0x7ec4e6600000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 242 MB
0: W0823 17:42:51.587928 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3198 closing signal SIGTERM
0: W0823 17:42:51.588275 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3199 closing signal SIGTERM
0: W0823 17:42:51.591612 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3200 closing signal SIGTERM
0: W0823 17:42:51.592847 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3203 closing signal SIGTERM
0: W0823 17:42:51.595798 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3204 closing signal SIGTERM
0: W0823 17:42:51.597895 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3205 closing signal SIGTERM
0: E0823 17:42:52.329508 3126 torch/distributed/elastic/multiprocessing/api.py:863] failed (exitcode: -6) local_rank: 3 (pid: 3201) of binary: /opt/conda/envs/py_3.8/bin/python
0: Traceback (most recent call last):
0:   File "/opt/conda/envs/py_3.8/bin/torchrun", line 33, in <module>
0:     sys.exit(load_entry_point('torch==2.5.0a0+git10344d7', 'console_scripts', 'torchrun')())
0:   File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
0:     return f(*args, **kwargs)
0:   File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/distributed/run.py", line 919, in main
0:     run(args)
0:   File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/distributed/run.py", line 910, in run
0:     elastic_launch(
0:   File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
0:     return launch_agent(self._config, self._entrypoint, list(args))
0:   File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
0:     raise ChildFailedError(
0: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
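For context, PyTorch's built-in FLOP counting utility is torch.utils.flop_counter.FlopCounterMode, which also appears to be the source of the "module hierarchy tracking" warnings above. Below is a minimal sketch of wrapping a single training step with it; the model, batch, and optimizer names are placeholders, and how llama-recipes actually wires this behind --flop_counter is an assumption here, not taken from this report.

```python
# Minimal sketch of counting FLOPs for one training step with PyTorch's
# FlopCounterMode. `model`, `batch`, and `optimizer` are placeholders; the exact
# integration behind llama-recipes' --flop_counter flag is assumed, not verified.
from torch.utils.flop_counter import FlopCounterMode

def count_step_flops(model, batch, optimizer):
    flop_counter = FlopCounterMode(display=False)  # skip printing the per-module table
    with flop_counter:                             # traces ops executed inside the block
        loss = model(**batch).loss                 # forward
        loss.backward()                            # backward
        optimizer.step()
        optimizer.zero_grad()
    return flop_counter.get_total_flops()          # total FLOPs observed in the block
```

Note that the ROCm messages above report only ~230-250 MB of free device memory per GPU when the ranks abort, so the HSA_STATUS_ERROR_OUT_OF_RESOURCES failure may reflect the run already sitting near the memory limit, with the counter's extra bookkeeping tipping it over, rather than a bug in the counter alone.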

Alternatives

No response

Additional context

No response

mathmax12 · Aug 23 '24 18:08