FLOPs counter doesn't seem to work
🚀 The feature, motivation and pitch
I am able to run training with FSDP, but when I add the "--flop_counter" flag it fails with the output below. Could someone take a look at this issue? Would it also be possible to report the FLOP count by default? Thanks
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
1: The module hierarchy tracking seems to be messed up.Please file a bug to PyTorch.
0: :0:rocdevice.cpp :2875: 1456647898545 us: [pid:3202 tid:0x7f2309bff700] Callback: Queue 0x7ee2fba00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 234 MB
0: :0:rocdevice.cpp :2875: 1456647904587 us: [pid:3198 tid:0x7f020bbff700] Callback: Queue 0x7f0208200000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 226 MB
1: :0:rocdevice.cpp :2875: 2157204836001 us: [pid:208 tid:0x7fb03b1ff700] Callback: Queue 0x7f7031a00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 242 MB
1: :0:rocdevice.cpp :2875: 2157204836358 us: [pid:207 tid:0x7f0c92fff700] Callback: Queue 0x7ecc89800000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 246 MB
1: :0:rocdevice.cpp :2875: 2157204838420 us: [pid:203 tid:0x7f59a81ff700] Callback: Queue 0x7f199ea00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 242 MB
0: :0:rocdevice.cpp :2875: 1456647929027 us: [pid:3201 tid:0x7f2e33bff700] Callback: Queue 0x7f2e30200000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 226 MB
0: :0:rocdevice.cpp :2875: 1456648084561 us: [pid:3203 tid:0x7fac6c1ff700] Callback: Queue 0x7f6c62a00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 246 MB
1: W0823 17:42:51.540936 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 202 closing signal SIGTERM
1: W0823 17:42:51.543727 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 203 closing signal SIGTERM
1: W0823 17:42:51.544753 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 204 closing signal SIGTERM
1: W0823 17:42:51.547995 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 205 closing signal SIGTERM
1: W0823 17:42:51.549960 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 206 closing signal SIGTERM
1: W0823 17:42:51.552839 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 208 closing signal SIGTERM
1: W0823 17:42:51.553608 131 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 209 closing signal SIGTERM
0: :0:rocdevice.cpp :2875: 1456648361779 us: [pid:3204 tid:0x7f04efdff700] Callback: Queue 0x7ec4e6600000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 242 MB
0: W0823 17:42:51.587928 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3198 closing signal SIGTERM
0: W0823 17:42:51.588275 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3199 closing signal SIGTERM
0: W0823 17:42:51.591612 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3200 closing signal SIGTERM
0: W0823 17:42:51.592847 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3203 closing signal SIGTERM
0: W0823 17:42:51.595798 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3204 closing signal SIGTERM
0: W0823 17:42:51.597895 3126 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 3205 closing signal SIGTERM
0: E0823 17:42:52.329508 3126 torch/distributed/elastic/multiprocessing/api.py:863] failed (exitcode: -6) local_rank: 3 (pid: 3201) of binary: /opt/conda/envs/py_3.8/bin/python
0: Traceback (most recent call last):
0: File "/opt/conda/envs/py_3.8/bin/torchrun", line 33, in
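For context, the --flop_counter path in llama-recipes appears to be a thin wrapper around PyTorch's FlopCounterMode (torch.utils.flop_counter), and the "module hierarchy tracking" warning above is emitted by that PyTorch machinery rather than by llama-recipes itself (the message even asks to file a bug against PyTorch). Below is a minimal standalone sketch of the underlying API; the model, shapes, and names here are placeholders, not the actual finetuning setup.

import torch
from torch.utils.flop_counter import FlopCounterMode

# Toy stand-in for the real model; the counter works the same way on any nn.Module.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 64),
)
x = torch.randn(8, 64)

# display=True prints a per-module FLOP table when the context manager exits.
flop_counter = FlopCounterMode(display=True)
with flop_counter:
    model(x).sum().backward()

print(f"total FLOPs: {flop_counter.get_total_flops()}")

In a single-process run like this the counter reports cleanly; the warnings above only show up once it is combined with the FSDP-wrapped training loop, which is why this looks like an interaction between the flop counter's module tracking and FSDP rather than a problem with the counter itself.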
Alternatives
No response
Additional context
No response