DAMO-YOLO
[Bug]:
Before Reporting
- [X] I have pulled the latest code of the main branch to run again and the bug still exists.
- [X] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)
Search before reporting
OS
Ubuntu
Device
Colab T4
CUDA version
12.2
TensorRT version
No response
Python version
Python 3.10.12
PyTorch version
2.3.0+cu121
torchvision version
0.18.0+cu121
Describe the bug
I'm trying to fine-tune the model on a custom dataset on Colab with:
!cd damo-yolo/ && export PYTHONPATH=/content/damo-yolo && CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 tools/train.py --node_rank=0 -f configs/damoyolo_tinynasL20_T.py
and it fails with the error shown in the Logs section below.
To Reproduce
- follow the README to prepare the custom dataset
- run this command: export PYTHONPATH=/content/damo-yolo && CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 tools/train.py --node_rank=0 -f configs/damoyolo_tinynasL20_T.py
Hyper-parameters/Configs
-f configs/damoyolo_tinynasL20_T.py
nproc_per_node=1
Logs
/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank
argument to be set, please
change it to read from os.environ['LOCAL_RANK']
instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
usage: Damo-Yolo train parser [-h] [-f CONFIG_FILE] [--local_rank LOCAL_RANK]
[--tea_config TEA_CONFIG] [--tea_ckpt TEA_CKPT]
...
Damo-Yolo train parser: error: unrecognized arguments: --local-rank=0 --node_rank=0
E0625 18:21:09.782000 137977072861184 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 2) local_rank: 0 (pid: 9042) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2024-06-25_18:21:09
  host      : 64f93cf95e56
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 9042)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
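
From the parser error line, the failure looks like an argument-name mismatch: the launcher forwards the rank to the script as --local-rank=0 (hyphen), while the train parser only registers --local_rank (underscore), and --node_rank=0 placed after tools/train.py is forwarded to the script as well instead of being consumed by the launcher. A minimal standalone sketch (illustrative names, not DAMO-YOLO's actual code) of the mismatch and a possible workaround:

```python
# Standalone sketch (hypothetical parser, not DAMO-YOLO's actual code) of a
# possible workaround: register both --local_rank and --local-rank, and fall
# back to the LOCAL_RANK environment variable as the deprecation warning
# in the logs suggests.
import argparse
import os

parser = argparse.ArgumentParser("Damo-Yolo train parser (sketch)")
parser.add_argument("-f", "--config_file", type=str, help="training config file")
# Registering both spellings lets the script work with older launchers
# (--local_rank) and the hyphenated form seen in the error above (--local-rank).
parser.add_argument("--local_rank", "--local-rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)),
                    help="local rank; defaults to the LOCAL_RANK env var")

# The hyphenated form that previously triggered "unrecognized arguments"
# now parses cleanly.
args = parser.parse_args(["--local-rank=0", "-f", "configs/damoyolo_tinynasL20_T.py"])
print(args.local_rank, args.config_file)
```

Separately, moving --node_rank=0 before tools/train.py (or dropping it for a single-node run) would keep it from being forwarded to the training script.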
Screenshots
[Screenshot: datasets folder]
Additional
No response