OpenPCDet
[Training Scripts] Distributed training script: train.py does not recognize the --local-rank argument passed by the launcher.
When running the command sh scripts/dist_train.sh 4 --cfg_file ..., I get the following error:
further instructions
warnings.warn(
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING]
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] *****************************************
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] *****************************************
usage: train.py [-h] [--cfg_file CFG_FILE] [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--workers WORKERS]
[--extra_tag EXTRA_TAG] [--ckpt CKPT] [--pretrained_model PRETRAINED_MODEL]
[--launcher {none,pytorch,slurm}] [--tcp_port TCP_PORT] [--sync_bn] [--fix_random_seed]
[--ckpt_save_interval CKPT_SAVE_INTERVAL] [--local_rank LOCAL_RANK]
[--max_ckpt_save_num MAX_CKPT_SAVE_NUM] [--merge_all_iters_to_one_epoch] [--set ...]
[--max_waiting_mins MAX_WAITING_MINS] [--start_epoch START_EPOCH]
[--num_epochs_to_eval NUM_EPOCHS_TO_EVAL] [--save_to_file] [--use_tqdm_to_record]
[--logger_iter_interval LOGGER_ITER_INTERVAL] [--ckpt_save_time_interval CKPT_SAVE_TIME_INTERVAL]
[--wo_gpu_stat] [--use_amp]
train.py: error: unrecognized arguments: --local-rank=0
[identical usage message repeated by the other worker processes; duplicates omitted]
train.py: error: unrecognized arguments: --local-rank=3
train.py: error: unrecognized arguments: --local-rank=2
train.py: error: unrecognized arguments: --local-rank=1
[2024-04-22 13:23:08,052] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid:
It seems the error comes from a mismatch between the argument that torch.distributed passes to train.py and the one train.py defines: since PyTorch 2.0, the launcher passes --local-rank (with a hyphen), while train.py only registers --local_rank (with an underscore), so argparse rejects it. The argument --local_rank should be renamed to --local-rank.
Suggested fix (train.py, line 36):
parser.add_argument('--local-rank', type=int, default=0, help='local rank for distributed training')
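
A slightly more robust variant, sketched below on the assumption that the rest of the argument parser in train.py stays unchanged, registers both spellings and falls back to the LOCAL_RANK environment variable that the PyTorch launcher exports, so the script works whether the launcher passes --local_rank or --local-rank:

import argparse
import os

parser = argparse.ArgumentParser()
# ... other train.py arguments unchanged ...
# Register both spellings; argparse stores the hyphenated form under the same
# destination (args.local_rank), so no downstream code needs to change.
# Fall back to the LOCAL_RANK environment variable, which the launcher sets.
parser.add_argument('--local_rank', '--local-rank', type=int,
                    default=int(os.environ.get('LOCAL_RANK', 0)),
                    help='local rank for distributed training')

With this change, sh scripts/dist_train.sh 4 --cfg_file ... should start all four workers on both older and newer PyTorch versions, since each process resolves its rank through args.local_rank regardless of which flag spelling the launcher uses.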
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.