
[Training Scripts] Distributed Training Script: Incorrect Python Argument Name

Open · tjtanaa opened this issue 1 year ago • 1 comment

When running the command `sh scripts/dist_train.sh 4 --cfg_file ...`, I get the following error.

    [...] further instructions

      warnings.warn(
    [2024-04-22 13:22:53,028] torch.distributed.run: [WARNING]
    [2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] *****************************************
    [2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    [2024-04-22 13:22:53,028] torch.distributed.run: [WARNING] *****************************************
    usage: train.py [-h] [--cfg_file CFG_FILE] [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--workers WORKERS]
                    [--extra_tag EXTRA_TAG] [--ckpt CKPT] [--pretrained_model PRETRAINED_MODEL]
                    [--launcher {none,pytorch,slurm}] [--tcp_port TCP_PORT] [--sync_bn] [--fix_random_seed]
                    [--ckpt_save_interval CKPT_SAVE_INTERVAL] [--local_rank LOCAL_RANK]
                    [--max_ckpt_save_num MAX_CKPT_SAVE_NUM] [--merge_all_iters_to_one_epoch] [--set ...]
                    [--max_waiting_mins MAX_WAITING_MINS] [--start_epoch START_EPOCH]
                    [--num_epochs_to_eval NUM_EPOCHS_TO_EVAL] [--save_to_file] [--use_tqdm_to_record]
                    [--logger_iter_interval LOGGER_ITER_INTERVAL] [--ckpt_save_time_interval CKPT_SAVE_TIME_INTERVAL]
                    [--wo_gpu_stat] [--use_amp]
    train.py: error: unrecognized arguments: --local-rank=0
    [... identical usage message repeated by each remaining rank ...]
    train.py: error: unrecognized arguments: --local-rank=3
    train.py: error: unrecognized arguments: --local-rank=2
    train.py: error: unrecognized arguments: --local-rank=1
    [2024-04-22 13:23:08,052] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 

The error comes from a mismatch in how the launcher and `train.py` spell the argument: since PyTorch 2.0, `torch.distributed.run` passes `--local-rank` (hyphenated) to each worker process, while `train.py` still registers `--local_rank` (underscored). The argument `--local_rank` should be renamed to `--local-rank`.
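The mismatch is easy to reproduce outside OpenPCDet. The following minimal standalone sketch (illustration only, not code from this repo) shows that argparse accepts only the exact option strings it was given:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0,
                    help='local rank for distributed training')

# The underscored spelling matches the registered option and parses fine:
print(parser.parse_args(['--local_rank=1']))  # Namespace(local_rank=1)

# The hyphenated spelling that torch.distributed.run passes on
# PyTorch >= 2.0 is a different option string, so argparse exits with
# "error: unrecognized arguments: --local-rank=1", as in the log above.
parser.parse_args(['--local-rank=1'])
```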

Suggested fix (train.py, line 36):

    parser.add_argument('--local-rank', type=int, default=0, help='local rank for distributed training')
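Since argparse converts dashes to underscores when deriving the attribute name, the value is still read as `args.local_rank`, so no other part of `train.py` should need to change. If backward compatibility with older launchers (which pass `--local_rank`) matters, a more defensive variant is possible. The sketch below is illustrative only (`parse_local_rank` is a hypothetical helper, not a function in this repo); it registers both spellings and also falls back to the `LOCAL_RANK` environment variable that newer launchers export:

```python
import argparse
import os

def parse_local_rank():
    parser = argparse.ArgumentParser(description='arg parser')
    # Register both spellings: torch.distributed.run passes --local-rank on
    # PyTorch >= 2.0, while older launchers passed --local_rank. argparse
    # derives the attribute name from the first long option, so the value
    # is stored as args.local_rank either way.
    parser.add_argument('--local_rank', '--local-rank', type=int, default=None,
                        help='local rank for distributed training')
    args, _ = parser.parse_known_args()

    # Newer launchers also export LOCAL_RANK; use it when no flag was given.
    if args.local_rank is None:
        args.local_rank = int(os.environ.get('LOCAL_RANK', 0))
    return args.local_rank

if __name__ == '__main__':
    print(parse_local_rank())
```

With either fix, `sh scripts/dist_train.sh 4 --cfg_file ...` should launch all four workers without the unrecognized-argument errors shown above.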

tjtanaa · Apr 22 '24

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] · May 23 '24

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Jun 06 '24