Unable to train SSD from scratch
I used the command provided in #310.
I repeatedly get the same error, with a single GPU or with multiple GPUs:
2019-02-17 16:42:14.539440: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at function_ops.cc:47 : Invalid argument: Argument 1 is out of range.
2019-02-17 16:42:14.541932: W tensorflow/core/kernels/data/generator_dataset_op.cc:79] Error occurred when finalizing GeneratorDataset iterator: Invalid argument: Argument 1 is out of range.
I use this command:
python tf_cnn_benchmarks.py \
--model=ssd300 \
--data_name=coco \
--data_dir=/data/coco \
--optimizer=momentum \
--weight_decay=5e-4 \
--momentum=0.9 \
--num_gpus=8 \
--batch_size=64 \
--use_fp16 \
--xla_compile \
--num_epochs=80 \
--num_eval_epochs=1.9 \
--num_warmup_batches=0 \
--eval_during_training_at_specified_steps='7500,10000,11250,12500,12707,15000' \
--datasets_num_private_threads=100 \
--num_inter_threads=160 \
--variable_update=replicated \
--all_reduce_spec=nccl \
--gradient_repacking=2 \
--stop_at_top_1_accuracy=0.212 \
--loss_type_to_report=base_loss \
--single_l2_loss_op \
--compute_lr_on_cpu \
--collect_eval_results_async
The TFRecords were generated with this script: https://github.com/tensorflow/tpu/blob/master/tools/datasets/download_and_preprocess_coco.sh
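As a quick sanity check on the flags in the command, the sketch below estimates the total number of training steps and confirms that the steps listed in `--eval_during_training_at_specified_steps` fall inside the run. It rests on two assumptions that are not stated in this thread: that `--batch_size` is the per-GPU batch size, and that the training set holds 118,287 images (the COCO train2017 count; the generated TFRecords may contain fewer). The helper `training_steps` is hypothetical and is not part of `tf_cnn_benchmarks`. This does not explain the error; it only rules out one possible misconfiguration.

```python
import math


def training_steps(num_examples, per_gpu_batch, num_gpus, num_epochs):
    """Return (global batch size, estimated total training steps).

    Hypothetical helper for illustration; the benchmark script may round
    the step count differently.
    """
    global_batch = per_gpu_batch * num_gpus
    steps_per_epoch = num_examples / global_batch
    return global_batch, math.ceil(steps_per_epoch * num_epochs)


# Assumption: COCO train2017 image count; adjust to the real TFRecord count.
NUM_TRAIN_EXAMPLES = 118287

# Values taken from the command: --batch_size=64 --num_gpus=8 --num_epochs=80
global_batch, total_steps = training_steps(NUM_TRAIN_EXAMPLES, 64, 8, 80)

# Values taken from --eval_during_training_at_specified_steps
eval_steps = [7500, 10000, 11250, 12500, 12707, 15000]
out_of_range = [step for step in eval_steps if step > total_steps]

print("global batch size:", global_batch)
print("estimated total steps:", total_steps)
print("eval steps beyond the run:", out_of_range)
```

Under these assumptions every requested evaluation step lies within the run, so the evaluation schedule itself looks consistent with the other flags.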
I ran into the same problem. Have you solved it?
Hello, could you try the script provided in this comment? FYI, there is also a backbone model checkpoint you can use.