
Unable to train SSD from scratch

Open DEKHTIARJonathan opened this issue 5 years ago • 2 comments

I used the command provided in #310.

I repeatedly get the same error (with a single GPU or with multiple GPUs):

2019-02-17 16:42:14.539440: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at function_ops.cc:47 : Invalid argument: Argument 1 is out of range.
2019-02-17 16:42:14.541932: W tensorflow/core/kernels/data/generator_dataset_op.cc:79] Error occurred when finalizing GeneratorDataset iterator: Invalid argument: Argument 1 is out of range.

I use this command:

python tf_cnn_benchmarks.py \
  --model=ssd300 \
  --data_name=coco \
  --data_dir=/data/coco \
  --optimizer=momentum \
  --weight_decay=5e-4 \
  --momentum=0.9 \
  --num_gpus=8 \
  --batch_size=64 \
  --use_fp16 \
  --xla_compile \
  --num_epochs=80 \
  --num_eval_epochs=1.9 \
  --num_warmup_batches=0 \
  --eval_during_training_at_specified_steps='7500,10000,11250,12500,12707,15000' \
  --datasets_num_private_threads=100 \
  --num_inter_threads=160 \
  --variable_update=replicated \
  --all_reduce_spec=nccl \
  --gradient_repacking=2 \
  --stop_at_top_1_accuracy=0.212 \
  --loss_type_to_report=base_loss  \
  --single_l2_loss_op \
  --compute_lr_on_cpu \
  --collect_eval_results_async

The TFRecords were generated with this script: https://github.com/tensorflow/tpu/blob/master/tools/datasets/download_and_preprocess_coco.sh
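
The second warning mentions the GeneratorDataset iterator, so it may be worth ruling out an empty or truncated TFRecord shard before looking further. This is only a guess at a cause, and the sketch below is not part of tf_cnn_benchmarks. It assumes the standard TFRecord framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC) and walks each file without verifying checksums. The function name `count_tfrecords` is hypothetical.

```python
import struct


def count_tfrecords(path):
    """Count records in a TFRecord file by walking its length-prefixed framing.

    Assumed record layout: uint64 length, uint32 length CRC, payload,
    uint32 payload CRC. The CRCs are skipped, not verified.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte little-endian length + 4-byte CRC
            if not header:
                return count  # clean end of file
            if len(header) < 12:
                raise ValueError(
                    f"{path}: truncated header after {count} records")
            (length,) = struct.unpack("<Q", header[:8])
            body = f.read(length + 4)  # payload + 4-byte payload CRC
            if len(body) < length + 4:
                raise ValueError(
                    f"{path}: truncated record after {count} records")
            count += 1
```

Calling `count_tfrecords` on each file under the directory passed as `--data_dir` and summing the results gives a total that can be compared with the expected number of COCO images. A `ValueError` or a count of zero would point at the data rather than at the training flags.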

DEKHTIARJonathan • Feb 17 '19 16:02

I ran into the same problem. Have you solved it?

YangFei1990 • Apr 24 '19 18:04

Hello, could you try the script provided in this comment? FYI, there is also a backbone model checkpoint you can use.

haoyuz • Jan 17 '20 17:01