training OutOfRangeError: End of sequence

Hi, I am using the tiny-imagenet-200 dataset to train a ResNet model, but I ran into the following problem:

Traceback (most recent call last):
  File "./resnet_ctl_imagenet_main.py", line 268, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "./resnet_ctl_imagenet_main.py", line 261, in main
    stats = run(flags.FLAGS)
  File "./resnet_ctl_imagenet_main.py", line 243, in run
    resnet_controller.train(evaluate=not flags_obj.skip_eval)
  File "/home/siwei.zm/mlperf/training/image_classification/tensorflow2/tf2_common/training/controller.py", line 258, in train
    train_outputs = self.train_fn(steps_per_loop)
  File "/home/siwei.zm/mlperf/training/image_classification/tensorflow2/tf2_common/training/standard_runnable.py", line 70, in train
    self.train_loop_fn(self.train_iter, num_steps)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
  [[{{node while/body/_1/while/IteratorGetNext}}]] [Op:__inference_loop_fn_24053]

Function call stack:
loop_fn

I ran the code with this command:

python3 ./resnet_ctl_imagenet_main.py \
  --base_learning_rate=8.5 \
  --batch_size=1 \
  --clean \
  --data_dir=/home/siwei.zm/mlperf/dataset/miniImageNet/tiny-imagenet-200/train \
  --datasets_num_private_threads=1 \
  --dtype=fp16 \
  --device_warmup_steps=1 \
  --noenable_device_warmup \
  --enable_eager \
  --noenable_xla \
  --epochs_between_evals=1 \
  --noeval_dataset_cache \
  --eval_offset_epochs=1 \
  --eval_prefetch_batchs=1 \
  --label_smoothing=0.1 \
  --lars_epsilon=0 \
  --log_steps=1 \
  --lr_schedule=polynomial \
  --model_dir=/home/siwei.zm/mlperf/model/ \
  --momentum=0.9 \
  --num_accumulation_steps=1 \
  --num_classes=200 \
  --num_gpus=1 \
  --optimizer=LARS \
  --noreport_accuracy_metrics \
  --single_l2_loss_op \
  --noskip_eval \
  --steps_per_loop=100000 \
  --target_accuracy=0.759 \
  --notf_data_experimental_slack \
  --tf_gpu_thread_mode=gpu_private \
  --notrace_warmup \
  --train_epochs=1 \
  --notraining_dataset_cache \
  --training_prefetch_batchs=1 \
  --nouse_synthetic_data \
  --warmup_epochs=1 \
  --weight_decay=0.0002

My host has 8 T4 GPUs with 16 GB of memory each.

I am not experienced with this. Where is the problem? Can anyone help me? Thanks very much!

Nov 12 '21 07:11 missximon

@sgpyc can you advise?

Nov 16 '22 19:11 johntran-nv

The "OutOfRangeError" suggests that maybe something in the code is still assuming >200 classes. Is there a reason why you can't use the full dataset that we use for MLPerf runs? I'm not sure we have engineering bandwidth to support out-of-scope use cases like this.
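For readers hitting the same trace: `End of sequence` at an `IteratorGetNext` node is what TensorFlow raises when a `tf.data` iterator has no more elements, so one more thing worth checking is whether a single training loop asks for more batches than the input pipeline can deliver. Below is a minimal sketch of that check in plain Python. It assumes the pipeline does not repeat, and the helper names are hypothetical. The 100,000 figure is the size of the Tiny ImageNet training split (200 classes x 500 images); whether this script's reader actually yields that many examples from the given `data_dir` is not confirmed in this thread.

```python
import math


def available_batches(num_examples, batch_size, drop_remainder=True):
    """Number of batches a non-repeating dataset can yield in one pass."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_remainder:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


def loop_overruns_dataset(num_examples, batch_size, steps_per_loop,
                          drop_remainder=True):
    """True if one training loop requests more batches than are available."""
    return steps_per_loop > available_batches(
        num_examples, batch_size, drop_remainder)


# Flag values from the command in this thread; num_examples is an assumption.
print(loop_overruns_dataset(num_examples=100_000, batch_size=1,
                            steps_per_loop=100_000))

# Hypothetical case: the reader finds fewer examples than expected.
print(loop_overruns_dataset(num_examples=50_000, batch_size=1,
                            steps_per_loop=100_000))
```

If the second case applies, lowering `--steps_per_loop` or verifying how many examples the input pipeline really reads would be reasonable next steps. This is a guess at the cause, not a confirmed diagnosis.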

Nov 29 '22 22:11 johntran-nv

Closing because the benchmark has been retired.

Jul 25 '24 16:07 hiwotadese