Hi, I am using the tiny-imagenet-200 dataset to train the ResNet model, but I ran into the following problem:
Traceback (most recent call last):
File "./resnet_ctl_imagenet_main.py", line 268, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "./resnet_ctl_imagenet_main.py", line 261, in main
stats = run(flags.FLAGS)
File "./resnet_ctl_imagenet_main.py", line 243, in run
resnet_controller.train(evaluate=not flags_obj.skip_eval)
File "/home/siwei.zm/mlperf/training/image_classification/tensorflow2/tf2_common/training/controller.py", line 258, in train
train_outputs = self.train_fn(steps_per_loop)
File "/home/siwei.zm/mlperf/training/image_classification/tensorflow2/tf2_common/training/standard_runnable.py", line 70, in train
self.train_loop_fn(self.train_iter, num_steps)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 888, in _call
return self._stateless_fn(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[{{node while/body/_1/while/IteratorGetNext}}]] [Op:__inference_loop_fn_24053]
Function call stack:
loop_fn
I run the code with this script:
python3 ./resnet_ctl_imagenet_main.py \
--base_learning_rate=8.5 \
--batch_size=1 \
--clean \
--data_dir=/home/siwei.zm/mlperf/dataset/miniImageNet/tiny-imagenet-200/train \
--datasets_num_private_threads=1 \
--dtype=fp16 \
--device_warmup_steps=1 \
--noenable_device_warmup \
--enable_eager \
--noenable_xla \
--epochs_between_evals=1 \
--noeval_dataset_cache \
--eval_offset_epochs=1 \
--eval_prefetch_batchs=1 \
--label_smoothing=0.1 \
--lars_epsilon=0 \
--log_steps=1 \
--lr_schedule=polynomial \
--model_dir=/home/siwei.zm/mlperf/model/ \
--momentum=0.9 \
--num_accumulation_steps=1 \
--num_classes=200 \
--num_gpus=1 \
--optimizer=LARS \
--noreport_accuracy_metrics \
--single_l2_loss_op \
--noskip_eval \
--steps_per_loop=100000 \
--target_accuracy=0.759 \
--notf_data_experimental_slack \
--tf_gpu_thread_mode=gpu_private \
--notrace_warmup \
--train_epochs=1 \
--notraining_dataset_cache \
--training_prefetch_batchs=1 \
--nouse_synthetic_data \
--warmup_epochs=1 \
--weight_decay=0.0002
My host has 8 T4 GPUs with 16 GB each.
I am not very experienced with this. Where is the problem? Can anyone help me?
Thanks very much!
The "OutOfRangeError" suggests that maybe something in the code is still assuming >200 classes. Is there a reason why you can't use the full dataset that we use for MLPerf runs? I'm not sure we have engineering bandwidth to support out-of-scope use cases like this.
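One more thing that may be worth checking: an "End of sequence" error raised from an `IteratorGetNext` node generally means the input iterator ran out of elements before the training loop finished the steps it was asked to run. The sketch below is only a rough sanity check, not code from the benchmark. It assumes a finite, non-repeated input pipeline, and `num_examples` is a hypothetical placeholder for however many records are actually found under `--data_dir`. It compares the number of batches such a dataset can yield against `--steps_per_loop`.

```python
import math


def batches_available(num_examples: int, batch_size: int,
                      drop_remainder: bool = True) -> int:
    """Number of batches a finite, non-repeated dataset can yield."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_remainder:
        # A trailing partial batch is discarded.
        return num_examples // batch_size
    # A trailing partial batch still counts as one batch.
    return math.ceil(num_examples / batch_size)


def loop_would_exhaust(num_examples: int, batch_size: int,
                       steps_requested: int,
                       drop_remainder: bool = True) -> bool:
    """True if the loop asks for more batches than the dataset holds."""
    return steps_requested > batches_available(
        num_examples, batch_size, drop_remainder)


if __name__ == "__main__":
    # batch_size and steps_requested come from the command in this thread;
    # num_examples is a made-up value used only for illustration.
    print(loop_would_exhaust(num_examples=50_000, batch_size=1,
                             steps_requested=100_000))
```

If a check like this reports that the loop would exhaust the data, lowering `--steps_per_loop` or verifying how many records the input pipeline really reads from `--data_dir` would be the first things to try.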
Closing because the benchmark has retired.