
Resnet56 with CIFAR-10 produces huge log file

Open tjingrant opened this issue 7 years ago • 10 comments

Hi,

I'm trying to train resnet56 on CIFAR-10 with the following params. However, each time I start the run, it creates a log file of 1.2G or 2.4G. I have to restart this training frequently, so the logs quickly grow unmanageable. In contrast, each log file Resnet50 creates on the ImageNet dataset is around 30-40MB, which is much more manageable...

Also when I view them with Tensorboard, I can only see projector, not scalars like loss.

Do you have any idea what's going on?

		--model=resnet56 \
		--batch_size=128 \
		--num_epochs=$i \
		--num_gpus=1 \
		--data_dir=/mnt/nfs/cifar-10/cifar-10-batches-py \
		--data_name=cifar10 \
		--variable_update="replicated" \
		--train_dir=resnet56/ \
		--all_reduce_spec=nccl \
		--print_training_accuracy=True \
		--optimizer="momentum" \
		--piecewise_learning_rate_schedule="0.1;250;0.01;375;0.001" \
		--momentum=0.9 \
		--weight_decay=0.0001 \
		--summary_verbosity=1 \
		--save_summaries_steps=200 \
		--save_model_secs=600 \

tjingrant avatar Jan 26 '18 05:01 tjingrant

As you can see, the verbosity is set to 1, the same value I used when training Resnet50 on ImageNet.

tjingrant avatar Jan 26 '18 05:01 tjingrant

I believe the issue is that in preprocessing.py, we have a tf.constant node with all the images, which takes up more than a gigabyte of space in the graph.pbtxt that's written out to disk.

This is not a priority for us, so I'm marking this as contributions welcome if anyone wants to work on it. To solve this, we should store the images in a variable and initialize that variable with the images without storing them in the graph. One way of doing this is to use a feed dict. Another would be to use tf.data.

reedwm avatar Jan 26 '18 18:01 reedwm
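For readers landing here, a minimal sketch of the variable-plus-feed-dict pattern described above, assuming the TF 1.x APIs of the time (the array shape and the images_ph/images_var names are illustrative, not taken from preprocessing.py):

```python
import numpy as np
import tensorflow as tf  # TF 1.x, matching the benchmarks code of that era

# Hypothetical stand-in for the CIFAR-10 array loaded in preprocessing.py.
all_images = np.zeros((50000, 32, 32, 3), dtype=np.uint8)

# A placeholder contributes no data to the GraphDef; only its shape/dtype
# are serialized. Feeding it once at variable-init time keeps the ~1GB of
# pixels out of graph.pbtxt.
images_ph = tf.placeholder(tf.uint8, shape=all_images.shape)

# collections=[] keeps the variable out of GLOBAL_VARIABLES, so a blanket
# global_variables_initializer() won't try to run this initializer unfed.
images_var = tf.Variable(images_ph, trainable=False, collections=[])

with tf.Session() as sess:
    # Initialize the variable by feeding the real data, instead of
    # embedding it as a tf.constant in the graph.
    sess.run(images_var.initializer, feed_dict={images_ph: all_images})
```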

@reedwm thanks, I could look into this one. So is the graph definition also written to the tfevents logs?

My concern is whether this solves the huge tfevents file problem...

tjingrant avatar Jan 26 '18 18:01 tjingrant

You're right, the events.out.tfevents file is also huge. I'm not sure whether the graph definition is written to it. The issue goes away if I omit --data_dir.

Also, in TensorBoard, I was able to see the scalar by clicking "Inactive", then "Scalars."

reedwm avatar Jan 26 '18 19:01 reedwm
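On the open question above: in TF 1.x the graph definition does end up in the events file whenever a graph is handed to the summary writer, which training scripts of this era commonly did (directly or via tf.train.Supervisor). A minimal illustration, with a placeholder log directory:

```python
import tensorflow as tf  # TF 1.x

# Passing a graph to FileWriter serializes the entire GraphDef -- including
# any giant tf.constant -- into events.out.tfevents.*, which would explain
# the events file matching graph.pbtxt in size. "/tmp/logdir" is a placeholder.
writer = tf.summary.FileWriter("/tmp/logdir", graph=tf.get_default_graph())
writer.close()
```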

@jsimsa can you shed some light on this situation? Your name appears in the Cifar10ImagePreprocessor class...

tjingrant avatar Jan 26 '18 19:01 tjingrant

Doing what @reedwm suggests makes sense.

jsimsa avatar Jan 26 '18 22:01 jsimsa

@jsimsa, I wonder if doing what @reedwm suggests could also solve the huge tfevents file problem?

tjingrant avatar Jan 26 '18 22:01 tjingrant

I have confirmed that the tf.constant node is also causing the large tfevents file: after changing the line all_images = tf.constant(all_images) to all_images = tf.constant(all_images[:200, ...]) in preprocessing.py, the tfevents file was only a few megabytes.

IMO, changing the constant to a variable and initializing it using a feed dict is the easiest way to solve this problem.

reedwm avatar Jan 26 '18 23:01 reedwm
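For completeness, the tf.data route mentioned earlier avoids the embedded constant the same way: route the arrays through placeholders and an initializable iterator, rather than calling from_tensor_slices on the raw NumPy arrays (which would still bake them into the graph as constants). A sketch with illustrative names; the batch size comes from the flags above and 50000 is the CIFAR-10 training-set size:

```python
import numpy as np
import tensorflow as tf  # TF 1.x

# Hypothetical stand-ins for the arrays loaded in preprocessing.py.
all_images = np.zeros((50000, 32, 32, 3), dtype=np.uint8)
all_labels = np.zeros((50000,), dtype=np.int32)

# Placeholders keep the data out of the GraphDef.
images_ph = tf.placeholder(all_images.dtype, all_images.shape)
labels_ph = tf.placeholder(all_labels.dtype, all_labels.shape)

dataset = (tf.data.Dataset.from_tensor_slices((images_ph, labels_ph))
           .shuffle(50000)
           .repeat()
           .batch(128))
iterator = dataset.make_initializable_iterator()
images, labels = iterator.get_next()

with tf.Session() as sess:
    # The real data is supplied only at iterator-init time, via feed_dict.
    sess.run(iterator.initializer,
             feed_dict={images_ph: all_images, labels_ph: all_labels})
```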

@reedwm thanks a lot for looking into this issue, I'll see what I can do.

tjingrant avatar Jan 27 '18 01:01 tjingrant

@tjingrant If you are only using one GPU, I suggest checking out the official ResNet CIFAR-10 example. It is better maintained and uses more standard TensorFlow concepts.

https://github.com/tensorflow/models/tree/master/official/resnet

A multi-GPU example using Estimator is coming very soon. You are welcome to use the benchmark example, but I want you to know there is another example in the model garden that may be easier to follow and more fun to use.

tfboyd avatar Jan 29 '18 18:01 tfboyd