
Resnet56 with CIFAR-10 produces huge log file

Open tjingrant opened this issue 7 years ago • 10 comments

Hi,

I'm trying to train resnet56 on CIFAR-10 with the following params. However, each time I start the run, it creates a log file of 1.2G or 2.4G. I have to restart this training frequently, so the logs quickly grow unmanageable. In contrast, each log file Resnet50 creates on the ImageNet dataset is around 30-40MB, which is much more manageable...

Also when I view them with Tensorboard, I can only see projector, not scalars like loss.

Do you have any idea what's going on?

		--model=resnet56 \
		--batch_size=128 \
		--num_epochs=$i \
		--num_gpus=1 \
		--data_dir=/mnt/nfs/cifar-10/cifar-10-batches-py \
		--data_name=cifar10 \
		--variable_update="replicated" \
		--train_dir=resnet56/ \
		--all_reduce_spec=nccl \
		--print_training_accuracy=True \
		--optimizer="momentum" \
		--piecewise_learning_rate_schedule="0.1;250;0.01;375;0.001" \
		--momentum=0.9 \
		--weight_decay=0.0001 \
		--summary_verbosity=1 \
		--save_summaries_steps=200 \
		--save_model_secs=600 \

tjingrant avatar Jan 26 '18 05:01 tjingrant

As you can see, the verbosity is set to 1, the same value I used when training Resnet50 on ImageNet.

tjingrant avatar Jan 26 '18 05:01 tjingrant

I believe the issue is that in preprocessing.py, we have a tf.constant node with all the images, which takes up more than a gigabyte of space in the graph.pbtxt that's written out to disk.

This is not a priority for us, so I'm marking this as contributions welcome if anyone wants to work on it. To solve this, we should store the images in a variable and initialize that variable with the images without storing them in the graph. One way of doing this is to use a feed dict. Another would be to use tf.data.

reedwm avatar Jan 26 '18 18:01 reedwm
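For readers landing here, a minimal sketch of the variable-plus-feed-dict pattern described above, assuming the TF 1.x APIs of the time (the array shape and the images_ph/images_var names are illustrative, not taken from preprocessing.py):

```python
import numpy as np
import tensorflow as tf  # TF 1.x, matching the benchmarks code of that era

# Hypothetical stand-in for the CIFAR-10 array loaded in preprocessing.py.
all_images = np.zeros((50000, 32, 32, 3), dtype=np.uint8)

# A placeholder contributes no data to the GraphDef; only its shape/dtype
# are serialized. Feeding it once at variable-init time keeps the ~1GB of
# pixels out of graph.pbtxt.
images_ph = tf.placeholder(tf.uint8, shape=all_images.shape)

# collections=[] keeps the variable out of GLOBAL_VARIABLES, so a blanket
# global_variables_initializer() won't try to run this initializer unfed.
images_var = tf.Variable(images_ph, trainable=False, collections=[])

with tf.Session() as sess:
    # Initialize the variable by feeding the real data, instead of
    # embedding it as a tf.constant in the graph.
    sess.run(images_var.initializer, feed_dict={images_ph: all_images})
```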

@reedwm thanks, I could look into this one. So is the graph definition also written to the tfevents logs?

My concern is whether this solves the huge tfevents file problem...

tjingrant avatar Jan 26 '18 18:01 tjingrant

You're right, the events.out.tfevents file is also huge. I'm not sure whether the graph definition is written to it. The issue goes away if I omit --data_dir.

Also, in TensorBoard, I was able to see the scalar by clicking "Inactive", then "Scalars."

reedwm avatar Jan 26 '18 19:01 reedwm
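On the open question above: in TF 1.x the graph definition does end up in the events file whenever a graph is handed to the summary writer, which training scripts of this era commonly did (directly or via tf.train.Supervisor). A minimal illustration, with a placeholder log directory:

```python
import tensorflow as tf  # TF 1.x

# Passing a graph to FileWriter serializes the entire GraphDef -- including
# any giant tf.constant -- into events.out.tfevents.*, which would explain
# the events file matching graph.pbtxt in size. "/tmp/logdir" is a placeholder.
writer = tf.summary.FileWriter("/tmp/logdir", graph=tf.get_default_graph())
writer.close()
```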

@jsimsa can you shed some light on this situation? Your name appears in the Cifar10ImagePreprocessor class...

tjingrant avatar Jan 26 '18 19:01 tjingrant

Doing what @reedwm suggests makes sense.

jsimsa avatar Jan 26 '18 22:01 jsimsa

@jsimsa, I wonder if doing what @reedwm suggests could also solve the huge tfevents file problem?

tjingrant avatar Jan 26 '18 22:01 tjingrant

I have confirmed that the tf.constant node is also causing the large tfevents file: after changing the line all_images = tf.constant(all_images) to all_images = tf.constant(all_images[:200, ...]) in preprocessing.py, the tfevents file was only a few megabytes.

IMO, changing the constant to a variable and initializing it using a feed dict is the easiest way to solve this problem.

reedwm avatar Jan 26 '18 23:01 reedwm
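For completeness, the tf.data route mentioned earlier avoids the embedded constant the same way: route the arrays through placeholders and an initializable iterator, rather than calling from_tensor_slices on the raw NumPy arrays (which would still bake them into the graph as constants). A sketch with illustrative names; the batch size comes from the flags above and 50000 is the CIFAR-10 training-set size:

```python
import numpy as np
import tensorflow as tf  # TF 1.x

# Hypothetical stand-ins for the arrays loaded in preprocessing.py.
all_images = np.zeros((50000, 32, 32, 3), dtype=np.uint8)
all_labels = np.zeros((50000,), dtype=np.int32)

# Placeholders keep the data out of the GraphDef.
images_ph = tf.placeholder(all_images.dtype, all_images.shape)
labels_ph = tf.placeholder(all_labels.dtype, all_labels.shape)

dataset = (tf.data.Dataset.from_tensor_slices((images_ph, labels_ph))
           .shuffle(50000)
           .repeat()
           .batch(128))
iterator = dataset.make_initializable_iterator()
images, labels = iterator.get_next()

with tf.Session() as sess:
    # The real data is supplied only at iterator-init time, via feed_dict.
    sess.run(iterator.initializer,
             feed_dict={images_ph: all_images, labels_ph: all_labels})
```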

@reedwm thanks a lot for looking into this issue, I'll see what I can do.

tjingrant avatar Jan 27 '18 01:01 tjingrant

@tjingrant If you are only using one GPU, I suggest checking out the official ResNet CIFAR-10 example. It is better maintained and uses more standard TensorFlow concepts.

https://github.com/tensorflow/models/tree/master/official/resnet

A multi-GPU example using Estimator is coming very soon. You are welcome to use the benchmark example, but I want you to know there is another example in the model garden that may be easier to follow and more fun to use.

tfboyd avatar Jan 29 '18 18:01 tfboyd