
Reproduce performance

onratlgn opened this issue 4 years ago · 4 comments

I'm writing custom code to implement SimCLR in Keras by overriding the train step, which is supported in TensorFlow >= 2.2.
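For reference, a minimal sketch of the `train_step` override I mean (this is my own illustrative stand-in, not the repo's code; the loss here is a placeholder, whereas SimCLR uses the NT-Xent contrastive loss):

```python
import tensorflow as tf

class CustomStepModel(tf.keras.Model):
    """Sketch of overriding train_step (supported since TF 2.2) so that
    model.fit() drives a hand-written training step."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.loss_tracker = tf.keras.metrics.Mean(name="loss")
        # Placeholder loss; a real SimCLR step would compute NT-Xent
        # over two augmented views of each image.
        self.loss_fn = tf.keras.losses.MeanSquaredError()

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.loss_fn(y, y_pred)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}

    @property
    def metrics(self):
        # Reset the tracker automatically at each epoch start.
        return [self.loss_tracker]
```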

One step of my model takes ~280 ms to execute on a TPU (v3-32) with a per-replica batch size of 128. I also tested the code provided in the repo (which uses TPUEstimator); it takes ~470 ms for the same configuration (v3-32).

However, in my setup the TPU sits idle while the host machine (a Deep Learning VM) prepares the input, so one step ends up taking ~2 s. Can you please share the host configuration used in the benchmark that claims training for 100 epochs on a v3-32 TPU takes 6 hours?

[Trace view of the repo code]

[Trace view of the custom Keras model]

— onratlgn, Jul 19 '20

The 6-hour benchmark was run using the command in the README for pretraining ResNet-50 on ImageNet. However, that was run a while ago, before some recent code updates, so it's possible the defaults have changed; my best guess is that num_proj_layers now defaults to 3 instead of 2, which increases FLOPs. That should only have a slight effect, though. It's also possible the input pipeline is a bottleneck, in which case increasing the number of CPU cores could help. How long did it take you to train for 100 epochs?
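If the input pipeline is the bottleneck, the usual first step is to parallelize the CPU-side work and prefetch so host preparation overlaps with the device step. A minimal sketch (the `augment` function is a hypothetical stand-in for SimCLR's crop/color augmentations, and `tf.data.AUTOTUNE` assumes TF >= 2.4; earlier versions use `tf.data.experimental.AUTOTUNE`):

```python
import tensorflow as tf

def augment(image):
    # Stand-in for SimCLR's random crop / color-distortion augmentations.
    return tf.image.random_flip_left_right(image)

def make_dataset(images, batch_size=128):
    ds = tf.data.Dataset.from_tensor_slices(images)
    ds = ds.shuffle(1024)
    # Run augmentation on multiple CPU threads in parallel.
    ds = ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size, drop_remainder=True)
    # Overlap host-side input prep with the accelerator's current step.
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds
```

If the TPU trace still shows idle gaps after this, the host is likely CPU-bound and more cores (or a faster data source such as sharded TFRecords on GCS) would be the next thing to try.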

— chentingpc, Jul 20 '20

I changed the parameters to match the first paper: num_proj_layers is 2, learning_rate_scaling is 'linear', etc. It still takes 1 to 2 days, and it doesn't improve even with a larger TPU, so I think it might be the host device. I'm using a 32-core-CPU Deep Learning VM as the host. I'm wondering which machine you used as the host in the benchmark: number of CPU cores, machine type, etc. I'm also training on GCP, so if your host was on the platform, could you please provide its configuration if possible?

— onratlgn, Jul 20 '20

Unfortunately I don't recall the configuration I used a few months ago (I remember it was pretty standard, but...). I will try to find some time to test it again, but I can't guarantee when I'll get to that.

— chentingpc, Jul 24 '20

Hi, I am having the same problem. I am also using the DeepLearning VM with v3-8. My TPU utilization is 0.21% when running the example code. I'm not sure what the problem is.

— pasudyan, Dec 02 '20