cloud icon indicating copy to clipboard operation
cloud copied to clipboard

Tensorflow Cloud Tutorial failed to train model

Open karlunho opened this issue 3 years ago • 3 comments

When using the TensorFlow Cloud Tutorial https://www.tensorflow.org/cloud/tutorials/overview , the model fail to train on GCP. I've added the log file and a screenshot of the error . I'm a google employee and ldap is alankho - happy to share my GCP project with whoever is working on the bug.

The failure happened in Google AI Platform when training the model after about 5 hours

downloaded-logs-20210618-235625.csv

karlunho avatar Jun 19 '21 06:06 karlunho

Tried again with TPUs, but also errored out

downloaded-logs-20210619-002835.csv

karlunho avatar Jun 19 '21 07:06 karlunho

I can't tell if AI Platform is using the GPUs properly because when training, the GPU page show that the utilization is 0%, where as the CPU page shows close to 80%. I've uploaded screenshots for the team.

Screen Shot 2021-06-19 at 8 32 57 AM Screen Shot 2021-06-19 at 8 31 19 AM

karlunho avatar Jun 19 '21 15:06 karlunho

Hi my name is Aarushi Soni . I want to contribute to this issue . Is this issue still open ? I am first time contributor . Please guide me through this process.

aarushisoni avatar Sep 07 '22 12:09 aarushisoni