
Training on TPU got stuck

Open stefan-it opened this issue 5 years ago • 5 comments

Hi,

I've seen some strange behavior when training on TPU (v3-8 from TFRC). After 600k steps (using the default parameters for a base model) training got stuck. I could see two different types of error messages:

2020-12-29 06:59:31.171522: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609225171.171066909","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC

2020-12-29 07:00:48.641639: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.

Another case: after resuming from a checkpoint, training also got stuck, but only the following message was shown:

2020-12-28 02:05:51.354876: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.

The first error message comes when using the recently introduced ELECTRIC approach; the second comes when training with the ELECTRA objective. After these error messages, the TPU goes into an IDLE state.

I haven't seen this kind of error message when training previous ELECTRA models (like the one for Turkish). It could be related to the new ELECTRIC code modifications, but I'm currently not 100% sure.

I'll update the issue whenever there are new insights!

stefan-it avatar Dec 29 '20 18:12 stefan-it

The VM instance was created with:

gcloud compute instances create stefan-2 --zone=europe-west4-a --machine-type=n1-standard-2 --image-project=ml-images --image-family=tf-1-15 --scopes=cloud-platform

TPU was created with:

gcloud compute tpus create electra-2 --zone=europe-west4-a --accelerator-type=v3-8 --network=default --range=192.168.3.0/29 --version=1.15

So in both cases, TensorFlow 1.15 is used.
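For anyone debugging something similar: the version the node actually runs can be checked via the gcloud CLI (the tensorflowVersion field below is what I'd expect in the describe output; verify on your own setup):

gcloud compute tpus describe electra-2 --zone=europe-west4-a
# look for a line like:
#   tensorflowVersion: '1.15'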

stefan-it avatar Dec 29 '20 18:12 stefan-it

Hey @stefan-it - good luck with getting an answer. I think the Google Research guys have pretty much abandoned this repo.

@Phil1108 I think we never had a problem like that - right? Do you see a solution or reason for this?

PhilipMay avatar Dec 29 '20 19:12 PhilipMay

Hi @PhilipMay, in the described setup I was using one VM and two (separately) cloned ELECTRA repos to train two models (with disjoint training data and GCP buckets). Maybe that causes problems with the gRPC connections.

I'm currently running only one ELECTRA training on the VM, and resuming the training process has been stable for ~3 days now. Maybe this could be a solution :thinking:

stefan-it avatar Jan 03 '21 01:01 stefan-it

Hi @stefan-it , I have experienced various kinds of stuck TPUs. As a solution I normally use this library https://github.com/shawwn/tpunicorn, which automatically restarts the training every few hours (it is nearly impossible to debug these TPUs, but recreating the TPU and resuming solves the problem quite often). One additional point: you might need to set the TPU TF version explicitly to 1.15.3 rather than letting it pick 1.15.1/.2, since 1.15.3 seems to include some bugfixes for TPUs.
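A rough sketch of that workflow (the pu flags are from my memory of the tpunicorn README, and run_pretraining.sh stands in for whatever your actual training command is - double-check with pu --help):

pip3 install -U tpunicorn
# list your TPUs and their current state
pu list
# recreate the TPU whenever it dies or gets stuck, then rerun the training command
pu babysit electra-2 --zone europe-west4-a -c "bash run_pretraining.sh"

And for pinning the patch release instead of the floating 1.15 alias, the creation command from above with an explicit version:

gcloud compute tpus create electra-2 --zone=europe-west4-a --accelerator-type=v3-8 --network=default --range=192.168.3.0/29 --version=1.15.3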

Phil1108 avatar Jan 04 '21 14:01 Phil1108

Hi @Phil1108 ,

thanks for your tips! Things got better once I used only one TPU training per VM and a more recent version of the 1.15 branch (I used 1.15.4). I initially thought that specifying 1.15 would always use the latest patch version, but this is not the case (I checked it via the CLI).
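If recreating the node is not an option, an existing node can (assuming I remember the subcommand correctly) also be reimaged to a pinned patch release:

gcloud compute tpus reimage electra-2 --zone=europe-west4-a --version=1.15.4
# confirm afterwards:
gcloud compute tpus describe electra-2 --zone=europe-west4-a | grep tensorflowVersion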

The workaround for my recently released language models (Europeana ConvBERT and Turkish ConvBERT) was to use v3-32 TPUs, so that the training time could be reduced to ~3.5 days.

stefan-it avatar Mar 18 '21 10:03 stefan-it