xlnet

Config for TPU pod

Open vochicong opened this issue 4 years ago • 0 comments

I ran train.py on a TPU pod v3-256 and got the following error:

ValueError: TPUConfig.num_shards is not set correctly ....

The Cloud TPU documentation at https://cloud.google.com/tpu/docs/training-on-tpu-pods#providing_the_tpu_name_and_region_to_tpuclusterresolver explains why:

For single device training, you can specify either the TPU name or an IP address, for example: grpc://1.2.3.4:8470. For TPU Pods you must use the TPU name so that TensorFlow can discover the IP addresses of all the hosts available for training distribution.

So for a TPU pod, setting master doesn't work. I tried setting cluster instead and it worked: all 32 hosts in the TPU pod were detected and used correctly.
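For anyone hitting the same error, here is a rough sketch of the change. It is not the actual xlnet code. It assumes TensorFlow 1.14/1.15, where `tf.distribute.cluster_resolver.TPUClusterResolver` and the `cluster` argument of `tf.estimator.tpu.RunConfig` are available. The helper names `pod_topology` and `build_run_config`, the `iterations_per_loop` value, and the choice of setting `num_shards` to the total core count are illustrative assumptions. The 8 cores per host figure matches the 32 hosts seen on a v3-256.

```python
def pod_topology(accelerator_type, cores_per_host=8):
    """Return (num_cores, num_hosts) for a type string such as 'v3-256'.

    Hypothetical helper. Assumes 8 TPU cores per host.
    """
    num_cores = int(accelerator_type.split("-")[1])
    num_hosts = max(1, num_cores // cores_per_host)
    return num_cores, num_hosts


def build_run_config(tpu_name, zone, project, model_dir, num_cores):
    """Hypothetical helper: pass the resolver as `cluster`, not a grpc `master`."""
    # Imported lazily so the arithmetic above works without TensorFlow installed.
    import tensorflow as tf  # assumed: TensorFlow 1.14 or 1.15

    # Resolving by TPU name lets TensorFlow discover every host in the pod.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
        tpu=tpu_name, zone=zone, project=project)

    return tf.estimator.tpu.RunConfig(
        cluster=resolver,  # instead of master="grpc://..."
        model_dir=model_dir,
        tpu_config=tf.estimator.tpu.TPUConfig(
            iterations_per_loop=1000,  # illustrative value
            num_shards=num_cores))     # assumption: one shard per TPU core


print(pod_topology("v3-256"))  # prints (256, 32)
```

`build_run_config` only makes sense on a machine that can reach the TPU. The `pod_topology` call is just a sanity check that the host count TensorFlow reports is the one you expect.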

vochicong • Oct 11 '19 06:10