Bryan McCann
Well that's no good. Let me try running your exact command on my side to see if I get the same thing. Do you know which iteration this first started...
Multi-GPU support broke when I added the task-specific validation metrics for each task. So we'll have to find a way to get around the problems that creep up or we'll...
BSZ also appears to be undefined: https://github.com/kimiyoung/transformer-xl/blob/e619492d7168d55ed14e443af5e56b9599ee469d/tf/scripts/wt103_large_tpu.sh#L93 Perhaps something like this is intended:
```
BSZ=$(($TRAIN_BSZ * $NUM_HOST))
```
I'm using a TPUv3, but leaving NUM_CORES=16 (https://github.com/kimiyoung/transformer-xl/blob/e619492d7168d55ed14e443af5e56b9599ee469d/tf/scripts/wt103_large_tpu.sh#L10) results in the following error:
```
ValueError: TPUConfig.num_shards is not set correctly. According to TPU system metadata for Tensorflow master (grpc://...0):...
```
How do you configure multiple hosts? It seems I also need to set NUM_HOSTS=1, or I get similar errors about num_shards.
They are very similar. With 16 cores and 4 hosts:
```
ValueError: TPUConfig.num_shards is not set correctly. According to TPU system metadata for Tensorflow master (grpc://...): num_replicas should be (8),...
```
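For what it's worth, here is a hedged sketch of settings that would match that metadata, assuming the script multiplies hosts by cores per host to get TPUConfig.num_shards, and using the variable names from the messages above:
```
# Hedged sketch: the host/core product has to equal the number of
# replicas the TPU runtime reports. A single TPUv3-8 exposes 8 cores
# on 1 host, so:
NUM_HOSTS=1   # one TPU host attached
NUM_CORES=8   # cores per host on a v3-8
# A larger slice (e.g. a v3-32 spanning 4 hosts of 8 cores each)
# would scale NUM_HOSTS instead of NUM_CORES.
```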
Got it. Thanks to both of you!
TRAIN_BSZ now seems to be used in an inconsistent way. In https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/scripts/wt103_large_tpu.sh#L41 it is the per_host_train_batch_size and in https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/scripts/wt103_large_tpu.sh#L93 it is used as the aggregate batch size, which is then...
Right, this is only a problem now that I'm experimenting with NUM_HOST > 1.
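One way I could imagine making the two uses consistent (a hedged sketch with a placeholder value and an illustrative TOTAL_BSZ name, not necessarily what the script intends): keep TRAIN_BSZ as the per-host batch size, as at L41, and derive the aggregate explicitly wherever the total across hosts is what the code expects, reusing the multiplication from the BSZ suggestion above.
```
# Hedged sketch: treat TRAIN_BSZ as per-host everywhere and compute
# the aggregate once, so the L41-style and L93-style uses can't diverge.
TRAIN_BSZ=16                           # per-host batch size (placeholder value)
TOTAL_BSZ=$(($TRAIN_BSZ * $NUM_HOST))  # aggregate batch size across hosts
```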