Bryan McCann

9 comments by Bryan McCann

Well, that's no good. Let me try running your exact command on my side to see if I get the same thing. Do you know which iteration this first started...

Multi-GPU support broke when I added the task-specific validation metrics for each task. So we'll have to find a way to get around the problems that creep up or we'll...

BSZ also appears to be undefined: https://github.com/kimiyoung/transformer-xl/blob/e619492d7168d55ed14e443af5e56b9599ee469d/tf/scripts/wt103_large_tpu.sh#L93

Perhaps something like this is intended:

```
BSZ=$(($TRAIN_BSZ * $NUM_HOST))
```
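
For context, a minimal sketch of what that assignment would do, assuming TRAIN_BSZ is the per-host batch size and NUM_HOST is the number of TPU hosts (the concrete values below are placeholders, not from the script):

```
# Hedged sketch: derive the aggregate batch size from the per-host batch
# size and the number of hosts. Values are placeholders.
TRAIN_BSZ=16                         # per-host training batch size (example)
NUM_HOST=4                           # number of TPU hosts (example)
BSZ=$(($TRAIN_BSZ * $NUM_HOST))      # aggregate batch size across all hosts
echo "aggregate batch size: ${BSZ}"  # prints 64 for these example values
```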

I'm using a TPUv3, but leaving NUM_CORES=16 as set in https://github.com/kimiyoung/transformer-xl/blob/e619492d7168d55ed14e443af5e56b9599ee469d/tf/scripts/wt103_large_tpu.sh#L10 results in the following error:

```
ValueError: TPUConfig.num_shards is not set correctly. According to TPU system metadata for Tensorflow master (grpc://...0):...
```

How do you configure multiple hosts? It seems I also need to set NUM_HOSTS=1, or I get similar errors about num_shards.
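
For anyone hitting the same num_shards error on a single device, a hedged sketch of the settings that seem to be expected for one TPUv3-8 (variable names follow the comments above; the values are my assumption, not confirmed by the script's authors):

```
# Hedged sketch for a single TPUv3-8: one host exposing 8 cores.
# NUM_HOSTS / NUM_CORES are the names used in the comments above;
# adjust to whatever the script actually calls them.
NUM_HOSTS=1   # a single TPUv3-8 is one host
NUM_CORES=8   # that host reports 8 cores
```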

They are very similar. With 16 cores and 4 hosts:

```
ValueError: TPUConfig.num_shards is not set correctly. According to TPU system metadata for Tensorflow master (grpc://...): num_replicas should be (8),...
```
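
The check reads like a comparison between the shard count the script requests and the replica count the TPU system metadata reports, which is 8 for a single TPUv3-8. A hedged illustration of the mismatch (exactly how the script derives the requested count is an assumption on my part):

```
# Hedged sketch: whichever way the script derives it, the shard count
# handed to TPUConfig has to equal what the TPU system reports.
REPORTED_REPLICAS=8    # what a single TPUv3-8 reports
REQUESTED_SHARDS=16    # e.g. from NUM_CORES=16 (assumed derivation)
if [ "$REQUESTED_SHARDS" -ne "$REPORTED_REPLICAS" ]; then
  echo "mismatch: TPUConfig.num_shards=${REQUESTED_SHARDS}, system reports ${REPORTED_REPLICAS}"
fi
```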

Got it. Thanks to both of you!

TRAIN_BSZ now seems to be used inconsistently: in https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/scripts/wt103_large_tpu.sh#L41 it is the per_host_train_batch_size, and in https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/scripts/wt103_large_tpu.sh#L93 it is used as the aggregate batch size, which is then...
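
One way to keep the two call sites consistent, sketched under the assumption that TRAIN_BSZ should stay per-host everywhere and the aggregate should be derived explicitly where it is needed (AGG_BSZ is a hypothetical name, not from the script):

```
# Hedged sketch: keep TRAIN_BSZ strictly per-host and derive the
# aggregate separately, so the usages at L41 and L93 no longer disagree.
TRAIN_BSZ=16                          # per-host batch size, as used around L41
NUM_HOST=4
AGG_BSZ=$(($TRAIN_BSZ * $NUM_HOST))   # aggregate batch size, for the usage around L93
```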

Right, this is only a problem now that I'm experimenting with NUM_HOST > 1.