
Issues with wt103_large_tpu.sh

bmccann opened this issue on Apr 10, 2019 · 12 comments

It seems VLIDA_BSZ should be replaced with VALID_BSZ in a couple of places. For example:

https://github.com/kimiyoung/transformer-xl/blob/e619492d7168d55ed14e443af5e56b9599ee469d/tf/scripts/wt103_large_tpu.sh#L42
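
For reference, something like this is presumably the intended shape (just a sketch; the `--per_host_valid_bsz` flag name is borrowed from the sibling wt103 scripts and may not match the exact flag on that line):

```bash
#!/bin/bash
# Sketch only: define VALID_BSZ once and use the same spelling everywhere.
VALID_BSZ=16

python data_utils.py \
    --dataset=wt103 \
    --per_host_valid_bsz=${VALID_BSZ}  # previously referenced as ${VLIDA_BSZ}
```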

bmccann avatar Apr 10 '19 21:04 bmccann

BSZ also appears to be undefined:

https://github.com/kimiyoung/transformer-xl/blob/e619492d7168d55ed14e443af5e56b9599ee469d/tf/scripts/wt103_large_tpu.sh#L93

Perhaps something like this is intended:

BSZ=$(($TRAIN_BSZ * $NUM_HOST))
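
Spelling it out with the surrounding variables (the values below are just placeholders), the assumed intent is that BSZ is the aggregate batch size across hosts:

```bash
# Sketch of the assumed intent: BSZ is the global batch size,
# derived from the per-host batch size and the number of hosts.
NUM_HOST=4
TRAIN_BSZ=16                   # per-host training batch size
BSZ=$((TRAIN_BSZ * NUM_HOST))  # global batch size, 64 in this example
echo "global batch size: ${BSZ}"
```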

bmccann avatar Apr 10 '19 21:04 bmccann

I'm using a TPU v3, but leaving NUM_CORES=16

https://github.com/kimiyoung/transformer-xl/blob/e619492d7168d55ed14e443af5e56b9599ee469d/tf/scripts/wt103_large_tpu.sh#L10

results in the following error:

ValueError: TPUConfig.num_shards is not set correctly. According to TPU system metadata for Tensorflow master (grpc://...0): num_replicas should be (8), got (64). For non-model-parallelism, num_replicas should be the total num of TPU cores in the system. For model-parallelism, the total number of TPU cores should be num_cores_per_replica * num_replicas. Please set it accordingly or leave it as `None`

Setting NUM_CORES=8 appears to fix this one.

bmccann avatar Apr 10 '19 22:04 bmccann

I'm not super familiar with Cloud TPUs, but TPU v3 in general has two configurations: small slices (like the one you use) have 8 cores per host, and large pods have 16 cores per host. Sorry for the confusion.

kimiyoung avatar Apr 10 '19 22:04 kimiyoung

How do you configure multiple hosts? It seems I also need to set NUM_HOST=1, or I get similar errors about num_shards.

bmccann avatar Apr 10 '19 23:04 bmccann

And both NUM_CORE=8 and NUM_CORE=16 result in the same error when you use more than one host?

kimiyoung avatar Apr 10 '19 23:04 kimiyoung

They are very similar. With 16 cores and 4 hosts:

ValueError: TPUConfig.num_shards is not set correctly. According to TPU system metadata for Tensorflow master (grpc://...): **num_replicas should be (8), got (64)**. For non-model-parallelism, num_replicas should be the total num of TPU cores in the system. For model-parallelism, the total number of TPU cores should be num_cores_per_replica * num_replicas. Please set it accordingly or leave it as `None`

with 8 cores and 4 hosts:

ValueError: TPUConfig.num_shards is not set correctly. According to TPU system metadata for Tensorflow master (grpc://...): **num_replicas should be (8), got (32)**. For non-model-parallelism, num_replicas should be the total num of TPU cores in the system. For model-parallelism, the total number of TPU cores should be num_cores_per_replica * num_replicas. Please set it accordingly or leave it as `None`

Setting NUM_HOST to 1 and NUM_CORE to 8 gives 8*1 num_replicas, and it works out. I'm not sure whether I need to do something before running to start up multiple hosts, or whether the TPU configuration code should be handling that (I'm fairly new to TPUs, as you can probably tell).
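
For completeness, the arithmetic that works in my case (sketch only):

```bash
# Sketch: num_replicas reported by the error is NUM_CORE * NUM_HOST, and it has
# to match the total number of cores in the slice (8 for a single v3-8 donut).
NUM_HOST=1
NUM_CORE=8
echo "num_replicas: $((NUM_CORE * NUM_HOST))"  # 8, matching the TPU system metadata
```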

bmccann avatar Apr 11 '19 01:04 bmccann

It looks like you have fired up only one host instead of four. You would probably need to refer to the Cloud TPU docs for how to start multiple hosts. I have not used Cloud TPUs before.

kimiyoung avatar Apr 11 '19 01:04 kimiyoung

Both v2 and v3 have 1 host if they are donuts (2x2 configuration), with 4 chips and 8 cores (that's 2 cores per chip). Yes, you can get bigger configurations/slices (such as 4x4, 8x8, or a full pod), and those will have more than 1 host.
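
As a quick reference, the script settings implied by those configurations would look roughly like this (sketch only; adjust to your actual slice):

```bash
# Sketch: choose NUM_HOST / NUM_CORE to match the slice, per the comments above.
# 2x2 donut (v2-8 or v3-8): 1 host, 4 chips, 8 cores.
NUM_HOST=1
NUM_CORE=8
# Larger v3 slices/pods have more than one host with 16 cores per host,
# e.g. NUM_HOST=4 and NUM_CORE=16 for a 4-host slice.
```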

ijkilchenko avatar Apr 11 '19 19:04 ijkilchenko

Got it. Thanks to both of you!

bmccann avatar Apr 11 '19 19:04 bmccann

TRAIN_BSZ now seems to be used in an inconsistent way.

In https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/scripts/wt103_large_tpu.sh#L41 it is the per_host_train_batch_size

and in https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/scripts/wt103_large_tpu.sh#L93

it is used as the aggregate batch size, which is then divided by the number of hosts at

https://github.com/kimiyoung/transformer-xl/blob/44781ed21dbaec88b280f74d9ae2877f52b492a5/tf/train.py#L344
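
One way to make the usage consistent (just a sketch of the idea, not a tested patch): keep TRAIN_BSZ as the per-host value everywhere and derive the aggregate explicitly wherever an aggregate is expected. The GLOBAL_TRAIN_BSZ name here is made up for illustration:

```bash
# Sketch: TRAIN_BSZ stays per-host (as on L41 of the script); the aggregate is
# derived once and passed wherever a global batch size is expected (as on L93).
NUM_HOST=4
TRAIN_BSZ=16                                # per-host batch size
GLOBAL_TRAIN_BSZ=$((TRAIN_BSZ * NUM_HOST))  # aggregate that train.py divides by the host count
echo "per-host: ${TRAIN_BSZ}, aggregate: ${GLOBAL_TRAIN_BSZ}"
```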

bmccann avatar Apr 12 '19 23:04 bmccann

The way I've worked around it is to set NUM_HOST=1 (https://github.com/kimiyoung/transformer-xl/blob/master/tf/scripts/wt103_large_tpu.sh#L9), but I agree, there are inconsistencies that make the work hard to reproduce. It seems like the intention switches between a per-host batch size and a global batch size depending on which file you use.

ijkilchenko avatar Apr 12 '19 23:04 ijkilchenko

Right, this is only a problem now that I'm experimenting with NUM_HOST > 1.

bmccann avatar Apr 12 '19 23:04 bmccann