Zhilin Yang comments

Results: 39 comments of Zhilin Yang

Training with wordpiece/bpe vocab

This seems to be a hyper-parameter tuning issue. Try using more warm-up steps, reducing the learning rate, or setting `div_val` to 1.
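To make the warm-up suggestion concrete, here is a minimal sketch of a linear warm-up schedule. The function name `warmup_lr` and the example values are hypothetical and are not taken from the repository's training scripts; the sketch also holds the rate constant after warm-up, whereas a real schedule may decay it.

```python
def warmup_lr(step, peak_lr, warmup_steps):
    """Linear warm-up: ramp from 0 to peak_lr over warmup_steps, then hold.

    Hypothetical helper for illustration only; the actual training code
    may use a different schedule (for example, decay after warm-up).
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


if __name__ == "__main__":
    # More warm-up steps means a gentler ramp toward the same peak rate.
    for steps in (1000, 4000):
        print(steps, [warmup_lr(s, 2.5e-4, steps) for s in (0, 500, 1000, 4000)])
```

Raising `warmup_steps` or lowering `peak_lr` both reduce the size of the early updates, which is where training tends to diverge.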

Sensitivity to initial weights causing NANs?

This seldom happens, and with the given hyper-parameters it should not happen at all. However, when `div_val > 1`, which reduces the word embedding dimensionality by a factor of `div_val` for infrequent words, this...
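A small sketch of what `div_val` does to embedding sizes may help here. It assumes the vocabulary is split into frequency clusters and that each successive (less frequent) cluster divides the embedding dimension by another factor of `div_val`; the helper name `cluster_embedding_dims` and the example numbers are hypothetical.

```python
def cluster_embedding_dims(d_embed, div_val, n_clusters):
    """Embedding dimension per frequency cluster, most frequent first.

    Assumption for illustration: cluster i uses d_embed // div_val**i,
    so rarer words get progressively smaller embeddings.
    """
    return [d_embed // (div_val ** i) for i in range(n_clusters)]


if __name__ == "__main__":
    print(cluster_embedding_dims(1024, 4, 4))  # [1024, 256, 64, 16]
    print(cluster_embedding_dims(1024, 1, 4))  # [1024, 1024, 1024, 1024]
```

With `div_val` set to 1, every cluster keeps the full embedding dimension, which is the setting suggested elsewhere in these comments when training is unstable.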

Issues with wt103_large_tpu.sh

I'm not super familiar with Cloud TPUs, because TPU v3 in general has two configurations. Small slices (like the one you use) have 8 cores per host, and large pods...

Issues with wt103_large_tpu.sh

And both `NUM_CORE=8` and `NUM_CORE=16` result in the same error when you use more than one host?

Issues with wt103_large_tpu.sh

It looks like you have started only one host instead of four. You would probably need to refer to the Cloud TPU docs for how to start multiple hosts. I have...

parameters in tf code

These are used to implement adaptive softmax, since TPUs require all tensors to have fixed sizes.

Is it possible to train the pixelCNN with this code base, rather than using the supplied download url?

Yes! We have a training script for pixelCNN, and we will release it asap. Thanks for pointing this out.

Why do you not use pixelCNN for CIFAR experiments?

Because we did not find it necessary in terms of performance. The reason is as follows: PixelCNN is used in our framework to prevent generating on-manifold samples, but it is...

max_common is wrong?

Thanks for the good catch, and sorry about any inconvenience. I have fixed the bug, and will update the results soon.