Zhilin Yang

Results: 39 comments by Zhilin Yang

This seems to be a hyper-parameter tuning issue. Try using more warm-up steps, reducing the learning rate, or setting `div_val` to 1.
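
To illustrate why more warm-up steps can help, here is a minimal sketch of a linear warm-up schedule. It is not taken from the training scripts: the function name `warmup_lr` and its parameters are hypothetical, and holding the rate constant after warm-up is a simplification.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Hypothetical helper: linearly ramp the learning rate from 0 up to
    base_lr over warmup_steps, then hold it constant (a simplification)."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)


# With a longer warm-up, the same early step sees a smaller learning rate,
# which tends to make the first updates less aggressive.
short = warmup_lr(step=500, base_lr=1e-3, warmup_steps=1000)
long = warmup_lr(step=500, base_lr=1e-3, warmup_steps=4000)
print(short, long)
```

Under this sketch, raising `warmup_steps` and lowering `base_lr` both shrink the early updates, which is why either change is worth trying first.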

This seldom happens, and with the given hyper-parameters it should not happen at all. However, when `div_val > 1`, which reduces the word embedding dimensionality by a factor of `div_val` for infrequent words, this...
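
As a rough illustration of what `div_val` does to embedding sizes, the sketch below assumes the vocabulary is split into frequency-ordered clusters and that the reduction compounds per cluster. The function name `cluster_embed_dims` and the cluster count are hypothetical, chosen only for this example.

```python
def cluster_embed_dims(d_embed, div_val, n_clusters):
    """Hypothetical helper: embedding width per vocabulary cluster, assuming
    cluster 0 holds the most frequent words and each later cluster divides
    the width by div_val once more."""
    return [d_embed // (div_val ** i) for i in range(n_clusters)]


# div_val = 1 keeps the full width for every cluster.
print(cluster_embed_dims(1024, 1, 4))
# div_val = 4 shrinks the width for progressively rarer words.
print(cluster_embed_dims(1024, 4, 4))
```

Setting `div_val` to 1 therefore removes the reduced-width embeddings for rare words entirely, at the cost of more parameters.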

I'm not super familiar with Cloud TPUs, because TPU v3 in general has two configurations. Small slices (like the one you use) have 8 cores per host, and large pods...

And both `NUM_CORE=8` and `NUM_CORE=16` result in the same error when you use more than one host?

It looks like you have started only one host instead of four. You would probably need to refer to the Cloud TPU docs for how to start multiple hosts. I have...

These are used to implement adaptive softmax, since TPUs require all tensors to have fixed sizes.

Yes! We have a training script for PixelCNN, and we will release it as soon as possible. Thanks for pointing this out.

Because we did not find it necessary in terms of performance. The reason is as follows: PixelCNN is used in our framework to prevent generating on-manifold samples, but it is...

Thanks for the good catch, and sorry about any inconvenience. I have fixed the bug, and will update the results soon.