Zhilin Yang

15 comments by Zhilin Yang

@aditya-malte Thanks for your contribution. It would be nice if you could do the following:
- merge your changes with the original `configure_tpu` function to support all the cases;
- ...

You need to set `init_checkpoint` to be `model/xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt` and `model_dir` to be a new separate folder.
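A minimal sketch of what that invocation might look like, assuming the repo's fine-tuning entry point (`run_classifier.py`) and that the remaining task flags are filled in for your setup; only the two flags named above are the point here:

```shell
# Point fine-tuning at the pretrained weights, and write the new
# task-specific checkpoints to a separate, fresh directory.
python run_classifier.py \
  --init_checkpoint=model/xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt \
  --model_dir=finetune_output_dir \
  ...  # remaining task flags (data dir, spiece model, etc.) omitted
```

Keeping `model_dir` separate matters because the fine-tuning job writes its own checkpoints there; pointing it at the pretrained folder would clobber or conflict with the released weights.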

You can't do eval without training because there are task-specific parameters (the output layer).
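A toy illustration of why (all variable names here are hypothetical, not the repo's actual ones): the pretrained checkpoint holds only backbone weights, while the task head is created fresh at fine-tuning time, so skipping training would evaluate with uninitialized output-layer parameters.

```python
# Hypothetical variable names, for illustration only: the pretrained
# checkpoint covers the backbone, not the task-specific output layer.
pretrained_ckpt = {
    "model/transformer/layer_0/ff/kernel": "...",
    "model/transformer/word_embedding": "...",
}

task_variables = [
    "model/transformer/layer_0/ff/kernel",
    "model/transformer/word_embedding",
    "model/classification_head/logits/kernel",  # task-specific head
]

# Variables the graph needs but the checkpoint cannot provide —
# these only get sensible values through fine-tuning.
missing = [v for v in task_variables if v not in pretrained_ckpt]
print(missing)
```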

Well, I think it's possible, but it does not make much sense.

Does it work if you reduce the batch size, sequence length, or whatever else reduces memory usage?

Afaik, bfloat16 should be used on TPUs.
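The reason bfloat16 works well on TPUs is that it keeps float32's 8-bit exponent (so the dynamic range matches float32) and only shortens the mantissa to 7 bits. A minimal emulation of that layout via bit truncation (TPU hardware rounds rather than truncates; this is just to show which bits survive):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Truncation-based bfloat16 emulation: keep float32's sign and
    8-bit exponent, zero the low 16 bits (leaving a 7-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bfloat16(1.5))      # exactly representable: 1.5
print(to_bfloat16(3.14159))  # low mantissa bits lost: 3.140625
```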

Thanks for your interest. This is under our consideration.

Good question. In fact we are using the implementation that you just mentioned. Sorry about the confusion.

Yes, tokens 1, 2, 3, 6, 7, 8 have bidirectional attention and they attend to all the other tokens, while tokens 4 and 5 use an auto-regressive factorization conditioned on 1, 2, 3, 6, 7, 8. This is what we...
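That attention pattern can be sketched as a simple predicate (1-indexed tokens, illustrative only, not the repo's mask-building code): context tokens see everything, while each target sees the full context plus only the targets predicted before it.

```python
context = {1, 2, 3, 6, 7, 8}  # bidirectional context tokens
targets = [4, 5]              # predicted auto-regressively, in this order

def can_attend(q, k):
    """Whether query token q may attend to key token k."""
    if q in context:
        return True  # context tokens attend to all tokens
    if k in context:
        return True  # targets always see the full context
    # both are targets: strict auto-regressive order among targets
    return targets.index(k) < targets.index(q)

print(can_attend(1, 5))  # True: context sees a target
print(can_attend(5, 4))  # True: 4 precedes 5 in the factorization
print(can_attend(4, 5))  # False: 4 cannot peek at 5
```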

There is no memory overhead, because during inference there is no permutation. In fact, due to the use of relative positional encodings, you can increase `seqlen` to be larger than...
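A toy illustration of why relative encodings permit a longer `seqlen` at inference (this is not the model's actual encoding table, just the indexing idea): attention depends only on the offset `i - j`, and a longer sequence merely extends the range of offsets rather than requiring positions the model never saw.

```python
def rel_offsets(seqlen):
    """All (query, key) relative offsets used at a given length."""
    return {(i, j): i - j for i in range(seqlen) for j in range(seqlen)}

train = rel_offsets(4)  # offsets -3..3
infer = rel_offsets(8)  # offsets -7..7

# Every offset seen at training length also occurs at inference length;
# only new, larger offsets are added.
print(set(train.values()) <= set(infer.values()))  # True
```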