Training problem
@ConnorJL Thanks for the great work.
Unfortunately, I found that my training on OpenWebTextCorpus converges too slowly, even for the 117M model. The cross-entropy loss drops rapidly for the first 10k steps with a batch size of 64, but after that it plateaus around 3.0. Is this a known phenomenon, or could it be a dataset problem? I also noticed that the loss in model_fns is not shifted. Shouldn't it be loss_batch = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output["logits"][:, :-1], labels=features[:, 1:])?
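For reference, here is a minimal sketch of the shifted loss I have in mind. The names follow the snippet above (output["logits"] and features); the function name and the final mean reduction are my own assumptions, not the repo's exact code:

```python
import tensorflow as tf

def shifted_lm_loss(logits, features):
    # logits: [batch, seq_len, vocab] from the model; features: [batch, seq_len] input ids.
    # Drop the last time step of the logits and the first token of the labels,
    # so that position t is trained to predict token t+1.
    loss_batch = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits[:, :-1], labels=features[:, 1:])
    return tf.reduce_mean(loss_batch)
```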
Unfortunately, this is a known phenomenon, and I haven't been able to fix it. I perform the shifting of the labels in the input function (it's done in an ugly way, and I'd do it differently now, but the effect should be the same). If the labels weren't shifted at all, the model would converge to near-zero loss very rapidly, since it would just be copying its input. I'm very open to other ideas about what might be causing this. Maybe it is the dataset after all?
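To illustrate what I mean by shifting in the input function rather than in the loss, here is a hypothetical sketch (not the repo's actual input_fn; it assumes TFRecords with an int64 "tokens" feature of length seq_len + 1, and the helper names are made up). Either place works, as long as the shift happens exactly once:

```python
import tensorflow as tf

def _to_example(tokens):
    # Shift in the data pipeline: the model sees tokens[:-1] as input
    # and is trained to predict tokens[1:].
    return tokens[:-1], tokens[1:]

def input_fn(file_pattern, seq_len, batch_size):
    ds = tf.data.TFRecordDataset(tf.io.gfile.glob(file_pattern))
    ds = ds.map(lambda rec: tf.io.parse_single_example(
        rec, {"tokens": tf.io.FixedLenFeature([seq_len + 1], tf.int64)})["tokens"])
    ds = ds.map(_to_example).batch(batch_size, drop_remainder=True).repeat()
    return ds
```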