Jonathan Shen
Can you kill the job and restart it? It should resume training. I've never seen it just stop progressing without any error messages, so I have no idea what could...
The Docker image installs tf-nightly (here: https://github.com/tensorflow/lingvo/blob/e649e651e80ec1ad092a4d6777486ace5ea2c3f9/docker/dev.dockerfile#L75). That might be the source of the problem if you have a different TensorFlow version.
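One quick way to compare your environment against the Docker image is to print which TensorFlow distribution is actually installed. This is a minimal sketch using only the Python standard library (3.8+); `installed_version` is a hypothetical helper name, not part of Lingvo or TensorFlow.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The stable release and the nightly build are separate pip distributions,
# so check both names.
for name in ("tensorflow", "tf-nightly"):
    print(name, installed_version(name))
```

If only `tensorflow` reports a version, your setup differs from the one the dockerfile builds.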
I wonder if it is some kind of threading issue / race condition due to running controller and trainer in the same binary. Internally we always run the jobs as...
I don't think CreateVariable supports variables with None shape. The WeightInit.shape needs to be set explicitly.
The input generator does not prefetch data so that might indeed be the cause.
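To illustrate what prefetching buys you, here is a generic sketch that produces items on a background thread while the consumer works on the previous one. It is plain Python, not Lingvo's input generator API; `prefetch` and `buffer_size` are hypothetical names chosen for this example.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, producing them on a background thread."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # Marks the end of the stream.

    def producer():
        try:
            for item in iterable:
                q.put(item)  # Blocks once `buffer_size` items are waiting.
        finally:
            q.put(sentinel)  # Always unblock the consumer, even on error.

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item
```

Note that this sketch ends the stream quietly if the producer raises; a real pipeline would want to propagate the error.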
Sorry I'm not familiar with GPipe. @bignamehyp ?
I'm trying to ask the domain experts but haven't gotten a reply yet. Judging from the training logs (96% correct next-step predictions), the model should be training fine, so there...
It is not possible to use Lingvo custom ops with standard serving setups, but fortunately most of the custom ops, e.g. GenericInput, are part of the input pipeline for training...
Lingvo includes both Python and C++ code. x_ops.so is C++ code compiled into a shared library for Python to use. When you clone the GitHub repo you are getting the uncompiled...
```
input_tenors = _ToTuple(args)
mini_batch_size = input_tenors[0].get_shape().as_list()[p.batch_dim]
```
Looks like that function expects the inputs to have static shapes for the batch dim. Can you make sure the input has...
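For context, `get_shape().as_list()` returns `None` for any dimension that is not known statically, so indexing the batch dim yields `None` instead of an integer. The sketch below mimics that check on a plain Python list so it runs without TensorFlow; `static_batch_size` is a hypothetical helper, not a Lingvo function.

```python
def static_batch_size(shape, batch_dim=0):
    """Return the batch size from a static shape list, or raise if unknown.

    `shape` stands in for the result of `tensor.get_shape().as_list()`,
    where an unknown dimension appears as None.
    """
    size = shape[batch_dim]
    if size is None:
        raise ValueError(
            "batch dim %d has no static shape; got %r" % (batch_dim, shape))
    return size


print(static_batch_size([32, 128]))  # Batch dim is known statically.
try:
    static_batch_size([None, 128])   # Batch dim is only known at run time.
except ValueError as e:
    print("error:", e)
```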