Jonathan Shen

85 comments by Jonathan Shen

Actually, I think that might be the right speed. We get ~1.1s per step on 16 P100s, which would be ~5s per step on 4 P100s, and the P100 is supposedly...
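The estimate above can be checked with simple arithmetic, assuming synchronous data-parallel training scales roughly linearly with GPU count (communication overhead ignored):

```python
# Back-of-the-envelope check of the step-time estimate, assuming
# near-linear scaling with GPU count (an idealization).
step_time_16 = 1.1   # observed seconds per step on 16 P100s
observed_gpus = 16
target_gpus = 4

# The same global batch spread over 4x fewer GPUs means ~4x the
# per-GPU work, so the step time scales by the GPU ratio.
estimated_step_time = step_time_16 * observed_gpus / target_gpus
print(round(estimated_step_time, 1))  # 4.4, i.e. roughly the ~5s quoted
```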

Hm, that's not good news... Unfortunately it's very hard to debug performance issues remotely. Please try out some TensorFlow profiling options to see if the GPUs are being utilized efficiently...

It seems like there is a dip in the GPU graph every 6-8 seconds, which also corresponds to your step time. Could it be that the CPU/disk cannot keep up...
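The dips would be consistent with an input-pipeline bottleneck: the trainer stalls whenever data isn't ready. A toy pure-Python sketch of the idea (the loader, trainer, and timings here are all made up for illustration; in TensorFlow the analogous fix is prefetching in the input pipeline):

```python
import queue
import threading
import time

# Toy producer/consumer model of an input pipeline. If the loader's
# throughput keeps up on average, a bounded prefetch queue hides its
# latency and the "trainer" never stalls mid-run.

def loader(q, n_batches, load_time):
    # Simulates disk reads / CPU preprocessing producing batches.
    for i in range(n_batches):
        time.sleep(load_time)
        q.put(i)
    q.put(None)  # sentinel: no more data

def train(q, step_time):
    steps = 0
    while True:
        batch = q.get()  # blocks (the "GPU dip") if the queue is empty
        if batch is None:
            break
        time.sleep(step_time)  # simulates the GPU step
        steps += 1
    return steps

q = queue.Queue(maxsize=4)  # bounded prefetch buffer
t = threading.Thread(target=loader, args=(q, 10, 0.001))
t.start()
steps = train(q, step_time=0.002)
t.join()
print(steps)  # 10
```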

Sorry, I tried checking with a few people on our side and we didn't have any other hypothesis for why it's slower for you :(

I believe that depends on whether you are running in sync mode (the default) or async mode. In sync mode all GPUs need to complete their computation before the step...
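A small numeric sketch of the difference (the per-replica times are hypothetical): in sync mode the step completes only when the slowest replica finishes, so one straggler gates everyone, while in async mode each replica proceeds at its own pace.

```python
# Hypothetical per-replica compute times for one step, in seconds.
per_gpu_times = [1.0, 1.05, 1.1, 2.3]

# Sync mode: all GPUs must finish before the step completes,
# so the straggler sets the step time.
sync_step_time = max(per_gpu_times)

# Async mode: each replica applies its update independently, so the
# average pace is closer to the mean per-replica time.
async_mean_time = sum(per_gpu_times) / len(per_gpu_times)

print(sync_step_time)  # 2.3
print(round(async_mean_time, 2))
```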

I asked around, and got this: > It improves WSJ / LibriSpeech for LAS as it tells the model explicitly, "end of utterance", in the early days. > > But...

This is a known issue due to the move to Python 3 support in Bazel. We'll wait a few weeks, then change everything to Python 3 by default, and that should fix things.

Sorry, we are aware that run_distributed has some problems, but we don't have the resources to fix it at the moment. If someone is able to create a pull request that...

If you have 3 workers with 8 GPUs each, you should set worker_gpus=8 and worker_replicas=3, and leave the rest at their defaults.

It needs to match your physical cluster setup. worker_replicas is the number of training worker jobs you are running, and worker_gpus is the number of GPUs each training...
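As a concrete sketch of the 3-worker, 8-GPU setup described above (the trainer binary name and any flags other than the two from these comments are assumptions, not confirmed by the source):

```shell
# Hypothetical launch for a cluster of 3 worker jobs with 8 GPUs each;
# worker_gpus/worker_replicas are the flags from the comments above,
# everything else is left at its default.
t2t-trainer \
  --worker_gpus=8 \
  --worker_replicas=3
```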