
About distributed training

Open carpedm20 opened this issue 8 years ago • 3 comments

After reading some of the code, it's hard for me to fully understand how distributed training works. I guess 'Experiment' is a wrapper that handles the distributed learning, but I'm not sure, because the example scripts don't include a command for distributed training, like using 8 parameter servers as written in the paper (correct me if I'm wrong). Distributed TensorFlow code usually has keywords like ps and worker, but I can't find those in the code. Can you clarify this?
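For context, what I usually see in distributed TensorFlow code is something like this (a minimal sketch with placeholder hostnames, not code from this repo):

```python
import tensorflow as tf

# Typical explicit ps/worker pattern in TF 1.x; hostnames are placeholders.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
job_name, task_index = "worker", 0  # normally passed in via command-line flags

server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
if job_name == "ps":
    server.join()  # parameter servers just host variables and block forever
else:
    # Workers place variables on the ps tasks and build the model locally.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        global_step = tf.Variable(0, name="global_step", trainable=False)
        # ... model and train_op would go here ...
```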

By the way, there are lots of great snippets here that are usually hard to find in TensorFlow repos. In particular, the use of hooks looks pretty useful for profiling and sampling without slowing down training. Thanks for the great work!
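For example, I imagine even something as small as this is handy (a hypothetical minimal hook sketch, not code from this repo):

```python
import tensorflow as tf

class LogStepHook(tf.train.SessionRunHook):
    """Hypothetical hook: fetch the global step each run and log it every N steps."""

    def __init__(self, every_n=100):
        self._every_n = every_n

    def begin(self):
        # Assumes a global step tensor already exists in the graph.
        self._step_tensor = tf.train.get_global_step()

    def before_run(self, run_context):
        # Piggyback the global step fetch on the training session.run call.
        return tf.train.SessionRunArgs(self._step_tensor)

    def after_run(self, run_context, run_values):
        step = run_values.results
        if step % self._every_n == 0:
            tf.logging.info("global step = %d", step)
```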

carpedm20 avatar Mar 29 '17 10:03 carpedm20

Yeah, I agree, there aren't any good examples of distributed training out there. We use a slightly different configuration for distributed training internally (but based on the same code), so I haven't actually run distributed training on the open source version myself; I just know that it should work.

I'll need to spend a few days to write up a guide for that. I think it's pretty high priority so I'll try to do that in the next few days.
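In the meantime, since training goes through tf.learn's Experiment/learn_runner, the standard mechanism would be the TF_CONFIG environment variable rather than explicit ps/worker flags. Roughly something like this (an untested sketch with placeholder hostnames, not a tested recipe):

```python
import json
import os

# Untested sketch: tf.contrib.learn's RunConfig parses the cluster and this
# task's role out of TF_CONFIG, so every machine runs the same training script
# with only the "task" entry changed. Hostnames below are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "master": ["host0:2222"],
        "ps": ["host1:2222", "host2:2222"],
        "worker": ["host3:2222"],
    },
    "task": {"type": "worker", "index": 0},
    "environment": "cloud",
})

from tensorflow.contrib.learn.python.learn import learn_runner
# learn_runner.run(experiment_fn, output_dir) would then start this process in
# its ps/worker/master role automatically -- no explicit tf.train.Server code.
```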

dennybritz avatar Mar 29 '17 12:03 dennybritz

I'd like to know when this guide will be ready. Thanks for your effort!

kaizhigaosu avatar Apr 06 '17 03:04 kaizhigaosu

Is this guide available somewhere? Thanks!

Sachin19 avatar Nov 15 '17 15:11 Sachin19