anpark
Env: Paddle 1.5.1 with fleet, 1 ps, 2 trainers, using dataset. The ps and worker0 stop successfully, but worker1 core dumps: trainer failed, exit_code=134, pure virtual method called, terminate called without an...
https://github.com/tensorflow/benchmarks/blob/2389369f6b5c9d3241676a728b450e47482966c0/scripts/tf_cnn_benchmarks/benchmark_cnn.py#L1579 Why not change the global_step update op for #199? Make sure only the chief worker can increment global_step, as tf.train.SyncReplicasOptimizer does. @alsrgv @reedwm
Hi, if I have a 5-part input dataset in HDFS and use 5 workers to train for 2 epochs, I think worker 0 reads part-0 for 2 epochs, worker 1 reads...