tf_repos
Distributed training
One more question: worker-1 keeps waiting on
INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
Part of the worker-0 log:
INFO:tensorflow:Saving checkpoints for 25076 into /workspace/wlc/model_dir/model.ckpt.
INFO:tensorflow:global_step/sec: 7.5244
E0712 18:25:49.778093 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 84.10749, average_loss = 0.65708977 (31.580 sec)
INFO:tensorflow:loss = 84.10749, step = 25285 (31.580 sec)
E0712 18:26:20.777756 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 81.15384, average_loss = 0.63401437 (25.918 sec)
worker-1 keeps waiting:
TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-worker-1-0grc9:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'evaluator', 'index': 0}}
INFO:tensorflow:Using config: {'_num_worker_replicas': 0, '_num_ps_replicas': 0, '_global_id_in_cluster': None, '_master': '', '_save_checkpoints_steps': 1000, '_session_config': device_count { key: "CPU" value: 1 } device_count { key: "GPU" } , '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 1000, '_keep_checkpoint_max': 5, '_log_step_count_steps': 1000, '_service': None, '_save_checkpoints_secs': None, '_is_chief': False, '_tf_random_seed': None, '_model_dir': '/workspace/wlc/model_dir/', '_evaluation_master': '', '_task_id': 0, '_cluster_spec': , '_task_type': 'evaluator'}
INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999588 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999654 secs before starting next eval run.
(the "not trained yet" warning and the ~1800 s wait then repeat)
worker-2 ran successfully:
INFO:tensorflow:loss = 84.003555, average_loss = 0.6562778 (26.016 sec)
INFO:tensorflow:loss = 84.003555, step = 25914 (26.016 sec)
INFO:tensorflow:Loss for final step: 84.82182.
ps_host ['tensorflow-wanglianchen-144-16-ps-0:2222']
worker_host ['tensorflow-wanglianchen-144-16-worker-2:2222']
chief_hosts ['tensorflow-wanglianchen-144-16-worker-0:2222']
{"task": {"index": 0, "type": "worker"}, "cluster": {"ps": ["tensorflow-wanglianchen-144-16-ps-0:2222"], "worker": ["tensorflow-wanglianchen-144-16-worker-2:2222"], "chief": ["tensorflow-wanglianchen-144-16-worker-0:2222"]}}
model_type:wide_deep
train_samples_num:3000000
Parsing /workspace/wlc/wide_deep_dist/data/train.csv
1.0hours task train success. modeldir=/workspace/wlc,modelname=model_dir
ps-0 log:
start checkWorkerIsFinish
TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-ps-0-jrngn:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'ps', 'index': 0}}
INFO:tensorflow:Using config: {'_cluster_spec': , '_task_id': 0, '_model_dir': '/workspace/wlc/model_dir/', '_service': None, '_session_config': device_count { key: "CPU" value: 1 } device_count { key: "GPU" } , '_save_summary_steps': 1000, '_is_chief': False, '_save_checkpoints_secs': None, '_master': 'grpc://tensorflow-wanglianchen-144-16-ps-0:2222', '_global_id_in_cluster': 2, '_evaluation_master': '', '_keep_checkpoint_max': 5, '_save_checkpoints_steps': 1000, '_task_type': 'ps', '_tf_random_seed': None, '_num_worker_replicas': 2, '_log_step_count_steps': 1000, '_num_ps_replicas': 1, '_keep_checkpoint_every_n_hours': 10000}
INFO:tensorflow:Start Tensorflow server.
2018-07-12 17:26:33.154403: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-12 17:26:33.160418: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> tensorflow-wanglianchen-144-16-worker-0:2222}
2018-07-12 17:26:33.160444: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-12 17:26:33.160463: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tensorflow-wanglianchen-144-16-worker-2:2222}
2018-07-12 17:26:33.164749: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
Hi, how do I run distributed training? Suppose I have 3 machines; what exactly do I need to do? Could you provide detailed steps? Thanks a lot.
worker-1 is 'task': {'type': 'evaluator', 'index': 0}. How often is the checkpoint saved?
See run_dist.sh.
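For orientation, here is a minimal sketch of what a run_dist.sh-style launch amounts to, matching the logs above: every machine exports a TF_CONFIG with the same 'cluster' map but its own 'task', then starts the identical training script. The hostnames and the entry-point name below are placeholders, not the repo's actual values:

```python
import json
import os

# The same cluster map goes to every machine; only 'task' differs per process.
# "host-a/b/c" are placeholders for your three machines' addresses.
cluster = {
    "chief":  ["host-a:2222"],   # runs training and writes checkpoints
    "worker": ["host-b:2222"],   # runs training
    "ps":     ["host-c:2222"],   # parameter server holding shared weights
}

role = {"type": "chief", "index": 0}       # on host-a
# role = {"type": "worker", "index": 0}    # on host-b
# role = {"type": "ps", "index": 0}        # on host-c
# role = {"type": "evaluator", "index": 0} # optional extra process, as in
#                                          # the worker-1 log above

os.environ["TF_CONFIG"] = json.dumps({"cluster": cluster, "task": role})
# With TF_CONFIG set, every machine runs the same entry point, e.g.:
#   python train.py --model_dir=/workspace/wlc/model_dir/
```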
Thanks. A follow-up: if I train the model on a distributed cluster, do I need to split the data manually? For example, with 3 machines, does the raw data need to be split into 3 parts, one stored on each machine? Or is no split needed and every machine uses the full dataset? (In that case every machine would process all the data, which doesn't seem to give any distributed speedup.)
@lambdaji OK, I set one of those two intervals (_save_checkpoints_secs / save_checkpoints_steps). There is still a problem, but thank you very much; I'll spend the next few days troubleshooting.
@Welchkimi Then you should read up on synchronous vs. asynchronous modes in distributed training.
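For anyone hitting the same wait: in tf.estimator.train_and_evaluate, the chief writes checkpoints at the RunConfig interval (the logs above show '_save_checkpoints_steps': 1000), while the evaluator wakes up at most once per EvalSpec.throttle_secs (1800 s in the logs) and only runs once a new checkpoint exists. A minimal self-contained sketch of how those knobs fit together; the toy model and the concrete numbers are placeholders, not the repo's code:

```python
import numpy as np
import tensorflow as tf

def model_fn(features, labels, mode):
    # Tiny linear model, only here so the sketch runs end to end.
    pred = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, pred)
    train_op = None
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    x = np.random.rand(100, 4).astype(np.float32)
    y = np.random.rand(100, 1).astype(np.float32)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).batch(10).repeat()

config = tf.estimator.RunConfig(
    model_dir="/workspace/wlc/model_dir/",
    save_checkpoints_steps=1000)  # set EITHER this OR save_checkpoints_secs

estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=5000)
eval_spec = tf.estimator.EvalSpec(
    input_fn=input_fn,
    steps=10,
    start_delay_secs=1800,  # "Waiting 1800.000000 secs before starting eval"
    throttle_secs=1800)     # evaluator re-runs at most every 1800 s, and only
                            # after a NEW checkpoint appears in model_dir
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```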
You need to split it yourself; in the code, each worker then decides which shard to read based on its task index. PS: change the glob line in the current code.
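As a hedged illustration of that index-based split (the file pattern, paths, and the chief-counted-as-worker-0 convention are assumptions, not the repo's exact glob line):

```python
import glob
import json
import os

# Parse this process's role out of TF_CONFIG, as the training script would.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task = tf_config.get("task", {"type": "chief", "index": 0})

# All input shards visible to every machine (shared storage assumed here);
# the pattern is a placeholder for the glob line mentioned above.
all_files = sorted(glob.glob("/workspace/wlc/wide_deep_dist/data/train*.csv"))

# Count the chief as training worker 0 and the 'worker' tasks as 1, 2, ...
num_workers = 1 + len(tf_config.get("cluster", {}).get("worker", []))
worker_id = 0 if task["type"] == "chief" else 1 + task["index"]

# Each training process keeps only every num_workers-th file, so the
# workers collectively cover the dataset once without overlap.
my_files = all_files[worker_id::num_workers]
print("this worker reads:", my_files)
```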