bleachzk
LCM logs: > {"level":"info","msg":"transport: http2Server.HandleStreams failed to read frame: read tcp [::1]:8443-\u003e[::1]:53622: read: connection reset by peer","time":"2018-06-20T08:56:58Z"}
Model definition:
  Name: tf_convolutional_network_tutorial
  Description: Convolutional network model using tensorflow
  Framework: tensorflow:1.7.0-gpu-py3
Training:
  Status: PENDING
  Submitted: N/A
  Completed: N/A
  Resources: 2.00 CPUs | 1.00 GPUs | 2.00 GB Mem |...
@Tomcli I have set up nvidia-device-plugin.
@Tomcli After upgrading to v0.1, the learner pod fails to start with this error: > MountVolume.SetUp failed for volume "cosoutputmount-107082b8-77ec-4686-61f6-b87d630babfb" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: s3fs: error while...
System version: CentOS 7.2, kernel 3.10.0-514.26.2.el7.x86_64; Kubernetes version: 1.10; Docker version: CE 18.03. @Tomcli
I get the same problem.
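@jiarunying Thanks for the reply. 1. When I test saving a model from distributed training locally, if the PS and Worker do not share a storage path, saving the model fails with the error: NotFoundError (see above for traceback): xxxx_model/1 variables/variables_temp_ae346506332a4adc801e21a63e1c3314; 2. If the PS and Worker output paths are on NFS shared storage, the model is saved correctly; 3. TensorFlow Serving does not seem to support loading a model produced by distributed training.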
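One way to check whether the PS and Worker really see the same output path before saving is a small probe: one process writes a uniquely named file into the export directory and the other checks that it is visible. The following is only a sketch using the Python standard library. The function names `write_probe` and `probe_visible` are hypothetical and not part of TensorFlow, and how the probe name is passed between processes is left open.

```python
import os
import tempfile
import uuid


def write_probe(export_dir):
    """Create a uniquely named empty file in export_dir and return its name."""
    os.makedirs(export_dir, exist_ok=True)
    name = "probe_" + uuid.uuid4().hex
    with open(os.path.join(export_dir, name), "w"):
        pass  # an empty file is enough for a visibility check
    return name


def probe_visible(export_dir, name):
    """Return True if the probe file written elsewhere is visible in export_dir."""
    return os.path.isfile(os.path.join(export_dir, name))


if __name__ == "__main__":
    # Single-machine demo: two separate temp dirs stand in for unshared local disks.
    with tempfile.TemporaryDirectory() as worker_dir, \
            tempfile.TemporaryDirectory() as ps_dir:
        probe = write_probe(worker_dir)
        print(probe_visible(worker_dir, probe))  # same path
        print(probe_visible(ps_dir, probe))      # different, unshared path
```

In a real cluster the Worker would call `write_probe` on its output path and the PS would call `probe_visible` on its own. If that returns False, the paths are not shared, which matches the failing case in point 1.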
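> Let me implement it first. Is there a timeline? I am still having problems with the RPC authentication part and would like to ask for some advice. @zhaiyuyong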