bleachzk

Results 8 comments of bleachzk

LCM logs: > {"level":"info","msg":"transport: http2Server.HandleStreams failed to read frame: read tcp [::1]:8443-\u003e[::1]:53622: read: connection reset by peer","time":"2018-06-20T08:56:58Z"}
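The `\u003e` in the log entry above is the JSON escape for `>`, so the connection in the message reads `[::1]:8443->[::1]:53622`. A minimal Python sketch for decoding such an entry, assuming each LCM log entry is one JSON object per line as in the sample (the variable names are illustrative only):

```python
import json
from datetime import datetime, timezone

# Log line copied from the comment above; a raw string keeps \u003e intact
# so that the JSON parser, not Python, decodes the escape.
line = r'{"level":"info","msg":"transport: http2Server.HandleStreams failed to read frame: read tcp [::1]:8443-\u003e[::1]:53622: read: connection reset by peer","time":"2018-06-20T08:56:58Z"}'

entry = json.loads(line)

# The timestamp in the sample is UTC (trailing "Z").
when = datetime.strptime(entry["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

print(entry["level"], when.isoformat())
print(entry["msg"])
```

Note that the entry is logged at `info` level, not `error`.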

Model definition:
  Name: tf_convolutional_network_tutorial
  Description: Convolutional network model using tensorflow
  Framework: tensorflow:1.7.0-gpu-py3
Training:
  Status: PENDING
  Submitted: N/A
  Completed: N/A
  Resources: 2.00 CPUs | 1.00 GPUs | 2.00 GB Mem |...

@Tomcli I have set up the nvidia-device-plugin: ![1](https://user-images.githubusercontent.com/5548534/41820030-a8bff524-77fd-11e8-80ac-febccf372ca2.PNG)

@Tomcli After upgrading to v0.1, the learner pod fails to start with this error: > MountVolume.SetUp failed for volume "cosoutputmount-107082b8-77ec-4686-61f6-b87d630babfb" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: s3fs: error while...

System version: CentOS 7.2 3.10.0-514.26.2.el7.x86_64 Kubernetes version: 1.10 Docker version: CE 18.03 @Tomcli

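@jiarunying Thanks for the reply.
1. When I tested saving a model from distributed training locally, saving fails if the PS and the Worker do not share a storage path: NotFoundError (see above for traceback): xxxx_model/1 variables/variables_temp_ae346506332a4adc801e21a63e1c3314
2. If the PS and Worker output paths are on NFS shared storage, the model is saved correctly.
3. TensorFlow Serving does not seem to support loading a model produced by distributed training.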
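Points 1 and 2 suggest that the save only succeeds when every task resolves the output path to the same storage. A minimal sketch of how that could be checked before exporting, using only the Python standard library: one task writes a uniquely named marker file and the others look for it. The helper names `write_marker` and `sees_marker` are hypothetical, and how the marker name is passed between tasks is left as an assumption; this is not part of TensorFlow or of the project discussed here.

```python
import os
import tempfile
import uuid


def write_marker(export_dir: str) -> str:
    """Run on one task (for example the chief): create a marker file, return its name."""
    os.makedirs(export_dir, exist_ok=True)
    name = f".shared_check_{uuid.uuid4().hex}"
    with open(os.path.join(export_dir, name), "w") as fh:
        fh.write("ok")
    return name


def sees_marker(export_dir: str, name: str) -> bool:
    """Run on the other tasks: True only if the marker is visible at this path."""
    return os.path.isfile(os.path.join(export_dir, name))


if __name__ == "__main__":
    # Local demonstration with two unrelated directories standing in for
    # a shared mount and a task-local path.
    shared = tempfile.mkdtemp()
    local_only = tempfile.mkdtemp()
    marker = write_marker(shared)
    print("shared path sees marker:", sees_marker(shared, marker))
    print("local path sees marker:", sees_marker(local_only, marker))
```

If a task does not see the marker, its output path is not the same storage as the writer's, which matches the non-shared case in point 1.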

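> Let me implement it first. Is there a timeline? I am still having problems with RPC authentication and would like to ask for your advice. @zhaiyuyong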