bleachzk comments

Results: 8 comments of bleachzk

Training status is PENDING not change

LCM logs: > {"level":"info","msg":"transport: http2Server.HandleStreams failed to read frame: read tcp [::1]:8443-\u003e[::1]:53622: read: connection reset by peer","time":"2018-06-20T08:56:58Z"}

Training status is PENDING not change

Model definition:
Name: tf_convolutional_network_tutorial
Description: Convolutional network model using tensorflow
Framework: tensorflow:1.7.0-gpu-py3
Training:
Status: PENDING
Submitted: N/A
Completed: N/A
Resources: 2.00 CPUs | 1.00 GPUs | 2.00 GB Mem |...

Training status is PENDING not change

@Tomcli I have set up the nvidia-device-plugin ![1](https://user-images.githubusercontent.com/5548534/41820030-a8bff524-77fd-11e8-80ac-febccf372ca2.PNG)

Training status is PENDING not change

@Tomcli After upgrading to v0.1, the learner pod fails to start with this error: > MountVolume.SetUp failed for volume "cosoutputmount-107082b8-77ec-4686-61f6-b87d630babfb" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: s3fs: error while...

Training status is PENDING not change

System version: CentOS 7.2 (3.10.0-514.26.2.el7.x86_64); Kubernetes version: 1.10; Docker version: CE 18.03 @Tomcli

GPU isolation not working after setting default runtime to nvidia

I get the same problem.

TensorFlow version compatibility and model saving

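@jiarunying Thanks for the reply. 1. When I test saving a model from distributed training locally, if the PS and the Worker do not share a storage path, saving the model fails with the error: NotFoundError (see above for traceback): xxxx_model/1 variables/variables_temp_ae346506332a4adc801e21a63e1c3314; 2. If the PS and Worker output paths are on NFS shared storage, the model is saved correctly; 3. TensorFlow Serving does not seem to support loading a model produced by distributed training.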

Does XLearning support secure clusters?

> Is there a timeline for implementing this? I still have problems with the RPC authentication part and would like to ask for advice. @zhaiyuyong