FfDL
Training status stays PENDING and does not change

I can't get any error logs ...
LCM logs:
{"level":"info","msg":"transport: http2Server.HandleStreams failed to read frame: read tcp [::1]:8443-\u003e[::1]:53622: read: connection reset by peer","time":"2018-06-20T08:56:58Z"}
Hi @bleachzk, can I have some details about your job? (e.g. $CLI_CMD show training-gSR-qONmR). If you are requesting GPUs for your training job, do you have any GPU resources available on your Kubernetes Cluster?
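For example, a quick way to check whether your nodes advertise any allocatable GPU resources (assuming you have kubectl access to the cluster):
# List any GPU-related capacity/allocatable entries on your nodes.
kubectl describe nodes | grep -i gpu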
Model definition:
        Name: tf_convolutional_network_tutorial
        Description: Convolutional network model using tensorflow
        Framework: tensorflow:1.7.0-gpu-py3
Training:
        Status: PENDING
        Submitted: N/A
        Completed: N/A
        Resources: 2.00 CPUs | 1.00 GPUs | 2.00 GB Mem | 1 node(s)
        Command: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001 --trainingIters 20000
        Input data : sl-internal-os-input
        Output data: sl-internal-os-output
Data stores:
        ID: sl-internal-os-input
        Type: s3_datastore
        Connection:
                auth_url: http://s3.default.svc.cluster.local
                bucket: tf_training_data
                password: test
                type: s3_datastore
                user_name: test
        ID: sl-internal-os-output
        Type: s3_datastore
        Connection:
                auth_url: http://s3.default.svc.cluster.local
                bucket: tf_trained_model
                password: test
                type: s3_datastore
                user_name: test
Summary metrics:
OK
@Tomcli I have set up the nvidia-device-plugin
Hi @bleachzk, did you deploy ffdl-lcm with the device-plugin tag (e.g. helm install --set lcm.version="device-plugin" .)? ffdl-lcm:latest uses the accelerators feature for GPU resources.
After you switch ffdl-lcm to the device-plugin tag, all new GPU jobs should consume nvidia.com/gpu resources.
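One way to confirm the switch took effect is to inspect the resource limits on a newly created learner pod:
# <learner-pod> is a placeholder; substitute one of your new learner pods.
kubectl get pod <learner-pod> -o jsonpath='{.spec.containers[*].resources.limits}'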
Since accelerators are deprecated as of Kubernetes 1.10, we will add a new pre-0.1 patch to FfDL this week that uses device-plugin as the default.
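If you want to see which resource name your nodes currently expose, something like this should work (<node-name> is a placeholder):
# The deprecated accelerators feature exposes alpha.kubernetes.io/nvidia-gpu,
# while the device plugin exposes nvidia.com/gpu.
kubectl describe node <node-name> | grep -E 'alpha.kubernetes.io/nvidia-gpu|nvidia.com/gpu'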
@Tomcli After upgrading to v0.1, the learner pod fails to start with this error:
MountVolume.SetUp failed for volume "cosoutputmount-107082b8-77ec-4686-61f6-b87d630babfb" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: s3fs: error while loading shared libraries: libfuse.so.2: cannot open shared object file: No such file or directory
Hi @bleachzk, with our new v0.1 release, we require users to install the s3fs drivers on each of their nodes (e.g. using the storage-plugin helm chart or following the ibmcloud-object-storage-plugin instructions).
Since the s3fs installation varies across Kubernetes environments, I can point you to more specific instructions if you let me know what kind of Kubernetes environment you are using.
Thanks.
System version: CentOS 7.2 (3.10.0-514.26.2.el7.x86_64)
Kubernetes version: 1.10
Docker version: CE 18.03
@Tomcli
@bleachzk For your cluster, I think you need to install s3fs and copy the driver binary ibmc-s3fs onto each of your worker nodes. Since you are on CentOS, s3fs comes from the EPEL repository as the s3fs-fuse package (installing it also pulls in the libfuse.so.2 library from your error):
sudo yum install -y epel-release
sudo yum install -y s3fs-fuse
sudo mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo cp <FfDL repo>/bin/ibmc-s3fs /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo chmod +x /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/ibmc-s3fs
sudo systemctl restart kubelet
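To sanity-check the installation, you can verify the driver binary is in place and see whether kubelet picked it up (assuming kubelet runs as a systemd unit, as is typical on CentOS 7):
# Confirm the flexvolume driver binary exists and is executable.
ls -l /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/
# Look for flexvolume registration messages or errors in the kubelet logs.
sudo journalctl -u kubelet | grep -i flexvolume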
Then, install the storage-plugin helm chart if you haven't done so already.
helm install storage-plugin --set cloud=false
Then your learner pods should be able to mount any S3 Object Storage bucket.
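If a mount still fails after this, the pod events are usually the quickest place to look (<learner-pod> is a placeholder):
# Check the Events section at the bottom for MountVolume errors.
kubectl describe pod <learner-pod>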