
Training status stays PENDING and does not change

Open bleachzk opened this issue 7 years ago • 9 comments

[screenshot attached]

I can't get any error logs ...

bleachzk avatar Jun 20 '18 08:06 bleachzk

LCM logs:

{"level":"info","msg":"transport: http2Server.HandleStreams failed to read frame: read tcp [::1]:8443-\u003e[::1]:53622: read: connection reset by peer","time":"2018-06-20T08:56:58Z"}

bleachzk avatar Jun 20 '18 09:06 bleachzk

Hi @bleachzk, can I have some details about your job? (e.g. $CLI_CMD show training-gSR-qONmR). If you are requesting GPUs for your training job, do you have any GPU resources available on your Kubernetes Cluster?
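For anyone hitting the same symptom: a quick way to check whether the cluster actually advertises GPU resources, and why the learner pod is stuck, is the sketch below (the pod name is a placeholder; these are generic kubectl commands, not part of the FfDL CLI):

```shell
# List how many nvidia.com/gpu resources each node advertises (empty means none).
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# Inspect the scheduler's reason for keeping the learner pod in PENDING.
kubectl describe pod <learner-pod-name> | grep -A 10 Events
```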

Tomcli avatar Jun 20 '18 21:06 Tomcli

Model definition:
  Name: tf_convolutional_network_tutorial
  Description: Convolutional network model using tensorflow
  Framework: tensorflow:1.7.0-gpu-py3
Training:
  Status: PENDING
  Submitted: N/A
  Completed: N/A
  Resources: 2.00 CPUs | 1.00 GPUs | 2.00 GB Mem | 1 node(s)
  Command: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001 --trainingIters 20000
  Input data : sl-internal-os-input
  Output data: sl-internal-os-output
Data stores:
  ID: sl-internal-os-input
  Type: s3_datastore
  Connection:
    auth_url: http://s3.default.svc.cluster.local
    bucket: tf_training_data
    password: test
    type: s3_datastore
    user_name: test
  ID: sl-internal-os-output
  Type: s3_datastore
  Connection:
    auth_url: http://s3.default.svc.cluster.local
    bucket: tf_trained_model
    password: test
    type: s3_datastore
    user_name: test
Summary metrics: OK

bleachzk avatar Jun 24 '18 14:06 bleachzk

@Tomcli I have set up the nvidia-device-plugin.

bleachzk avatar Jun 24 '18 14:06 bleachzk

Hi @bleachzk, did you deploy ffdl-lcm with the device-plugin tag (e.g. helm install --set lcm.version="device-plugin" .)? The ffdl-lcm:latest image uses accelerators for GPU resources.

After you switch ffdl-lcm to the device-plugin tag, all new GPU jobs should consume nvidia.com/gpu resources.
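If FfDL is already deployed, the switch above can be done in place with helm; a minimal sketch, assuming a release name of `ffdl` and an `lcm` deployment name (both are assumptions, check yours with `helm list` and `kubectl get deployments`):

```shell
# Redeploy the lifecycle manager with the device-plugin image tag
# (run from the FfDL helm chart directory; release name is illustrative).
helm upgrade ffdl . --set lcm.version="device-plugin"

# Confirm the running LCM pod now uses the device-plugin image tag.
kubectl get deployment lcm -o jsonpath='{.spec.template.spec.containers[0].image}'
```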

Since accelerators are deprecated in Kubernetes 1.10, we will push a new pre-0.1 patch to FfDL this week that uses device-plugin by default.

Tomcli avatar Jun 25 '18 18:06 Tomcli

@Tomcli after upgrading to v0.1, the learner pod fails to start with this error:

MountVolume.SetUp failed for volume "cosoutputmount-107082b8-77ec-4686-61f6-b87d630babfb" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: s3fs: error while loading shared libraries: libfuse.so.2: cannot open shared object file: No such file or directory

bleachzk avatar Jul 02 '18 16:07 bleachzk

Hi @bleachzk, with our new v0.1 release we require users to install the s3fs drivers on each of their nodes (e.g. using the storage-plugin helm chart or following the ibmcloud-object-storage-plugin instructions).

Since the s3fs installation varies across Kubernetes environments, I can point you to more specific instructions if you let me know what kind of Kubernetes environment you are using.
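As a first debugging step for the `libfuse.so.2` error above, you can check directly on the node whether the shared library s3fs needs is present (the install command is an assumption based on the reported error, not something confirmed in this thread):

```shell
# Check whether libfuse.so.2 is registered with the dynamic linker on this node.
ldconfig -p | grep libfuse || echo "libfuse.so.2 not found"

# If it is missing on CentOS/RHEL, the fuse packages would typically provide it:
# sudo yum install -y fuse fuse-libs
```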

Thanks.

Tomcli avatar Jul 02 '18 17:07 Tomcli

System version: CentOS 7.2 (3.10.0-514.26.2.el7.x86_64)
Kubernetes version: 1.10
Docker version: CE 18.03
@Tomcli

bleachzk avatar Jul 03 '18 01:07 bleachzk

@bleachzk For your cluster, I think you need to install the s3fs drivers and copy the driver binary ibmc-s3fs to each of your worker nodes:

# On CentOS, s3fs-fuse comes from the EPEL repository (apt-get is Debian/Ubuntu only)
sudo yum install -y epel-release
sudo yum install -y s3fs-fuse
sudo mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo cp <FfDL repo>/bin/ibmc-s3fs /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo chmod +x /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/ibmc-s3fs
sudo systemctl restart kubelet
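The mkdir/cp/chmod layout above can be rehearsed safely in a scratch directory before touching the real kubelet plugin path; this sandbox sketch uses a stand-in script instead of the real ibmc-s3fs binary:

```shell
# Sandbox rehearsal of the flexvolume driver layout (safe to run anywhere).
PLUGIN_DIR="$(mktemp -d)/ibm~ibmc-s3fs"
mkdir -p "$PLUGIN_DIR"

# Stand-in for the real ibmc-s3fs binary, just to verify the path and permissions.
printf '#!/bin/sh\necho ok\n' > "$PLUGIN_DIR/ibmc-s3fs"
chmod +x "$PLUGIN_DIR/ibmc-s3fs"

# The driver must be directly executable from its vendor~driver directory.
"$PLUGIN_DIR/ibmc-s3fs"   # prints: ok
```

The key detail the rehearsal checks is the `vendor~driver` directory naming (`ibm~ibmc-s3fs`) and the executable bit, both of which kubelet requires before it will load a flexvolume driver.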

Then, install the storage-plugin helm chart if you haven't done it.

helm install storage-plugin --set cloud=false

Then your learner pods should be able to mount any S3 Object Storage.

Tomcli avatar Jul 03 '18 22:07 Tomcli