Tommy Li comments

Results: 187 comments by Tommy Li

kubectl get pods :lcm ContainerCreating,prometheus trainer and trainingdata STATUS CrashLoopBackOff

Hi @Eric-Zhang1990, it looks like some internal connections are either refused or timed out. If your local area network has low bandwidth, I recommend deploying FfDL without any monitoring...

kubectl get pods :lcm ContainerCreating,prometheus trainer and trainingdata STATUS CrashLoopBackOff

Hi @Eric-Zhang1990, it looks like some of the services are not reachable between two of your worker nodes. Also, the errors you had before that failed the liveness test also...

kubectl get pods :lcm ContainerCreating,prometheus trainer and trainingdata STATUS CrashLoopBackOff

You can check the list of storage classes on your cluster by running `kubectl get storageclass`. Then you can run `export SHARED_VOLUME_STORAGE_CLASS=""` to use your desired storageclass as FfDL's persistent storage....

Training status is PENDING not change

Hi @bleachzk, can I have some details about your job? (e.g. `$CLI_CMD show training-gSR-qONmR`). If you are requesting GPUs for your training job, do you have any GPU resources available...

Training status is PENDING not change

Hi @bleachzk, did you deploy `ffdl-lcm` with the `device-plugin` tag? (e.g. `helm install --set lcm.version="device-plugin" .`), since `ffdl-lcm:latest` will use accelerators for GPU resources. After you change `ffdl-lcm` to the `device-plugin` tag,...

Training status is PENDING not change

Hi @bleachzk, with our new v0.1 release, we require users to install the s3fs drivers on each of their nodes (e.g. using the `storage-plugin` [helm chart](https://github.com/IBM/FfDL#5-detailed-installation-instructions) or follow the...

Training status is PENDING not change

@bleachzk For your cluster, I think you need to install the s3fs drivers and copy the driver binary `ibmc-s3fs` on each of your worker nodes. ```shell sudo apt-get install s3fs...

VCK integration proposal

Many distributed learning methods require shared file storage to sync with the other workers. Currently all our workers are mounted on the same input and result bucket, so we have...

FfDL UI update

This GUI uses the same backend as our current GUI. Most of the changes are on the frontend side, which will be an enhancement to our current frontend. The...

Feature request: Update Tensorflow version to be compatible with Python 3.8+

AIX360 currently has a hard dependency on TensorFlow 1.14.0, which only works on Python 3.7 and below. Are there any plans to update TensorFlow to 2.2+ to support...