Tommy Li
Hi @Eric-Zhang1990, it looks like some internal connections are either being refused or timing out. If your local area network has low bandwidth, I recommend deploying FfDL without any monitoring...
Hi @Eric-Zhang1990, it looks like some of the services are not reachable between two of your worker nodes. Also, the errors you had earlier that failed the liveness test also...
You can list the storage classes on your cluster by running `kubectl get storageclass`. Then you can run `export SHARED_VOLUME_STORAGE_CLASS=""` to use your desired storage class as FfDL's persistent storage....
Hi @bleachzk, can I have some details about your job? (e.g. `$CLI_CMD show training-gSR-qONmR`). If you are requesting GPUs for your training job, do you have any GPU resources available...
Hi @bleachzk, did you deploy `ffdl-lcm` with the `device-plugin` tag (e.g. `helm install --set lcm.version="device-plugin" .`)? I ask because `ffdl-lcm:latest` will use accelerators for GPU resources. After you changed `ffdl-lcm` to the `device-plugin` tag,...
Hi @bleachzk, with our new v0.1 release, we require users to install the s3fs drivers on each of their nodes (e.g. using the `storage-plugin` [helm chart](https://github.com/IBM/FfDL#5-detailed-installation-instructions) or follow the...
@bleachzk For your cluster, I think you need to install the s3fs drivers and copy the driver binary `ibmc-s3fs` to each of your worker nodes. ```shell sudo apt-get install s3fs...
Many distributed learning methods require shared file storage to sync with the other workers. Currently, all our workers mount the same input and result bucket, so we have...
This GUI uses the same backend as our current GUI. Most of the changes are on the frontend side, which will be an enhancement to our current frontend. The...
AIX360 currently has a hard dependency on TensorFlow 1.14.0, which only works on Python 3.7 and below. Are there any plans to update TensorFlow to 2.2+ to support...