Abin Shahab
@EnricoMi thanks for catching this. The issue is that the default value of placement_group_timeout_s is not being applied. I'll try to take a look this week.
@n-balla TensorFlow Keras has callbacks that let you access the current step number. If you are implementing a custom loop, then each worker will have access to the...
@tanmoyio, by Databricks, do you mean you are running the jobs (PyTorch? TensorFlow?) inside Spark? Can you explain how you are doing inference?
@yundai424 I wonder if it's related to the other callbacks in that [example](https://github.com/horovod/horovod/blob/master/examples/tensorflow2/tensorflow2_keras_mnist.py#L74) that allreduce at epoch boundaries. Can you try removing those callbacks to narrow the problem down?
Actually I do have time next week if you are fine waiting. This is an awesome project, I'd like to contribute.
Can you elaborate on the following: "Define the controller logic in the controllers directory. This is where you will download the protobuf file, parse its contents into a Ray DAG,...
What would the reconciliation loop of the controller do?
Can you implement the reconciliation loop in golang?