chongxiaoc
It looks like this PR needs to be rebased after https://github.com/horovod/horovod/pull/3665 is merged; `make_dataset_fn()` in keras is changed there.
@leewyang @EnricoMi I will take a look asap. Thanks.
From my (limited) understanding of using Horovod with TensorFlow:
- Collective operations like allreduce, allgather, and broadcast are used for communication between ranks, rather than being defined as TensorFlow...
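To make that concrete, here is a minimal sketch of the three collectives mentioned, using `horovod.tensorflow`; the tensor values are illustrative only:

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
t = tf.constant([float(hvd.rank())])

reduced = hvd.allreduce(t, op=hvd.Sum)   # every rank receives the sum across ranks
gathered = hvd.allgather(t)              # every rank receives all ranks' tensors concatenated
synced = hvd.broadcast(t, root_rank=0)   # every rank receives rank 0's tensor
```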
@loic-ehrhardt Your fix is helpful; I will look at it.
Can you try this patch as well? It looks like the allreduce kernel multiplies by `postscale_factor` internally, so we have to convert it back when calculating the local grad. Though I'm not sure...
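For concreteness, a minimal sketch of what "converting it back" could look like, assuming the `postscale_factor` argument of `horovod.tensorflow.allreduce`; the factor `f` here is purely illustrative:

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# With op=hvd.Sum and postscale_factor=f, the returned tensor is
# f * sum_over_ranks(t), so the raw sum can be recovered by dividing by f.
f = 1.0 / hvd.size()
t = tf.constant([1.0, 2.0, 3.0])
summed = hvd.allreduce(t, op=hvd.Sum, postscale_factor=f)
raw_sum = summed / f  # undo the internal postscale when computing the local grad
```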
> Thanks @chongxiaoc.
>
> Interesting, the `test_tensorflow.TensorFlowTests().test_horovod_allgather_grad_cpu()` fails... same for `test_horovod_broadcast_grad_cpu()`.
>
> Is this scaling really expected? When running the following in `horovod/test/parallel`:
>
> ```
> horovodrun...
Okay, now I think a common question arising for both of us is: why does `horovod` use `sum` to allreduce the grads of collective operations, rather than `average`? For example:...
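For reference, a minimal sketch of the two reduction modes in question in `horovod.tensorflow` (the gradient tensor is illustrative); the results differ by a factor of `hvd.size()`:

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
grad = tf.constant([1.0, 2.0, 3.0])

# Average is Sum divided by the number of ranks:
summed = hvd.allreduce(grad, op=hvd.Sum)
averaged = hvd.allreduce(grad, op=hvd.Average)
# averaged == summed / hvd.size()
```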
@adamelk Is there a simple reproducer I can try?
I think the reason behind the scenes is that the `KerasEstimator` dataloader generates the same number of feature tensors and target tensors to match the model's input and output layers. So if...
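As an illustration of that matching, here is a sketch using `horovod.spark.keras.KerasEstimator`; the column names and model are hypothetical, and a real run would also need a `store` and a Spark session:

```python
import tensorflow as tf
from horovod.spark.keras import KerasEstimator

# Hypothetical two-input, one-output model.
inp_a = tf.keras.Input(shape=(4,), name="a")
inp_b = tf.keras.Input(shape=(4,), name="b")
out = tf.keras.layers.Dense(1)(tf.keras.layers.concatenate([inp_a, inp_b]))
model = tf.keras.Model(inputs=[inp_a, inp_b], outputs=out)

estimator = KerasEstimator(
    model=model,
    optimizer=tf.keras.optimizers.Adam(),
    loss="mse",
    feature_cols=["a", "b"],  # one feature column per model input layer
    label_cols=["y"],         # one label column per model output layer
    batch_size=32,
    epochs=1,
)
```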
Using PTL `1.6.3` hits some errors in the validation step during the fit function:
```
outputs = []

def validation_epoch_end(self, outputs):
>   avg_loss = torch.stack(outputs).mean()
E   RuntimeError: stack expects a non-empty TensorList...
```
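One way to avoid this particular failure, sketched below, is to guard `validation_epoch_end` against an empty `outputs` list; the model here is a hypothetical minimal `LightningModule`, not the one from this PR:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # The returned losses are collected into `outputs` below.
        return nn.functional.mse_loss(self.layer(x), y)

    def validation_epoch_end(self, outputs):
        # Guard against an empty list, which raises the RuntimeError above
        # when no validation batches ran on this rank.
        if not outputs:
            return
        avg_loss = torch.stack(outputs).mean()
        self.log("avg_val_loss", avg_loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```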