chongxiaoc

Results 37 comments of chongxiaoc

It looks like this PR needs to be rebased after https://github.com/horovod/horovod/pull/3665 is merged; `make_dataset_fn()` in Keras is changed there.

@leewyang @EnricoMi I will take a look asap. Thanks.

From my (limited) understanding of using Horovod with TensorFlow:

- The collective operations like allreduce, allgather, and broadcast are used for communication between ranks, rather than being defined as TensorFlow...
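For context, here is a minimal sketch (a toy per-rank tensor, run under `horovodrun`) of how those collective operations are typically invoked for cross-rank communication:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per rank

# A toy tensor whose value differs per rank.
local_value = tf.constant([float(hvd.rank())])

# allreduce: combine the tensor across all ranks (sum here; average is also possible).
summed = hvd.allreduce(local_value, op=hvd.Sum)

# allgather: concatenate every rank's tensor along the first dimension.
gathered = hvd.allgather(local_value)

# broadcast: copy rank 0's tensor to every other rank.
synced = hvd.broadcast(local_value, root_rank=0)
```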

Can you try this patch as well? It looks like the allreduce kernel is multiplying by `postscale_factor` internally, so it has to be converted back when calculating the local grad. Though I'm not sure...
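For reference, `hvd.allreduce` in `horovod.tensorflow` exposes `prescale_factor`/`postscale_factor` arguments; a minimal sketch (the tensor values and the 0.5 factor are made up) of what I mean by converting the scaling back:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

grad = tf.constant([1.0, 2.0, 3.0])

# postscale_factor is applied to the reduced result inside the op,
# so the returned tensor already includes that multiplication.
reduced = hvd.allreduce(grad, op=hvd.Sum, postscale_factor=0.5)

# To recover the plain sum across ranks, the scaling has to be undone explicitly.
plain_sum = reduced / 0.5
```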

> Thanks @chongxiaoc.
>
> Interesting, the `test_tensorflow.TensorFlowTests().test_horovod_allgather_grad_cpu()` fails... same for `test_horovod_broadcast_grad_cpu()`.
>
> Is this scaling really expected? When running the following in `horovod/test/parallel`:
>
> ```
> horovodrun...
> ```

Okay, now I think a common question is arising for both of us: why does `horovod` use `sum` to allreduce all grads for collective operations, rather than `average`? For example:...
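To make the question concrete, here is a small sketch (the per-rank gradient values are made up, run under `horovodrun`) comparing the two reduction modes:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Each rank contributes a different gradient value.
grad = tf.constant([float(hvd.rank() + 1)])

# Sum of all ranks' gradients.
grad_sum = hvd.allreduce(grad, op=hvd.Sum)

# Average over all ranks, i.e. the sum divided by hvd.size().
grad_avg = hvd.allreduce(grad, op=hvd.Average)

print(hvd.rank(), grad_sum.numpy(), grad_avg.numpy())
```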

@adamelk Is there a simple reproducer I can try?

I think the reason behind the scenes is that the `KerasEstimator` dataloader generates the same number of feature tensors and target tensors to match the model's input and output layers. So if...
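As a rough illustration (the column names, store path, and model shapes below are all made up, not taken from the actual code), the number of `feature_cols` has to line up with the model's input layers and `label_cols` with its output layers:

```python
import tensorflow as tf
from horovod.spark.keras import KerasEstimator
from horovod.spark.common.store import Store

# Hypothetical two-input / two-output Keras model.
in_a = tf.keras.Input(shape=(4,), name='in_a')
in_b = tf.keras.Input(shape=(4,), name='in_b')
hidden = tf.keras.layers.Concatenate()([in_a, in_b])
out_x = tf.keras.layers.Dense(1, name='out_x')(hidden)
out_y = tf.keras.layers.Dense(1, name='out_y')(hidden)
model = tf.keras.Model(inputs=[in_a, in_b], outputs=[out_x, out_y])

store = Store.create('/tmp/horovod_store')  # made-up local store path

# The dataloader emits one feature tensor per feature_col and one target
# tensor per label_col, so the list lengths must match the model's
# input and output layers respectively.
estimator = KerasEstimator(
    model=model,
    optimizer=tf.keras.optimizers.Adam(),
    loss=['mse', 'mse'],
    feature_cols=['feat_a', 'feat_b'],    # 2 columns -> 2 input layers
    label_cols=['target_x', 'target_y'],  # 2 columns -> 2 output layers
    batch_size=32,
    epochs=1,
    store=store,
)
```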

Using PTL `1.6.3` has some errors in the validation step of the fit function:

```
outputs = []
def validation_epoch_end(self, outputs):
>       avg_loss = torch.stack(outputs).mean()
E       RuntimeError: stack expects a non-empty TensorList...
```
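One way to sidestep this, as a minimal sketch (the module below is a toy, and the empty-list guard is just an assumed workaround, not necessarily the actual fix):

```python
import torch
import pytorch_lightning as pl

class ToyModule(pl.LightningModule):
    # Only the relevant hook is shown; training/validation steps are omitted.
    def validation_epoch_end(self, outputs):
        # Guard against an empty `outputs` list, which otherwise makes
        # torch.stack raise "stack expects a non-empty TensorList".
        if not outputs:
            return
        avg_loss = torch.stack(outputs).mean()
        self.log('avg_val_loss', avg_loss)
```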