Daniel Rasmussen comments

Results: 86 comments of Daniel Rasmussen

New optimizers fail to load CUDA installed through conda

Just checked, and it produces the same error as before. Here are the reproduction steps (I updated the installation instructions to match the changes for TF 2.12 here https://www.tensorflow.org/install/pip#linux): ```...

New optimizers fail to load CUDA installed through conda

Yes, that makes the problem go away, although I would hesitate to call it a solution as that's quite a cumbersome process to repeat every time we create a new...

[BUG][all_reduce] INVALID_ARGUMENT: You must feed a value for placeholder tensor

As an additional update, I believe that this bug is triggered whenever applying the `Adam` optimizer in a distributed context (I haven't done an exhaustive search over the optimizers, that's...

Keras 3 gives incorrect output from evaluate/fit in distributed context

With a bit more investigation I figured out that what's going on is that `evaluate` is only reporting the loss from the first replica, and ignoring the rest. Here's an...

Keras 3 gives incorrect output from evaluate/fit in distributed context

One more piece of investigation. I believe the above issue with `evaluate` is mainly a display issue. The model is computing the loss value correctly in each replica, but only...

Keras 3 gives incorrect output from evaluate/fit in distributed context

I think I was able to get past this issue, but then I ran into this bug https://github.com/keras-team/keras/issues/19246 so I can't really tell if things are working correctly or not.

Keras 3 gives incorrect output from evaluate/fit in distributed context

No, the issue is not resolved. I had been working on a fix locally, but was unable to verify it due to that other bug. But this issue itself is...

Keras 3 gives incorrect output from evaluate/fit in distributed context

This issue will still require a pull request (or two) of its own to fix; it definitely won't be resolved on its own after #19246 is fixed.

Keras 3 gives incorrect output from evaluate/fit in distributed context

I believe it's actually two separate issues (both requiring fixes). One is the wrong value being returned from `evaluate`. The other is that the gradient aggregation is not happening, so...

Keras 3 gives incorrect output from evaluate/fit in distributed context

I haven't had a chance to dig into it more. I believe there was an attempt to fix this here https://github.com/keras-team/keras/pull/19969, but then that was reverted so I'm not sure...