xla
Gradient bucketing using a pre-defined bucket size cap
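For readers unfamiliar with the idea in the title, here is a minimal sketch of greedy gradient bucketing under a size cap. It is not the code from this PR. It assumes gradients are visited in a fixed order, that only their byte sizes matter for grouping, and that a gradient larger than the cap gets a bucket of its own. The function name `bucket_by_size_cap` is hypothetical.

```python
from typing import List


def bucket_by_size_cap(grad_sizes: List[int], cap_bytes: int) -> List[List[int]]:
    """Hypothetical helper: greedily group gradient indices into buckets.

    Each bucket's total size stays <= cap_bytes, except that a single
    gradient larger than cap_bytes is placed in a bucket of its own.
    """
    if cap_bytes <= 0:
        raise ValueError("cap_bytes must be positive")
    buckets: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for idx, size in enumerate(grad_sizes):
        # Close the current bucket if adding this gradient would exceed the cap.
        if current and current_bytes + size > cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    # Flush the last partially filled bucket.
    if current:
        buckets.append(current)
    return buckets


# Gradient sizes in bytes with an 80-byte cap: the 100-byte gradient ends up alone.
print(bucket_by_size_cap([40, 30, 50, 100, 10], 80))
```

In a data-parallel setup, each bucket would then be reduced with one collective call instead of one call per gradient, trading a little extra memory for fewer, larger communications.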
Do you mind adding a test case?
Added the test case and rebased @JackCaoG @alanwaketan
@JackCaoG do you know why the build failed with "ERROR: Error initializing RemoteModule"?
The PR is from a fork, so it can't use the remote cache, but there was a bug where the build still tried to query the credentials. I think we fixed this error today, so it should now build without the cache. If you rebase, CI should start running.
@JackCaoG it looks like the build is still failing after rebasing. Maybe another rebase is needed?
The error still seems to be related to the fork. Let me grant both of you write access; then you can open a PR directly.
OK, I gave @amithrm write access.
Replaced by https://github.com/pytorch/xla/pull/7216 to avoid the build issues in CI testing.