Stas Bekman

664 comments of Stas Bekman

As this PR currently breaks under cpu offload, and very likely with grad accum > 1 (unrelated issues), Tunji, please kindly add a TODO list to the OP: 1. test...

Let's list the current breakages with this PR: 1. not working under cpu_offload, since `self.averaged_gradients` doesn't get populated in `stage_1_and_2.py`:
```
if self.cpu_offload is False:
    for i, _ in enumerate(self.bit16_groups):
        if...
```
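To illustrate the failure mode, here is a minimal pure-Python sketch (not the actual DeepSpeed code; the class is hypothetical, though the attribute names mirror `stage_1_and_2.py`): the gradients dict is populated only when offload is off, so a getter that reads it unconditionally fails under `cpu_offload=True`.

```python
# Sketch of the breakage: averaged_gradients is populated only when
# cpu_offload is off, so a grads getter that assumes it exists
# raises a KeyError under offload.
class Stage12OptimizerSketch:
    def __init__(self, cpu_offload):
        self.cpu_offload = cpu_offload
        self.bit16_groups = [["p0", "p1"], ["p2"]]
        self.averaged_gradients = {}

    def backward_epilogue(self):
        # mirrors the guard quoted above
        if self.cpu_offload is False:
            for i, _ in enumerate(self.bit16_groups):
                self.averaged_gradients[i] = [0.0] * len(self.bit16_groups[i])

    def get_grads(self, group_idx):
        # no cpu_offload-aware fallback: KeyError when offloading
        return self.averaged_gradients[group_idx]

opt = Stage12OptimizerSketch(cpu_offload=True)
opt.backward_epilogue()
try:
    opt.get_grads(0)
except KeyError:
    print("breaks under cpu_offload")
```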

Also a bonus feature would be to be able to return unscaled and clipped grads (before `step`)

OK, so with the fix above z3/grads works fine on 1 gpu, w/ and w/o offload - great. But with 2 gpus it sometimes works and sometimes doesn't. For example...

OK, so the following version will allow any length of shards:
```
def get_fp32_grad_for_param(self, param) -> Tensor:
    self.__reduce_and_partition_stream.synchronize()
    if self.offload_optimizer:
        group_idx, dest_offset, num_elements = self.grad_position[self.get_param_id(param)]
        fp32_grad = self.fp32_partitioned_groups_flat[group_idx].grad.narrow(
            0, dest_offset,...
```
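For reference, the shard-extraction pattern this relies on can be sketched without torch (the data and names below are hypothetical; `narrow(0, offset, n)` is emulated with list slicing): `grad_position` maps a param onto its slice of the group's flat fp32 grad buffer, which works for any shard length.

```python
# Sketch: extracting one param's grad slice from a flat per-group
# fp32 buffer, using (group_idx, dest_offset, num_elements) as in
# grad_position. Works regardless of shard length.
flat_grads = {0: [0.1, 0.2, 0.3, 0.4, 0.5]}   # group_idx -> flat grad buffer
grad_position = {"param_a": (0, 1, 3)}        # param -> (group, offset, numel)

def get_fp32_grad(param_id):
    group_idx, dest_offset, num_elements = grad_position[param_id]
    buf = flat_grads[group_idx]
    # equivalent of tensor.narrow(0, dest_offset, num_elements)
    return buf[dest_offset:dest_offset + num_elements]

print(get_fp32_grad("param_a"))  # -> [0.2, 0.3, 0.4]
```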

When we finish z3, here is an important next feature to support for the `grads` getter: the flags `clipped=True/False` and `scaled=True/False` - currently both get done in `step`, but it'd be...
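As a rough illustration of what such flags could mean (a hypothetical sketch, not a proposed DeepSpeed API; function and parameter names are invented): `scaled=False` would divide the raw grads by the loss scale, and `clipped=True` would additionally rescale them by a global-norm clip factor, both of which normally happen inside `step`.

```python
import math

def get_grads(raw_grads, loss_scale, scaled=True, clipped=False, clip_norm=1.0):
    grads = list(raw_grads)
    if not scaled:
        # undo loss scaling, as step() would before the optimizer update
        grads = [g / loss_scale for g in grads]
    if clipped:
        # global-norm clipping, as step() would apply
        total_norm = math.sqrt(sum(g * g for g in grads))
        if total_norm > clip_norm:
            factor = clip_norm / total_norm
            grads = [g * factor for g in grads]
    return grads

raw = [3.0 * 1024, 4.0 * 1024]  # grads under loss scale 1024
print(get_grads(raw, 1024, scaled=False))                # -> [3.0, 4.0]
print(get_grads(raw, 1024, scaled=False, clipped=True))  # norm 5.0 -> [0.6, 0.8]
```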

One more small but important todo item - docs: should we add it to some file under `doc` in the repo? Something along the lines of:
```
backward(loss) [...] from...
```

Yes, that would be perfect!

Sorry to hear it's broken again, Lucile. Have you by chance validated with `py-spy` that it's the same issue as before, or possibly a new one? Alternatively, if you apply...

That's exciting, Lucile - thank you for the update - we can then focus on solving other issues.