Jae-Won Chung

Results 56 comments of Jae-Won Chung

I am very sorry for the delayed response. I have had little free time recently to actively maintain this repository. Similar issues have arisen quite frequently, and a PR is welcome....

It looks like the cause is https://github.com/pypa/setuptools_scm/issues/457. Reproduction steps:
1. `docker run -it --gpus all deepspeed/deepspeed:latest_torch111 bash` - Probably doesn't exactly have to be `latest_torch111`.
2. `git clone --depth=1 https://github.com/microsoft/deepspeech.git`...
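If the failure really matches setuptools_scm#457 (version detection fails on a shallow `--depth=1` clone), a minimal sketch of two possible workarounds would be:

```shell
# Sketch, assuming the error is setuptools_scm failing to detect a version
# from a shallow clone. Run inside the cloned repository.
#
#   git fetch --unshallow --tags   # restore the history/tags setuptools_scm inspects
#
# Or pin a placeholder version so installation proceeds without git metadata:
export SETUPTOOLS_SCM_PRETEND_VERSION=0.0.0
echo "setuptools_scm will report version: $SETUPTOOLS_SCM_PRETEND_VERSION"
```

The `SETUPTOOLS_SCM_PRETEND_VERSION` route is handy in throwaway containers where the reported version number does not matter.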

Nah, I just reverted to an older version of DeepSpeech2 that didn't use PyTorch Lightning and integrated adaptdl there.

> There are two potential approaches to address this issue, although additional options may also exist:
>
> * Making a change at the NVML library's side to reduce the...

Hi @tohtana :) Maybe you meant this, but I think what's happening is `RecvActivation(buffer_id=0)` writing to `self.pipe_buffers['inputs'][0]`, thereby removing the tensors that hold the gradients (vector-Jacobian products) produced by...

EDIT: Wrong

Manually fixing three lines would look like:
```diff
>>> pprint(list(FixBufferTrainSchedule(8, 4, 2)), width=120)
[[-1],
 [-2],
 [0, RecvActivation(buffer_id=0), ForwardPass(buffer_id=0)],
 [-1, SendActivation(buffer_id=0)],
 [1, RecvActivation(buffer_id=1), ForwardPass(buffer_id=1)],
 [0, SendActivation(buffer_id=1), RecvGrad(buffer_id=0), BackwardPass(buffer_id=0)],
-...
```

> I confirmed that `RecvActivation(buffer_id=0)` updated `self.pipe_buffers['inputs'][buffer_id]` once I fixed `num_pipe_buffers()`. The new value does not have `.grad` and is not even the one computed from the desired microbatch. Is...

Oh I see. Putting `RecvGrad` in front of `SendActivation` is not a problem, because `RecvGrad` actually doesn't write to `self.pipe_buffers` but rather `self.grad_layer`, and the output buffer ID is only...
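The buffer separation described above can be sketched in a few lines. This is a hypothetical minimal model (the class and method names are stand-ins, not DeepSpeed's real implementation) showing why ordering `RecvGrad` before `SendActivation` is safe: the received gradient lands in a dedicated `grad_layer` slot, so it cannot clobber the activation waiting in `pipe_buffers`.

```python
# Hypothetical sketch of the buffer layout discussed above.
class PipeEngineSketch:
    def __init__(self, num_buffers: int):
        # Activation buffers, indexed by buffer_id (as in self.pipe_buffers).
        self.pipe_buffers = {"inputs": [None] * num_buffers,
                             "outputs": [None] * num_buffers}
        # Incoming gradients land here, independent of any buffer_id
        # (as in self.grad_layer).
        self.grad_layer = None

    def recv_activation(self, buffer_id, activation):
        self.pipe_buffers["inputs"][buffer_id] = activation

    def send_activation(self, buffer_id):
        return self.pipe_buffers["outputs"][buffer_id]

    def recv_grad(self, grad):
        self.grad_layer = grad  # does NOT touch pipe_buffers


engine = PipeEngineSketch(num_buffers=2)
engine.pipe_buffers["outputs"][1] = "act-1"
engine.recv_grad("grad-0")         # RecvGrad first...
sent = engine.send_activation(1)   # ...SendActivation still sees its activation
assert sent == "act-1" and engine.grad_layer == "grad-0"
```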

The schedule looks good to me. You could try out the alexnet example in DeepSpeedExamples with a fixed random seed and compare the loss value before and after the...

Thank you for running these! My understanding is that all computation inputs and outputs should be bit-level equivalent before and after the fix, and thus for every training step, the...