Results: 8 issues of YassineYousfi
GradScaler has an argument for enabling/disabling the scaler. When disabled, ``scaler.step()`` simply invokes ``optimizer.step()``, and the other methods are no-ops. I thought this made the code a bit cleaner by...
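A minimal sketch of the pattern described above, assuming a toy model and a hypothetical `use_amp` flag (neither comes from the original change). With `enabled=False`, the same training step runs with no branching on whether mixed precision is active:

```python
import torch

def train_step(model, optimizer, scaler, x, y):
    """One optimization step that reads the same with or without loss scaling."""
    optimizer.zero_grad(set_to_none=True)
    loss = torch.nn.functional.mse_loss(model(x), y)
    # With a disabled scaler, scale() returns the loss unchanged.
    scaler.scale(loss).backward()
    # With a disabled scaler, step() simply calls optimizer.step().
    scaler.step(optimizer)
    # With a disabled scaler, update() is a no-op.
    scaler.update()
    return loss.item()

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)  # toy model, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

use_amp = False  # hypothetical flag; True would enable loss scaling on a CUDA device
# Newer PyTorch releases also expose this class as torch.amp.GradScaler.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

weight_before = model.weight.detach().clone()
loss_value = train_step(model, optimizer, scaler, torch.randn(8, 4), torch.randn(8, 1))
```

The training loop then needs no `if use_amp:` branches; toggling the flag is the only change.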
In the manual implementation of causal self-attention, the causal mask is registered as a buffer, which causes DDP to broadcast it at every step. Excluding it from the broadcast gives...
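One possible way to keep such a buffer out of DDP's buffer synchronization is sketched below. It is an assumption, not necessarily the approach taken in the original change. The module `TinyAttention` and the buffer name `mask` are hypothetical, and `_set_params_and_buffers_to_ignore_for_model` is a private PyTorch helper whose behavior may change between releases:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class TinyAttention(nn.Module):
    """Hypothetical stand-in for a manual causal self-attention layer."""

    def __init__(self, block_size: int, n_embd: int = 16):
        super().__init__()
        self.proj = nn.Linear(n_embd, n_embd)
        # Constant lower-triangular mask; identical on every rank, so
        # re-broadcasting it at each step carries no new information.
        mask = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

model = TinyAttention(block_size=8)

# Collect the fully qualified names of the buffers to exclude.
ignored = [name for name, _ in model.named_buffers() if name.endswith("mask")]

# Private helper: marks these names so that DDP skips them when it
# synchronizes parameters and buffers. It must be called before the
# model is wrapped.
DDP._set_params_and_buffers_to_ignore_for_model(model, ignored)

# After torch.distributed.init_process_group(...), wrap as usual:
#   model = DDP(model, device_ids=[local_rank])
```

Because the mask is rebuilt deterministically in `__init__`, every rank already holds the same values, so skipping the broadcast should not change results.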
Great work, @agrimgupta92! When can we expect the code release? Thanks!
Currently the code only supports bs=1, with input_pos being one-dimensional. This fixes the input_pos shape in the comments.