YassineYousfi

Results: 8 issues by YassineYousfi


GradScaler has an argument for enabling/disabling the scaler. When disabled, ``scaler.step()`` simply invokes ``optimizer.step()``, and the other methods are no-ops. I thought this made the code a bit cleaner by...
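For context, a minimal sketch of how the ``enabled`` flag keeps a single training loop for both mixed and full precision; the model, optimizer, and data below are placeholders for illustration, not code from the PR:

```python
import torch

# Placeholder model/optimizer/data, only for illustration.
model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
use_amp = True  # flip to False to train in full precision with the same loop

# When enabled=False, scale()/update() become no-ops and
# scaler.step(optimizer) simply calls optimizer.step().
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 10, device="cuda")
y = torch.randn(8, 1, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=use_amp):
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)
```

Because the disabled scaler degenerates to a plain ``optimizer.step()``, no ``if use_amp:`` branching is needed in the loop itself.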

In the manual implementation of causal self-attention, the causal mask is registered as a buffer, which causes DDP to broadcast it at every step. Excluding it from being broadcast gives...
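A rough sketch of the situation; the class below is a toy stand-in rather than the repo's code, and the workarounds in the comments are possibilities, not necessarily the exact change made in the PR:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP  # referenced in the comments below

class CausalSelfAttention(nn.Module):
    """Toy module illustrating the buffer issue (names are placeholders)."""

    def __init__(self, block_size: int):
        super().__init__()
        mask = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)
        # Registering the mask as a buffer is convenient (it follows .to()/.cuda()),
        # but DDP broadcasts all module buffers from rank 0 on every forward pass
        # when broadcast_buffers=True (the default), even though this mask never changes.
        self.register_buffer("mask", mask)

# Two possible ways to avoid re-broadcasting a constant mask each step:
# 1) disable buffer broadcasting entirely, which is fine when all buffers are constant:
#    model = DDP(model, device_ids=[rank], broadcast_buffers=False)
# 2) keep the mask as a plain tensor attribute instead of a registered buffer,
#    so DDP never syncs it (at the cost of moving it to the right device manually).
```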

Great work @agrimgupta92! When can we expect the code release? Thanks!

Currently the code only supports bs=1, with input_pos being one-dimensional. This fixes the input_pos shape documented in the comments.
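To illustrate the shape in question, a minimal sketch of a preallocated KV cache written at positions given by a 1-D ``input_pos``; all tensor names and shapes here are assumptions for illustration, not the repo's exact code:

```python
import torch

# Assumed cache layout: (batch, heads, max_seq_len, head_dim), indexed along
# the sequence dimension by input_pos.
batch, heads, max_seq_len, head_dim = 1, 8, 128, 64
k_cache = torch.zeros(batch, heads, max_seq_len, head_dim)

# With bs=1, input_pos is a 1-D tensor of token positions: e.g. torch.tensor([5])
# during incremental decoding, or torch.arange(prompt_len) during prefill.
input_pos = torch.tensor([5])                                  # shape: (num_new_tokens,)
k_new = torch.randn(batch, heads, input_pos.numel(), head_dim)

# Write the new keys into the cache at those positions.
k_cache[:, :, input_pos] = k_new
```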
