Andrew Gu
@pytorchbot merge
I am curious if we have any experiments to see the performance difference with `fused=True`.
@weifengpy `foreach=True` used to be the default, so perhaps your package predates https://github.com/pytorch/torchtitan/pull/386. Without https://github.com/pytorch/torchtitan/pull/386, the optimizer would fall back to `foreach=False` when `fused=False`. 2000 ms for optimizer...
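To make the fallback concrete, here is a rough sketch in plain Python. `resolve_impl` is a simplified model of how the optimizer picks an implementation from the `foreach` and `fused` flags, written as an assumption and not copied from PyTorch. `optimizer_flags` is a hypothetical helper showing one way to avoid the fallback by always passing `foreach` explicitly; it is not necessarily what the PR does.

```python
from typing import Optional


def resolve_impl(foreach: Optional[bool], fused: Optional[bool], on_cuda: bool = True) -> str:
    """Simplified model (an assumption) of the optimizer's implementation choice."""
    if foreach and fused:
        raise ValueError("foreach and fused cannot both be True")
    if foreach is None and fused is None:
        # Neither flag was set: assume the foreach path is the default on CUDA.
        foreach = on_cuda
    if fused:
        return "fused"
    if foreach:
        return "foreach"
    # An explicit fused=False with foreach left unset lands here.
    return "for-loop"


def optimizer_flags(fused: bool) -> dict:
    """Hypothetical helper: pass foreach explicitly as the complement of fused."""
    return {"fused": fused, "foreach": not fused}


if __name__ == "__main__":
    print(resolve_impl(foreach=None, fused=None))   # default path
    print(resolve_impl(foreach=None, fused=False))  # the fallback described above
    print(resolve_impl(**optimizer_flags(False)))   # foreach set explicitly
```

Under this model, only passing `fused=False` silently gives up the `foreach` path, which would be consistent with a slow optimizer step.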
Should we call out that this table assumes that we are only applying QLoRA to the FFNs?
cc: @tianyu-l maybe? since @wanchaol is out for a bit
@bosmart I think that if the feed-forward has both `w1` and `w3` as `ColwiseParallel` (like SwiGLU), then we prefer to only redistribute the input from `(S(1),) -> (R,)`...