Yu Zhang comments

Results: 89 comments of Yu Zhang

[Bug]: Backward pass with AMP not working with GLA and GSA

@Niccolo-Ajroldi Hi, before using autocast, you should first cast the model to bfloat16 with `model.bfloat16()`.
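
A minimal sketch of that ordering, for anyone landing here later. It assumes a plain `nn.Linear` as a stand-in for the GLA/GSA layers and runs on CPU so it works without a GPU; neither choice comes from the issue itself, and this is not the repository's own test code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in module (assumption): the issue concerns GLA/GSA layers, not nn.Linear.
model = nn.Linear(8, 8)

# Cast the parameters to bfloat16 first, as suggested in the comment.
model = model.bfloat16()

x = torch.randn(2, 8, dtype=torch.bfloat16)

# Then run the forward pass under autocast with the matching dtype.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
    loss = out.float().pow(2).mean()

# The backward pass runs outside the autocast context.
loss.backward()
```

After `backward()`, the parameter gradients have the same dtype as the parameters, so here they are bfloat16 as well.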

[Bug]: Backward pass with AMP not working with GLA and GSA

Ah, thank you for the reminder, I just fixed this bug for autocast.

[RFC] Add TTT and Titans kernel

TTT-linear is almost done: https://github.com/fla-org/flash-linear-attention/pull/151 cc @Pan-Yuqi Need some checks for T

`LigerFusedLinearCrossEntropyLoss` Causes Training Loss to Diverge After Reaching ~8

@penghui-yang Hi, reducing the number of chunks might help. I heard from a friend that this can improve stability. Check out my adapted code, which...

`LigerFusedLinearCrossEntropyLoss` Causes Training Loss to Diverge After Reaching ~8

@penghui-yang How about doing the matmuls in TF32? I think this would reduce the accumulation errors.

[Feature] Support skipping bad grad updates

@fegin @tianyu-l I’d like to share a few scenarios where skipping gradient updates might be beneficial. For instance, when working with data from multiple fields or datasets that aren’t perfectly...
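
To make the idea concrete, here is a rough sketch of one possible skip rule. Everything in it is an assumption for illustration: the names `should_skip_update` and `filter_steps`, the `spike_factor` threshold, and the rule itself (skip when the gradient norm is non-finite or far above the recent average). It is not an existing API of any training framework.

```python
import math


def should_skip_update(grad_norm, recent_norms, spike_factor=10.0):
    """Hypothetical rule: skip on a non-finite norm or a large spike."""
    if not math.isfinite(grad_norm):
        return True
    if recent_norms:
        mean_norm = sum(recent_norms) / len(recent_norms)
        if grad_norm > spike_factor * mean_norm:
            return True
    return False


def filter_steps(grad_norms, window=3, spike_factor=10.0):
    """Return the indices of the steps whose update would be applied."""
    recent = []
    applied = []
    for step, norm in enumerate(grad_norms):
        if should_skip_update(norm, recent, spike_factor):
            # A real training loop would drop the gradients here
            # instead of calling the optimizer step.
            continue
        applied.append(step)
        # Only accepted steps feed the running window of recent norms.
        recent = (recent + [norm])[-window:]
    return applied


applied_steps = filter_steps([1.0, 1.2, 50.0, float("nan"), 0.9])
```

Only accepted steps update the window, so a single spike does not raise the threshold for the steps that follow it.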

best practice for `snapshot_every_n_steps`

@andrewkho Hi, I'm wondering whether this arg affects final state loading. For example, if `snapshot_every_n_steps=4`, what happens if I want to `load_state_dict` or save `state_dict` at the 5th...

best practice for `snapshot_every_n_steps`

@andrewkho Thank you for the nice response! That is a very clever tradeoff.

best practice for `snapshot_every_n_steps`

@andrewkho It would be helpful to add an explanation of “fast forward” to the docs for anyone curious about it.
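
For readers curious about the term, here is a toy model of one plausible reading of "fast forward": state is snapshotted only every n steps, and a state saved between snapshots is restored by loading the last snapshot and replaying the remaining steps. The class `ToyResumableIterator` and its state keys are made up for illustration; this is not the library's implementation.

```python
class ToyResumableIterator:
    """Toy iterator that snapshots its position every n steps (hypothetical)."""

    def __init__(self, data, snapshot_every_n_steps):
        self.data = list(data)
        self.n = snapshot_every_n_steps
        self.pos = 0           # number of items yielded so far
        self.snapshot_pos = 0  # position recorded at the last snapshot

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= len(self.data):
            raise StopIteration
        item = self.data[self.pos]
        self.pos += 1
        if self.pos % self.n == 0:
            self.snapshot_pos = self.pos  # take a snapshot every n steps
        return item

    def state_dict(self):
        # Between snapshots, record how many steps have passed since the last one.
        return {
            "snapshot_pos": self.snapshot_pos,
            "steps_since_snapshot": self.pos - self.snapshot_pos,
        }

    def load_state_dict(self, state):
        # Restore the last snapshot, then fast-forward by replaying steps.
        self.pos = state["snapshot_pos"]
        self.snapshot_pos = state["snapshot_pos"]
        for _ in range(state["steps_since_snapshot"]):
            next(self)


it = ToyResumableIterator(range(10), snapshot_every_n_steps=4)
consumed = [next(it) for _ in range(5)]  # stop at the 5th step
state = it.state_dict()

resumed = ToyResumableIterator(range(10), snapshot_every_n_steps=4)
resumed.load_state_dict(state)
next_item = next(resumed)
```

In this toy model, saving at the 5th step with `snapshot_every_n_steps=4` still resumes at the right item; the cost is one replayed step at load time, which is the tradeoff the thread refers to.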