Yu Zhang

Results: 91 comments by Yu Zhang

@Triang-jyed-driung Good point! Your contributions are welcome; we will test `0.1 * sqrt(1/d)` soon. A minimal sketch of that initialization is below.
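A minimal sketch of what the proposed scale could look like in practice, assuming `d` is the input dimension of the layer; the helper name `init_linear_` is hypothetical, not part of FLA:

```py
import math

import torch.nn as nn


def init_linear_(layer: nn.Linear) -> None:
    # hypothetical helper: normal init with std = 0.1 * sqrt(1/d),
    # where d is the layer's input dimension
    d = layer.weight.shape[1]
    std = 0.1 * math.sqrt(1.0 / d)
    nn.init.normal_(layer.weight, mean=0.0, std=std)
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)
```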

> Include a flag (e.g., use_default_init) in the model configuration or constructor. When set to False, this flag would disable FLA's default initialization logic entirely, allowing users to apply their...
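A minimal sketch of the suggested flag; the module name and init scheme here are hypothetical, not FLA's actual API:

```py
import torch.nn as nn


class MyLayer(nn.Module):
    """Hypothetical layer illustrating a `use_default_init` switch."""

    def __init__(self, d_model: int, use_default_init: bool = True):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        if use_default_init:
            self._default_init()
        # with use_default_init=False, parameters keep PyTorch's own
        # initialization, and users can apply a custom scheme afterwards

    def _default_init(self):
        nn.init.xavier_uniform_(self.proj.weight)
        if self.proj.bias is not None:
            nn.init.zeros_(self.proj.bias)
```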

@conceptofmind Hi, why not directly use `torch.compile`? I think it would also give a reasonable speedup.
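For illustration, a minimal example of the `torch.compile` route: an elementwise chain that the compiler can typically fuse into fewer kernels (the function itself is made up for the sketch):

```py
import torch
import torch.nn.functional as F


def gated_op(x: torch.Tensor) -> torch.Tensor:
    # elementwise chain that torch.compile can usually fuse
    return F.silu(x) * x.sigmoid()


compiled_op = torch.compile(gated_op)
y = compiled_op(torch.randn(4, 1024))
```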

@conceptofmind In your case, this is necessary. `torch.compile` is still not smart enough to avoid the one additional activation (which can be recomputed cheaply in the backward pass). But it...
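A sketch of the recompute-in-backward idea mentioned above, using a custom `torch.autograd.Function` that saves only the input and rebuilds the cheap activation during backward; this is an illustration, not FLA's actual kernel:

```py
import torch
import torch.nn.functional as F


class RecomputedSiLU(torch.autograd.Function):
    """SiLU that stores only its input and recomputes the activation
    in backward, trading a little compute for activation memory."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # keep x, not silu(x)
        return F.silu(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        s = torch.sigmoid(x)  # recomputed, cheap
        # d/dx [x * sigmoid(x)] = s * (1 + x * (1 - s))
        return grad_out * (s * (1 + x * (1 - s)))
```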

@conceptofmind I think not, lol, which makes designing TP plans much harder. For the native torch DTensor APIs, one needs to design module pre/post hooks to handle inputs and outputs (see the sketch below), so we...
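A rough sketch of the hook-based handling alluded to here, under the assumption of a recent PyTorch where `torch.distributed.tensor` exposes the DTensor APIs (older versions use `torch.distributed._tensor`); the mesh setup and sharding layout are assumptions for illustration:

```py
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Replicate, Shard, distribute_tensor

# assumes torch.distributed is already initialized
mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))


def pre_hook(module, args):
    # replicate the plain local input across the mesh before the sharded matmul
    x = args[0]
    if not isinstance(x, DTensor):
        x = DTensor.from_local(x, mesh, [Replicate()])
    return (x,) + args[1:]


def post_hook(module, args, output):
    # gather the sharded output back into a plain, fully-replicated tensor
    if isinstance(output, DTensor):
        output = output.redistribute(mesh, [Replicate()]).to_local()
    return output


linear = nn.Linear(512, 512)
linear.weight = nn.Parameter(distribute_tensor(linear.weight, mesh, [Shard(0)]))
linear.bias = nn.Parameter(distribute_tensor(linear.bias, mesh, [Shard(0)]))
linear.register_forward_pre_hook(pre_hook)
linear.register_forward_hook(post_hook)
```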

@conceptofmind Thank you in advance!

Check out https://github.com/fla-org/flash-linear-attention/commit/f14178c233725a2484540bdc413fb1086de279cc — Lightning Attention has been integrated into `fla`.

@Triang-jyed-driung Hi, I think a better way is to modify the config file of the pretrained ckpt. Have you tried

```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    args.model_name,
    trust_remote_code=True,
    fuse_norm=True,
)
```

@n2729648074 Have you found the problem? I don't have an environment at hand to reproduce the bugs :-(

@cgz6498 Hi, can you reproduce the problem when running the example code in https://github.com/sustcsonglin/flash-linear-attention/tree/main/training?