Yu Zhang
@juankost Closing this as answered by the link you provided. We haven't experimented much with `torch.compile` yet. PRs are welcome if you manage to fix this issue.
@OREYR Hi, I'm not sure what happened here. Could you provide more details about your experimental settings: scheduler, data, model framework, etc.?
@OREYR Does that mean you randomly initialized your model again? For newly initialized models, an lr of `1e-5` is too small.
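A minimal sketch of what I mean, assuming a plain PyTorch setup (the model and lr values here are just illustrative):

```python
import torch

# Hypothetical stand-in for your re-initialized network.
model = torch.nn.Linear(1024, 1024)

# When training from random init, an lr around 1e-4 ~ 1e-3 is more typical;
# 1e-5 is better suited to fine-tuning already-trained weights.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```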
Thank you for reporting this. I'll look into it.
@OREYR It looks like you wrapped the classifier with LoRA as well, and the original random params are frozen?
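If that's the case, one common fix with `peft` is to exclude the classifier head from LoRA and train it in full via `modules_to_save`. A sketch (the base model and module names here are assumptions; substitute your own):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

# Hypothetical base model; replace with your own checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],  # apply LoRA to attention projections only
    modules_to_save=["classifier"],     # train the randomly-init'd head in full, not via LoRA
)
model = get_peft_model(model, config)
```

With `modules_to_save`, the classifier's parameters stay trainable instead of being frozen alongside the rest of the base model.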
@OREYR One thing to confirm: how is the MLP called in your PEFT modules? I wrote some fused kernels in this module to save memory, so please check the implementations to...
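To check, you can list the wrapped module names and classes after applying PEFT (a sketch, assuming `model` is your PEFT-wrapped model):

```python
# Print every submodule whose name mentions "mlp" to see which
# implementation (fused or not) your PEFT wrapper actually calls.
for name, module in model.named_modules():
    if "mlp" in name.lower():
        print(name, type(module).__name__)
```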
@OREYR Could you paste a full, runnable script from which I can observe the abnormal values here?
@sayakpaul FYI, we've released some weights converted from Mistral-7B-v0.1 as described in [arXiv:2405.06640](https://arxiv.org/abs/2405.06640). You can try them by loading `fla-hub/gla-7B-mistral-20B`, `fla-hub/gsa-7B-mistral-20B`, or `fla-hub/gsa-7B-mistral-100B`.
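A loading sketch, assuming the usual pattern where importing `fla` registers the model classes with `transformers` (only the repo names above come from the release; the rest is boilerplate):

```python
import fla  # registers GLA/GSA model classes with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "fla-hub/gla-7B-mistral-20B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
```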
@Yingyue-L Hi, see #32 for more throughput comparisons. You may need a larger seq_len to fully unlock the potential of linear attention (LA); as shown in [DeltaNet](https://arxiv.org/abs/2406.06484), LAs do...
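A rough way to see the effect yourself is to time forward passes at growing sequence lengths (a sketch; the helper, vocab size, and lengths are illustrative, and a CUDA device is assumed):

```python
import time
import torch

@torch.no_grad()
def time_forward(model, seq_len, batch_size=1, vocab_size=32000, n_iters=5):
    # Average forward latency over n_iters runs at a given sequence length.
    # Linear attention tends to pull ahead of softmax attention as seq_len grows.
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        model(input_ids)
    torch.cuda.synchronize()
    return (time.time() - start) / n_iters

# for L in (2048, 8192, 32768):
#     print(L, time_forward(model, L))
```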
@howard-hou Thanks for reporting this issue; we'll look into it soon.