Yu Zhang
@divija96 Hi, could you check your Python version? I encountered this error under py27 as well. Please make sure your environment is Python >= 3.6.
Could you provide some examples from the processed prop file? I suspect there might be some errors.
Oh, it looks like you may need to switch back to logsigmoid; -exp is not stable yet.
This update fixes potential NaNs during inference, so I don't think it's the issue. It is possibly caused by a potential inf gradient of -exp; I will check it, thank you.
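Roughly what I mean, as a minimal sketch in plain PyTorch (not the fla kernels), is that the `-exp` gate parameterization can overflow in both the forward value and its gradient, while `logsigmoid` stays bounded:

```python
import torch
import torch.nn.functional as F

# Hypothetical gate pre-activations; a large value like this can appear
# after a few unstable training steps.
x = torch.tensor([1.0, 10.0, 100.0], requires_grad=True)

# -exp parameterization: both the forward value and the gradient are
# -exp(x), which overflows float32 to -inf for large x.
g_exp = -torch.exp(x)
g_exp.sum().backward()
print(g_exp)   # tensor([-2.7183e+00, -2.2026e+04, -inf])
print(x.grad)  # the gradient -exp(x) also hits -inf

x.grad = None

# logsigmoid parameterization: the forward value is <= 0 and the
# gradient sigmoid(-x) is bounded in (0, 1), so it cannot overflow.
g_ls = F.logsigmoid(x)
g_ls.sum().backward()
print(g_ls)    # finite, non-positive values
print(x.grad)  # bounded gradients in (0, 1)
```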
Have you compared the kernel speeds?
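Something like the following would do (a sketch only: the shapes and the two callables `kernel_a`/`kernel_b` are placeholders, not actual fla entry points; in practice you would drop in the two kernels you want to compare):

```python
import torch
import triton

# Placeholder inputs for the comparison.
B, H, T, D = 4, 8, 2048, 64
q = torch.randn(B, H, T, D, device='cuda', dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

def kernel_a():
    # stand-in for one implementation (e.g. the chunk kernel)
    return (q @ k.transpose(-1, -2)).tril() @ v

def kernel_b():
    # stand-in for the other implementation (e.g. the fused recurrent kernel);
    # here it is the same math purely for illustration
    return (q @ k.transpose(-1, -2)).tril() @ v

# do_bench handles warmup and CUDA synchronization and reports time in ms.
ms_a = triton.testing.do_bench(kernel_a)
ms_b = triton.testing.do_bench(kernel_b)
print(f'kernel_a: {ms_a:.3f} ms, kernel_b: {ms_b:.3f} ms')
```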
You can enable gradients for h0 manually.
Would taking h0 as a learnable parameter be OK? Something like `h0 = nn.Parameter(torch.zeros(key_dim, head_dim))`?
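As a minimal sketch (plain PyTorch; the shape `(num_heads, key_dim, head_dim)` and how h0 is consumed by the layer are assumptions for illustration only):

```python
import torch
import torch.nn as nn

class LearnableInitialState(nn.Module):
    """Hold h0 as a learnable parameter and broadcast it per batch."""
    def __init__(self, num_heads: int, key_dim: int, head_dim: int):
        super().__init__()
        # nn.Parameter must wrap a tensor; nn.Parameter(key_dim, head_dim) is invalid.
        self.h0 = nn.Parameter(torch.zeros(num_heads, key_dim, head_dim))

    def forward(self, batch_size: int) -> torch.Tensor:
        # Share the same initial state across the batch dimension.
        return self.h0.unsqueeze(0).expand(batch_size, -1, -1, -1)

state = LearnableInitialState(num_heads=4, key_dim=64, head_dim=64)
h0 = state(batch_size=2)  # (2, 4, 64, 64), tracked by autograd

# Or, to enable gradients on a plain tensor manually:
h0_plain = torch.zeros(4, 64, 64).requires_grad_(True)
```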
I see, there is currently no access to the gradient of the states; we will add an option later.
@JL-er Hi, check it out: https://github.com/sustcsonglin/flash-linear-attention/commit/1547448b998a163fdb33c49266da699db13f2dc8 We no longer truncate the gradient of the h states for RWKV6, for ease of state tuning. Do contact us if you meet any bugs...
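For reference, state tuning then looks roughly like this once gradients flow back through the h states: freeze the model weights and optimize only the initial state. This is a sketch only; `model`, the `initial_state` keyword, the state shape, and the objective are placeholders, not the actual RWKV6 setup.

```python
import torch
import torch.nn.functional as F

def tune_state(model, data, num_heads=4, key_dim=64, head_dim=64, lr=1e-2):
    # Freeze all model weights; only the initial state is trained.
    for p in model.parameters():
        p.requires_grad_(False)

    # The trainable initial state, shared across the batch.
    h0 = torch.zeros(num_heads, key_dim, head_dim, requires_grad=True)
    optimizer = torch.optim.Adam([h0], lr=lr)

    for x, y in data:
        out = model(x, initial_state=h0)                      # assumed keyword
        loss = F.cross_entropy(out.flatten(0, 1), y.flatten())  # placeholder objective
        optimizer.zero_grad()
        loss.backward()  # only works if the grad of h states is not truncated
        optimizer.step()
    return h0
```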
@Ronsor Hi, may I know why you need this? I think it would be hard to use `fla` anyway if `transformers` is unavailable. Currently this package is heavily tied to 🤗...