Abdul Fatir
@normster @tridao @albertfgu I believe this feature would be very nice to have in a stable release. Can we work towards merging this into main and having it in the...
No, it does not, so this only tests the padding aspect.
@matteoguarrera Unfortunately, despite trying for several weeks, I wasn't able to reproduce a number anywhere close to what's reported in the paper. Finally, for these datasets, I just copied the...
In the KL expression you have a `-1` term and you take a sum over the latent dimension, so the `-1` alone sums to `-z_dim`, which is what is written.
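For concreteness, a minimal sketch of that equivalence, assuming the standard diagonal-Gaussian KL against a unit Gaussian prior; the names `mu`, `logvar`, and `z_dim` are illustrative, not taken from the code under discussion:

```python
import torch

# Assumed form: KL(q || N(0, I)) = 0.5 * sum_d (mu_d^2 + sigma_d^2 - log sigma_d^2 - 1)
torch.manual_seed(0)
z_dim = 16
mu, logvar = torch.randn(4, z_dim), torch.randn(4, z_dim)

# Per-dimension form, with the -1 inside the sum over the latent dimension.
kl_a = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1)

# Equivalent form: the -1 summed over z_dim dimensions is just -z_dim.
kl_b = 0.5 * (torch.sum(mu.pow(2) + logvar.exp() - logvar, dim=-1) - z_dim)

assert torch.allclose(kl_a, kl_b)
```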
@ArthurZucker Thank you for this amazing addition. Are there any plans to add something equivalent to `attention_mask` for Mamba?
- For batched inference with inputs of different lengths (see the sketch below for how this works today with transformer models).
- For pretraining with masking schemes other than a causal mask.
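A minimal sketch of the first bullet as it works today for transformer models in `transformers`; `gpt2` is just a stand-in model, not tied to this thread. The request is for Mamba models to accept an equivalent mask so padded positions are ignored in the same way:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Left padding plus an attention_mask lets sequences of different lengths
# share a batch during generation.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

batch = tokenizer(
    ["Mamba is a state space model", "Hello"],
    padding=True,
    return_tensors="pt",
)

# attention_mask marks real tokens (1) vs. padding (0); generate() uses it to
# ignore the padded positions.
outputs = model.generate(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    max_new_tokens=20,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```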
@ArthurZucker for the T5 family of models, attention bias is required, so flash-attention won't work for now, but torch SDPA can still use the memory-efficient kernel from xformers, right?...
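A minimal sketch of what I mean, assuming a recent PyTorch: SDPA accepts an additive float `attn_mask`, which is how a bias like T5's relative position bias can be passed. With such a mask the flash backend isn't eligible, but the memory-efficient backend can still take this path. The tensor names and shapes below are illustrative only:

```python
import torch
import torch.nn.functional as F

batch, heads, q_len, kv_len, head_dim = 2, 8, 16, 16, 64
q = torch.randn(batch, heads, q_len, head_dim)
k = torch.randn(batch, heads, kv_len, head_dim)
v = torch.randn(batch, heads, kv_len, head_dim)

# Additive float bias, broadcastable to (batch, heads, q_len, kv_len); this
# plays the role of T5's relative position bias here.
position_bias = torch.randn(1, heads, q_len, kv_len)

# SDPA adds the bias to the attention scores before the softmax.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=position_bias)
print(out.shape)  # (batch, heads, q_len, head_dim)
```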
I can open a PR for T5 with SDPA then. Are there specific things that I should know of, or a reference that I can look at?
@sayakpaul sorry, I was on vacation. Will look into this now and maybe open a PR in a couple of days. I didn't know that there were diffusion models using...
@fxmarty @ArthurZucker @sayakpaul I have opened a PR #30375 for T5. I still have a couple of questions due to some tests failing. Let's discuss those on the PR.