Abdul Fatir

119 comments by Abdul Fatir

@normster @tridao @albertfgu I believe this feature would be very nice to have in a stable release. Can we work towards merging this into main and having it in the...

No, it does not, so this only tests the padding aspect.

@matteoguarrera Unfortunately, despite trying for several weeks, I wasn't able to reproduce a number anywhere close to what's reported in the paper. Finally, for these datasets, I just copied the...

In the KL expression you have `-1` and you take a sum over the latent dimension. This would just sum to `-z_dim`, which is what is written.
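For reference, a generic illustration of where such a constant comes from, assuming the usual diagonal-Gaussian VAE KL (the exact expression from the thread is not shown in this snippet):

```latex
% Standard KL between a diagonal-Gaussian posterior and a standard-normal prior
% (assumed form; not necessarily the exact expression being discussed).
D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\big\|\,\mathcal{N}(0, I)\big)
  = \frac{1}{2}\sum_{i=1}^{z_{\mathrm{dim}}}\left(\mu_i^2 + \sigma_i^2 - \log\sigma_i^2 - 1\right)
```

Summing the `-1` term over the `z_dim` latent dimensions contributes `-z_dim` to the sum (before any outer constant factor), which is the constant being pointed out.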

@ArthurZucker Thank you for this amazing addition. Are there any plans to add something equivalent to `attention_mask` for Mamba?

- For batched inference with inputs of different lengths.
- For pretraining with masking schemes other than a causal mask.
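For context, this is roughly what the requested behavior looks like for attention-based models in `transformers` (the model name and prompts below are purely illustrative): shorter sequences are padded and `attention_mask` tells the model which positions are real tokens.

```python
# Minimal sketch (illustrative, not from the thread): batched generation with
# padded inputs for an attention-based causal LM. The attention_mask produced
# by the tokenizer is the piece that has no direct equivalent for Mamba here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token by default
tokenizer.padding_side = "left"             # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["The capital of France is", "Hello"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    # `inputs` contains both input_ids and attention_mask.
    out = model.generate(
        **inputs,
        max_new_tokens=10,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.batch_decode(out, skip_special_tokens=True))
```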

@ArthurZucker for the T5 family of models, attention bias is required, so flash-attention won't work for now, but torch SDPA can still use the memory-efficient kernel from xformers, right?...
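A rough sketch of the mechanism being referred to (shapes and the bias tensor below are made up for illustration): PyTorch's `scaled_dot_product_attention` accepts an additive `attn_mask`, which is how a T5-style relative position bias could be folded into a fused attention kernel.

```python
# Illustrative sketch only: passing an additive attention bias (as T5 uses for
# relative positions) to torch SDPA via attn_mask. Shapes are arbitrary.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Additive bias broadcast over the batch, analogous to T5's relative position bias.
position_bias = torch.randn(1, heads, seq_len, seq_len)

# SDPA adds attn_mask to the attention scores before the softmax; which backend
# (flash, memory-efficient, or math) is used depends on the PyTorch version and inputs.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=position_bias)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```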

I can open a PR for T5 with SDPA then. Are there specific things that I should know of, or a reference that I can look at?

@sayakpaul sorry, I was on vacation. Will look into this now and maybe open a PR in a couple of days. I didn't know that there were diffusion models using...

@fxmarty @ArthurZucker @sayakpaul I have opened a PR #30375 for T5. I still have a couple of questions due to some tests failing. Let's discuss those on the PR.