Birch-san
> I would assume re-ordering them back into place once would definitely be more performant than using a gather/scatter iterator, at least if we're calling NA more than once.

yeah,...
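(for illustration only, a minimal sketch of the reorder-once idea under my own assumptions about the scenario: tokens arriving in a permuted order, with several neighbourhood-attention blocks applied in sequence. `blocks` and the permutation handling here are hypothetical stand-ins, not NATTEN API.)

```python
import torch

def run_blocks_reorder_once(x: torch.Tensor, perm: torch.Tensor, blocks) -> torch.Tensor:
    """x: (batch, seq, dim) tokens stored in `perm` order; `blocks` is any sequence of
    neighbourhood-attention callables that expect spatially-ordered tokens."""
    inv_perm = perm.argsort()   # inverse permutation, computed once
    x = x[:, inv_perm]          # put tokens back into spatial order once, up front
    for block in blocks:        # every NA call then sees contiguous, ordered tokens
        x = block(x)
    return x[:, perm]           # restore the caller's original ordering once, at the end
```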
noticed a fun algebraic simplification, so I'll note it here. in the case where:

- `train_kernel_size == inference_kernel_size`
- dropout was used during training
- no masking was used during...
> possible to have just one floating point scale for all queries?

not really. each query attends to a different number of keys. it's hard for the user to compute...
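(a sketch of why one scalar can't cover it, under my own simplifications: with a key-padding mask, each query's 1-D neighbourhood contains a different number of valid keys, so any count-dependent correction has to be per-query. the log-ratio scale at the end is just one plausible choice, an assumption on my part rather than necessarily the scale under discussion, and the edge windows here are zero-padded instead of shifted the way NATTEN shifts them.)

```python
import math
import torch
import torch.nn.functional as F

def per_query_key_counts(key_mask: torch.Tensor, kernel_size: int) -> torch.Tensor:
    """key_mask: (batch, seq) bool, True = real token. returns (batch, seq): how many
    valid keys fall inside each query's centred 1-D window (edges zero-padded, not
    shifted, to keep the sketch short)."""
    pad = kernel_size // 2
    padded = F.pad(key_mask.float(), (pad, pad))
    windows = padded.unfold(-1, kernel_size, 1)   # (batch, seq, kernel_size)
    return windows.sum(-1)

batch, seq, kernel_size, train_kernel_size = 2, 16, 7, 13
lengths = torch.tensor([[16], [10]])
key_mask = torch.arange(seq).expand(batch, seq) < lengths
counts = per_query_key_counts(key_mask, kernel_size)
# hypothetical per-query correction (entropy-style log ratio), folded into the logits:
scale = counts.clamp(min=2).log() / math.log(train_kernel_size)
```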
> Fused attention kernels generally don't guarantee realizing all attention weights for even 1 query

ah, okay. if neither the user nor NATTEN can know (early enough to inform scaling)...
here's how to formulate it as an EDM target:
https://github.com/crowsonkb/k-diffusion/blob/6ab5146d4a5ef63901326489f31f1d8e7dd36b48/k_diffusion/layers.py#L65

here's how to formulate it as an x0 loss weighting:
https://github.com/Birch-san/k-diffusion/blob/9bce54aec1e596548cf73f56f4842c11aa6271c6/k_diffusion/layers.py#L160

here's an alternative style for expressing it as an...
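(for anyone skimming: a rough sketch in my own notation, not the code at those links, of the general equivalence being pointed at — plain MSE against the "EDM target" that the raw network predicts is the same objective as an x0 reconstruction loss weighted by `1 / c_out**2`. coefficients follow Karras et al. (2022); `sigma_data = 1.0` is an assumption.)

```python
import torch

sigma_data = 1.0  # assumed; EDM's data-std hyperparameter

def edm_c(sigma: torch.Tensor):
    """Karras et al. (2022) preconditioner coefficients."""
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    return c_skip, c_out

def edm_target_loss(model_out, x0, noised, sigma):
    """plain MSE against the EDM target that the raw network F should predict."""
    c_skip, c_out = edm_c(sigma)
    target = (x0 - c_skip * noised) / c_out
    return (model_out - target).pow(2).mean()

def x0_weighted_loss(model_out, x0, noised, sigma):
    """same thing viewed as an x0 loss: reconstruct x0 via the preconditioner,
    then weight the squared error by 1 / c_out**2."""
    c_skip, c_out = edm_c(sigma)
    denoised = c_skip * noised + c_out * model_out
    return ((denoised - x0).pow(2) / c_out**2).mean()

# the two formulations are algebraically identical:
x0 = torch.randn(4, 3, 8, 8)
sigma = torch.rand(4, 1, 1, 1) + 0.1
noised = x0 + sigma * torch.randn_like(x0)
model_out = torch.randn_like(x0)
assert torch.allclose(edm_target_loss(model_out, x0, noised, sigma),
                      x0_weighted_loss(model_out, x0, noised, sigma), atol=1e-4)
```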