ru-dalle
Sparse attention support
Currently, the inference code materializes the full attention matrix and then masks it. Sparse attention implementations (e.g. block-sparse kernels written in Triton) are more efficient. Does the pre-training code support sparse attention, and will it ever be released?
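For context, here is a minimal NumPy sketch (not the actual ru-dalle code) of the "build the full matrix, then mask it" pattern the question refers to: the entire (seq, seq) score matrix is allocated even though masked positions contribute nothing, which is the cost a sparse kernel avoids.

```python
import numpy as np

def dense_masked_attention(q, k, v, mask):
    """Dense attention that materializes the full (seq, seq) score
    matrix and only afterwards zeroes out masked positions."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # full seq x seq matrix
    scores = np.where(mask, scores, -np.inf)        # mask after the fact
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Tiny causal example: position i may only attend to positions <= i.
seq, dim = 4, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((seq, dim)) for _ in range(3))
causal = np.tril(np.ones((seq, seq), dtype=bool))
out = dense_masked_attention(q, k, v, causal)
print(out.shape)  # (4, 8)
```

A sparse implementation would instead skip the masked blocks entirely, never allocating or computing them, which is why block-sparse kernels (such as those written in Triton) save both memory and FLOPs on masks like this.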