Birch-san
It's possible to install manually. Ignore `WixSetup.msi`. Use your favourite unarchiving utility to extract `cab1.cab`. You'll see the following files:

```
JUCE_framework_GPL3.txt
Steinberg_VST3_GPL3.txt
juicysfplugin.dll
juicysfplugin.exe
juicysfplugin.vst3
libFLAC_8.dll
libflac_New_BSD.txt
libfluidsynth_2.dll
libfluidsynth_LGPL_2.1.txt
...
```
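for example, a minimal sketch that shells out to 7-Zip (assuming `7z` is installed and on your PATH; any cab-aware archiver works just as well):

```python
import subprocess

# extract the contents of the cabinet into the current directory;
# 7-Zip's `x` command preserves the archive's internal paths
subprocess.run(["7z", "x", "cab1.cab"], check=True)
```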
ah okay, that sounds problematic. will be sure to update dependencies next time I cut a release.
@mrbumpy409 Windows and macOS releases with latest FluidSynth now available: https://github.com/Birch-san/juicysfplugin/releases
I think the backwards pass may not require any changes. I followed [these steps](https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/#derivative-of-softmax) to compute the softmax derivative, with `1+∑` substituted into the softmax denominator, and still ended up...
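(concretely: with denominator `D = 1 + ∑ exp(x_k)`, we still have `∂D/∂x_j = exp(x_j) = σ_j·D`, so the Jacobian keeps the familiar form `σ_i(δ_ij − σ_j)` and autograd needs no special-casing.) a minimal sketch of the variant, with a gradcheck to confirm; the name `softmax1` is my own label, not from this thread:

```python
import torch

def softmax1(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """softmax with an extra `1 +` in the denominator (an implicit zero logit)."""
    # stabilise against overflow: the implicit zero logit means the
    # effective max is max(0, max(x))
    m = x.max(dim=dim, keepdim=True).values.clamp(min=0)
    e = torch.exp(x - m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

# autograd's backward for this composition matches the hand-derived Jacobian,
# so no custom backward pass is required
x = torch.randn(4, 8, dtype=torch.double, requires_grad=True)
assert torch.autograd.gradcheck(softmax1, (x,))
```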
thanks for thinking about it! I guess another consideration here is that the currently-foreseeable use-case for this is during training only. so optimization effort in the forward pass...
Thanks for explaining. I think the mask **can** be expressed as a coordinate mapping, yes. By that do you mean something like starting with a 2D BoolTensor then using .to_sparse()...
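i.e. something like this sketch (toy mask; the values are made up for illustration):

```python
import torch

# a 2D boolean mask over a [rows, cols] grid
mask = torch.tensor([[True, False, True],
                     [False, True, False]])

# .to_sparse() gives a COO tensor; its indices() are exactly the
# coordinates of the kept (True) positions, shape [2, nnz]
coords = mask.to_sparse().indices()
print(coords)
# tensor([[0, 0, 1],
#         [0, 2, 1]])
```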
no worries, and congrats on your internship!

> any random / non-deterministic masking pattern

ah okay yeah, not possible with a MaskDiT-style mask. completely random.

> pretty much all throughout...
> * Given Q, K and V of shape `[B, *, heads, dim]`, and any valid kernel size, dilation, causal masking,
>
> * Take an additional optional boolean tensor,...
I think the random ordering is not a _desired_ property (they don't mention shuffling in [the paper](https://openreview.net/pdf?id=vTBjBtGioE)); you're right that it doesn't make a difference, so they went...
okay yeah, MDT uses [the same trick](https://github.com/sail-sg/MDT/blob/7d26c2162c462bd0b90f97f3a1c36cdaaac616ec/masked_diffusion/models.py#L465) of randomly picking n% of tokens, using torch.gather() on the chosen indices to create a shortened sequence. so arbitrary masks will be useful...
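for reference, a minimal sketch of that gather trick (shapes and keep ratio are made up; see the MDT link above for the real thing):

```python
import torch

B, L, D = 2, 16, 8            # batch, sequence length, embedding dim
keep_ratio = 0.5              # keep n% of tokens
x = torch.randn(B, L, D)

n_keep = int(L * keep_ratio)
# a random permutation per batch item; the first n_keep positions are "kept"
ids = torch.argsort(torch.rand(B, L), dim=1)[:, :n_keep]          # [B, n_keep]
# gather the chosen tokens into a shortened sequence
x_short = torch.gather(x, 1, ids.unsqueeze(-1).expand(-1, -1, D)) # [B, n_keep, D]
```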