Birch-san


It's possible to install manually. Ignore `WixSetup.msi`. Use your favourite unarchiving utility to extract `cab1.cab`. You'll see the following files:

```
JUCE_framework_GPL3.txt
Steinberg_VST3_GPL3.txt
juicysfplugin.dll
juicysfplugin.exe
juicysfplugin.vst3
libFLAC_8.dll
libflac_New_BSD.txt
libfluidsynth_2.dll
libfluidsynth_LGPL_2.1.txt...
```

ah okay, that sounds problematic. will be sure to update dependencies next time I cut a release.

@mrbumpy409 Windows and macOS releases with latest FluidSynth now available: https://github.com/Birch-san/juicysfplugin/releases

I think the backwards pass may not require any changes. I followed [these steps](https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/#derivative-of-softmax) to compute the softmax derivative, with `1+∑` substituted into the softmax denominator, and still ended up...
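
To sketch the substitution I mean (my own working here, not taken from the linked post): with the extra 1 in the denominator,

$$
\sigma_i = \frac{e^{x_i}}{1 + \sum_k e^{x_k}},\qquad
\frac{\partial \sigma_i}{\partial x_j}
= \frac{\delta_{ij}\, e^{x_i}\bigl(1 + \sum_k e^{x_k}\bigr) - e^{x_i} e^{x_j}}{\bigl(1 + \sum_k e^{x_k}\bigr)^{2}}
= \sigma_i\,(\delta_{ij} - \sigma_j)
$$

which has the same form as the standard softmax Jacobian.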

thanks for thinking about it! I guess another consideration here is that the currently-foreseeable use-case of this is during training only, so optimization effort in the forward pass...

Thanks for explaining. I think the mask **can** be expressed as a coordinate mapping, yes. By that do you mean something like starting with a 2D BoolTensor then using .to_sparse()...
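
for concreteness, a minimal sketch of what I'm picturing (the mask shape here is made up):

```python
import torch

# "mask as coordinate mapping": a 2D BoolTensor converted to sparse COO,
# whose indices() give the coordinates of the kept positions
mask = torch.tensor([[True, False, True],
                     [False, True, False]])

coords = mask.to_sparse().indices()   # shape [2, nnz]: row/col of each True entry
# coords == tensor([[0, 0, 1],
#                   [0, 2, 1]])
```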

no worries, and congrats on your internship!

> any random / non-deterministic masking pattern

ah okay yeah, not possible with a MaskDiT-style mask. completely random.

> pretty much all throughout...

> * Given Q, K and V of shape `[B, *, heads, dim]`, and any valid kernel size, dilation, causal masking,
>
> * Take an additional optional boolean tensor,...

I think the random ordering is not a _desired_ property (they don't mention shuffling in [the paper](https://openreview.net/pdf?id=vTBjBtGioE)); you're right that it doesn't make a difference, so they went...

okay yeah, MDT uses [the same trick](https://github.com/sail-sg/MDT/blob/7d26c2162c462bd0b90f97f3a1c36cdaaac616ec/masked_diffusion/models.py#L465) of randomly picking n% of tokens, using torch.gather() on the chosen indices to create a shortened sequence. so arbitrary masks will be useful...
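
roughly the shape of that trick (a hedged sketch with made-up shapes, not MDT's actual code):

```python
import torch

# keep a random fraction of tokens, then gather them into a shortened sequence
B, T, D = 2, 16, 8            # batch, sequence length, channels
keep_ratio = 0.5              # fraction of tokens to keep
x = torch.randn(B, T, D)

n_keep = int(T * keep_ratio)
# per-sample random permutation; the first n_keep indices are the kept tokens
ids_shuffle = torch.argsort(torch.rand(B, T), dim=1)
ids_keep = ids_shuffle[:, :n_keep]                                   # [B, n_keep]

# gather the kept tokens into a shortened sequence of shape [B, n_keep, D]
x_short = torch.gather(x, dim=1, index=ids_keep.unsqueeze(-1).expand(-1, -1, D))
```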