zzlol63
As discussed on Discord, the issue appears specific to Windows, where the Torch SDP backend cannot use the native FlashAttention-2 based kernel because it isn't compiled with FlashAttention support in the...
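For anyone who wants to check their own build, here's a minimal sketch (the helper name `flash_sdp_available` is just illustrative, not OneTrainer code): it forces SDPA to use only the FlashAttention backend and catches the `RuntimeError` PyTorch raises when no kernel is available, which is what happens on the standard Windows wheels.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3

def flash_sdp_available(device="cuda", dtype=torch.float16):
    # Dummy (batch, heads, seqlen, head_dim) tensor; flash requires fp16/bf16 on CUDA.
    q = torch.randn(1, 8, 128, 64, device=device, dtype=dtype)
    try:
        # Restrict SDPA to the FlashAttention backend only; if this build
        # wasn't compiled with FlashAttention support, the call raises
        # RuntimeError ("No available kernel") instead of silently falling
        # back to the math/mem-efficient backends.
        with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
            F.scaled_dot_product_attention(q, q, q)
        return True
    except RuntimeError:
        return False

if __name__ == "__main__":
    print("FlashAttention SDP backend usable:", flash_sdp_available())
```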
Did some further testing, this time with a fresh dataset (with optional masks), running on a natively booted Linux distro with identical settings in OneTrainer, and came back with...
I ran a set of tests using FLUX.1-dev on the same dataset. I did post some numbers previously, but realised I had made a huge mistake where the FlexAttention backend wasn't...
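The comment above is truncated, so I won't guess what the exact mistake was, but one easy benchmarking pitfall with FlexAttention is worth flagging: calling it eagerly runs a slow reference path, so it has to go through `torch.compile` to get the fused kernel. A minimal sketch:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention  # PyTorch >= 2.5

# flex_attention is only fast when compiled; the eager call falls back to a
# slow reference implementation, which can silently skew timing comparisons
# against the SDP backends.
flex_attention_compiled = torch.compile(flex_attention)

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
out = flex_attention_compiled(q, q, q)  # (batch, heads, seqlen, head_dim)
```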
> So if, on Windows, the Torch SDP algorithm is much worse, the only alternative would be to use another external flash attention implementation. For example by using flash_attn (with...
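A rough sketch of what that could look like, assuming the external `flash-attn` package is installed (the wrapper function `attention` here is hypothetical, not how OneTrainer actually wires it up). Note that `flash_attn_func` expects `(batch, seqlen, heads, head_dim)` layout, unlike `F.scaled_dot_product_attention`'s `(batch, heads, seqlen, head_dim)`:

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func  # external flash-attn package
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

def attention(q, k, v):
    # q, k, v: (batch, heads, seqlen, head_dim), fp16/bf16 on CUDA
    if HAS_FLASH_ATTN:
        # flash_attn uses (batch, seqlen, heads, head_dim), so transpose
        # in and out of its layout.
        out = flash_attn_func(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
        )
        return out.transpose(1, 2)
    # Fall back to whatever SDP backend Torch picks on this platform.
    return F.scaled_dot_product_attention(q, k, v)
```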