illioren (2 comments)
> try adding `-fa` to enable flash attention? It could significantly reduce the `compute buffer size` when using a long context (at least on CUDA). I have the same issue (also running...
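For reference, a minimal sketch of such an invocation, assuming a recent llama.cpp build (the binary name, model path, and context size are placeholders; the exact spelling of the flash-attention flag can vary between llama.cpp versions):

```sh
# Hypothetical llama.cpp launch: -fa enables flash attention,
# -c sets the context window. The model path is a placeholder.
./llama-server -m ./models/model.gguf -c 31520 -fa
```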
The max context size seems to be **31520** (for both the 120b and 20b models: 31521 crashes, 31520 works...). No idea if this is significant... but it is 1248 less than...
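One way to reproduce that kind of boundary finding is a simple bisection over `-c` values. A sketch, assuming a local llama.cpp build (binary name and model path are placeholders; the "works" test here is just whether a one-token generation exits cleanly, which may differ from the original crash condition):

```sh
# Hypothetical bisection: find the largest context size (-c) that
# still runs successfully. All paths below are placeholders.
lo=1024; hi=65536
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  # -n 1 generates a single token; -p supplies a trivial prompt
  if ./llama-cli -m ./models/model.gguf -c $mid -fa -n 1 -p "hi" >/dev/null 2>&1; then
    lo=$mid   # this context size works, search higher
  else
    hi=$mid   # this context size fails, search lower
  fi
done
echo "max working context: $lo"
```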