illioren (2 comments)
> try adding `-fa` to enable flash attention? It could significantly reduce the `compute buffer size` when using a long context (at least on CUDA). I have the same issue (also running...
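For reference, a minimal sketch of such an invocation, assuming a recent llama.cpp build (the binary name, model path, and context size are placeholders; the exact spelling of the flash-attention flag can vary between llama.cpp versions):

```sh
# Hypothetical llama.cpp launch: -fa enables flash attention,
# -c sets the context window. The model path is a placeholder.
./llama-server -m ./models/model.gguf -c 31520 -fa
```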
The max context size seems to be **31520** (for both the 120b and 20b models: 31521 crashes, 31520 works...). No idea if this is significant... but it is 1248 less than...
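One way to reproduce that kind of boundary finding is a simple bisection over `-c` values. A sketch, assuming a local llama.cpp build (binary name and model path are placeholders; the "works" test here is just whether a one-token generation exits cleanly, which may differ from the original crash condition):

```sh
# Hypothetical bisection: find the largest context size (-c) that
# still runs successfully. All paths below are placeholders.
lo=1024; hi=65536
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  # -n 1 generates a single token; -p supplies a trivial prompt
  if ./llama-cli -m ./models/model.gguf -c $mid -fa -n 1 -p "hi" >/dev/null 2>&1; then
    lo=$mid   # this context size works, search higher
  else
    hi=$mid   # this context size fails, search lower
  fi
done
echo "max working context: $lo"
```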