Comments by Torsten Scholak

Thanks a lot for the comments, @jlamypoirier. I want to clarify that this proposal is not a variation of #155 or #168, and it is not about override machinery. The...

Ok, let me clarify where we're coming from:

- This proposal isn't trying to solve all future model configuration problems. It has a deliberately limited scope: **making heterogeneous block stacks...

Let's spell out why and when we would need that (sketch below):

- Some models we care about (only Qwen2 at this point) use windowed attention only in some layers but not...
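To make "heterogeneous block stacks" concrete, here is a minimal sketch of a per-block config in which only some layers use windowed attention. All class and field names below are hypothetical placeholders, not Fast-LLM's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BlockConfig:
    # None means full attention; an integer enables sliding-window
    # attention with that window size.
    window_size: Optional[int] = None

@dataclass
class BlockStackConfig:
    blocks: list[BlockConfig] = field(default_factory=list)

# A Qwen2-style heterogeneous stack: full attention in some blocks,
# windowed attention in the rest. Counts and sizes are illustrative only.
stack = BlockStackConfig(
    blocks=[BlockConfig() for _ in range(4)]
    + [BlockConfig(window_size=4096) for _ in range(24)]
)
```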

@jlamypoirier, it's more relevant than ever.

Appreciate you raising these concerns clearly, @jlamypoirier! The complexities and tradeoffs you're highlighting are worth keeping in mind as we implement. That said, I want to clarify our mindset here:...

I carefully re-read your comment. You are suggesting that instead of tolerating global batch size changes, we should aim to keep it constant by varying gradient accumulation steps, staying within...
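For concreteness, a minimal sketch of the invariant under discussion, assuming the usual decomposition global batch = micro-batch × gradient-accumulation steps × data-parallel size; the function name is hypothetical:

```python
def gradient_accumulation_steps(
    global_batch_size: int,
    micro_batch_size: int,
    data_parallel_size: int,
) -> int:
    """Choose the accumulation steps that keep the global batch size
    constant when the data-parallel size changes (e.g. after losing
    nodes), relying on:
        global_batch = micro_batch * grad_accum * data_parallel
    """
    per_step = micro_batch_size * data_parallel_size
    if global_batch_size % per_step != 0:
        # No exact solution: the caller must either tolerate a changed
        # global batch size or adjust the micro-batch size.
        raise ValueError(
            f"global batch {global_batch_size} is not divisible by "
            f"micro_batch * data_parallel = {per_step}"
        )
    return global_batch_size // per_step

# Example: a 512-sequence global batch with micro-batch 4 needs
# 4 steps on 32 GPUs and 8 steps on 16 GPUs, but 24 GPUs has no
# exact solution and raises.
```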

I've also checked how we can handle preemption cleanly. Good news: every batch system I know of (K8s, SLURM, NGC, etc.) sends a `SIGTERM` to the container before it reclaims the...
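As an illustration, a minimal sketch of the kind of handler this allows, assuming the training loop polls a flag at step boundaries (the trainer hook below is hypothetical):

```python
import signal

class PreemptionHandler:
    """Turns SIGTERM (sent by K8s/SLURM/NGC before reclaiming the node)
    into a flag the training loop can poll at step boundaries."""

    def __init__(self) -> None:
        self.preempted = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame) -> None:
        # Only set a flag here; checkpointing from inside a signal
        # handler is unsafe. The loop reacts at the next step boundary.
        self.preempted = True

handler = PreemptionHandler()
# In the training loop (sketch; `trainer` is hypothetical):
# for step in range(max_steps):
#     train_step()
#     if handler.preempted:
#         trainer.save_checkpoint()
#         break
```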

As a quick follow-up: I went back to #26 and also reviewed the Apriel-5B training logs to check checkpointing speeds under realistic conditions. During Apriel-5B training, all ranks were saving...

Thanks @jlamypoirier, agreed. We will encode the logic with the proposed threshold-based tolerance, default micro-batch and gradient-accumulation-step caps, and fallback strategies if limits are exceeded. We will also make sure...
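To make the agreed-upon logic concrete, a hedged sketch of threshold-based tolerance with a gradient-accumulation cap and a fallback; the default threshold, cap, and names are placeholders rather than the values we settled on:

```python
def resolve_batch_config(
    target_global_batch: int,
    micro_batch: int,
    data_parallel: int,
    max_grad_accum: int = 8,   # placeholder cap
    tolerance: float = 0.1,    # placeholder: accept <= 10% deviation
) -> tuple[int, int]:
    """Return (grad_accum, achieved_global_batch).

    Try to hold the global batch size constant by varying gradient
    accumulation; if no exact solution exists within the cap, fall back
    to the closest achievable size, but only within the tolerance.
    """
    best = None
    for grad_accum in range(1, max_grad_accum + 1):
        achieved = micro_batch * grad_accum * data_parallel
        deviation = abs(achieved - target_global_batch) / target_global_batch
        if best is None or deviation < best[0]:
            best = (deviation, grad_accum, achieved)
    deviation, grad_accum, achieved = best
    if deviation > tolerance:
        # Limits exceeded: surface the failure instead of silently
        # training with a very different global batch size.
        raise RuntimeError(
            f"no batch config within {tolerance:.0%} of "
            f"{target_global_batch} (closest achievable: {achieved})"
        )
    return grad_accum, achieved
```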

Can we use https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/ops/triton/cross_entropy.py in Fast-LLM directly?
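If the answer turns out to be yes, usage might look roughly like the following. This assumes the linked Triton kernel is reachable through flash-attn's drop-in `CrossEntropyLoss` wrapper; the import path and signature should be verified against the repo before relying on them:

```python
import torch

# Assumption: flash-attn is installed and exposes the Triton kernel via
# this drop-in replacement for torch.nn.CrossEntropyLoss.
from flash_attn.losses.cross_entropy import CrossEntropyLoss

loss_fn = CrossEntropyLoss(ignore_index=-100)

# Dummy shapes for illustration: 8 tokens, 32k-entry vocabulary.
logits = torch.randn(8, 32000, device="cuda", dtype=torch.float16)
labels = torch.randint(0, 32000, (8,), device="cuda")
loss = loss_fn(logits, labels)
```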