Comments by Torsten Scholak

Thanks a lot for the comments, @jlamypoirier. I want to clarify that this proposal is not a variation of #155 or #168, and it is not about override machinery. The...

Ok, let me clarify where we're coming from:

- This proposal isn't trying to solve all future model configuration problems. It has a deliberately limited scope: **making heterogeneous block stacks...

Let's spell out why and when we would need that (sketch below):

- Some models we care about (only Qwen2 at this point) use windowed attention only in some layers but not...
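To make "heterogeneous block stacks" concrete, here is a minimal sketch of a per-block config in which only some layers use windowed attention. All class and field names below are hypothetical placeholders, not Fast-LLM's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BlockConfig:
    # None means full attention; an integer enables sliding-window
    # attention with that window size.
    window_size: Optional[int] = None

@dataclass
class BlockStackConfig:
    blocks: list[BlockConfig] = field(default_factory=list)

# A Qwen2-style heterogeneous stack: full attention in some blocks,
# windowed attention in the rest. Counts and sizes are illustrative only.
stack = BlockStackConfig(
    blocks=[BlockConfig() for _ in range(4)]
    + [BlockConfig(window_size=4096) for _ in range(24)]
)
```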

@jlamypoirier, it's more relevant than ever.

Appreciate you raising these concerns clearly, @jlamypoirier! The complexities and tradeoffs you're highlighting are worth keeping in mind as we implement. That said, I want to clarify our mindset here:...

I carefully re-read your comment. You are suggesting that instead of tolerating global batch size changes, we should aim to keep it constant by varying gradient accumulation steps, staying within...
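For concreteness, a minimal sketch of the invariant under discussion, assuming the usual decomposition global batch = micro-batch × gradient-accumulation steps × data-parallel size; the function name is hypothetical:

```python
def gradient_accumulation_steps(
    global_batch_size: int,
    micro_batch_size: int,
    data_parallel_size: int,
) -> int:
    """Choose the accumulation steps that keep the global batch size
    constant when the data-parallel size changes (e.g. after losing
    nodes), relying on:
        global_batch = micro_batch * grad_accum * data_parallel
    """
    per_step = micro_batch_size * data_parallel_size
    if global_batch_size % per_step != 0:
        # No exact solution: the caller must either tolerate a changed
        # global batch size or adjust the micro-batch size.
        raise ValueError(
            f"global batch {global_batch_size} is not divisible by "
            f"micro_batch * data_parallel = {per_step}"
        )
    return global_batch_size // per_step

# Example: a 512-sequence global batch with micro-batch 4 needs
# 4 steps on 32 GPUs and 8 steps on 16 GPUs, but 24 GPUs has no
# exact solution and raises.
```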

I've also checked how we can handle preemption cleanly. Good news: every batch system I know of (K8s, SLURM, NGC, etc.) sends a `SIGTERM` to the container before it reclaims the...
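As an illustration, a minimal sketch of the kind of handler this allows, assuming the training loop polls a flag at step boundaries (the trainer hook below is hypothetical):

```python
import signal

class PreemptionHandler:
    """Turns SIGTERM (sent by K8s/SLURM/NGC before reclaiming the node)
    into a flag the training loop can poll at step boundaries."""

    def __init__(self) -> None:
        self.preempted = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame) -> None:
        # Only set a flag here; checkpointing from inside a signal
        # handler is unsafe. The loop reacts at the next step boundary.
        self.preempted = True

handler = PreemptionHandler()
# In the training loop (sketch; `trainer` is hypothetical):
# for step in range(max_steps):
#     train_step()
#     if handler.preempted:
#         trainer.save_checkpoint()
#         break
```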

As a quick follow-up: I went back to #26 and also reviewed the Apriel-5B training logs to check checkpointing speeds under realistic conditions. During Apriel-5B training, all ranks were saving...

Thanks @jlamypoirier, agreed. We will encode the logic with the proposed threshold-based tolerance, default micro-batch and gradient-accumulation-step caps, and fallback strategies if limits are exceeded. We will also make sure...
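To make the agreed-upon logic concrete, a hedged sketch of threshold-based tolerance with a gradient-accumulation cap and a fallback; the default threshold, cap, and names are placeholders rather than the values we settled on:

```python
def resolve_batch_config(
    target_global_batch: int,
    micro_batch: int,
    data_parallel: int,
    max_grad_accum: int = 8,   # placeholder cap
    tolerance: float = 0.1,    # placeholder: accept <= 10% deviation
) -> tuple[int, int]:
    """Return (grad_accum, achieved_global_batch).

    Try to hold the global batch size constant by varying gradient
    accumulation; if no exact solution exists within the cap, fall back
    to the closest achievable size, but only within the tolerance.
    """
    best = None
    for grad_accum in range(1, max_grad_accum + 1):
        achieved = micro_batch * grad_accum * data_parallel
        deviation = abs(achieved - target_global_batch) / target_global_batch
        if best is None or deviation < best[0]:
            best = (deviation, grad_accum, achieved)
    deviation, grad_accum, achieved = best
    if deviation > tolerance:
        # Limits exceeded: surface the failure instead of silently
        # training with a very different global batch size.
        raise RuntimeError(
            f"no batch config within {tolerance:.0%} of "
            f"{target_global_batch} (closest achievable: {achieved})"
        )
    return grad_accum, achieved
```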

Can we use https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/ops/triton/cross_entropy.py in Fast-LLM directly?
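If the answer turns out to be yes, usage might look roughly like the following. This assumes the linked Triton kernel is reachable through flash-attn's drop-in `CrossEntropyLoss` wrapper; the import path and signature should be verified against the repo before relying on them:

```python
import torch

# Assumption: flash-attn is installed and exposes the Triton kernel via
# this drop-in replacement for torch.nn.CrossEntropyLoss.
from flash_attn.losses.cross_entropy import CrossEntropyLoss

loss_fn = CrossEntropyLoss(ignore_index=-100)

# Dummy shapes for illustration: 8 tokens, 32k-entry vocabulary.
logits = torch.randn(8, 32000, device="cuda", dtype=torch.float16)
labels = torch.randint(0, 32000, (8,), device="cuda")
loss = loss_fn(logits, labels)
```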