Option to vary configuration parameters across layers
🎯 Goal (What & Why)
We have several use-cases for varying parameters across layers (#147, #153) and will likely have many more in the future.
The best and simplest way to implement this would be a per-layer override mechanism based on #154, e.g.:

```yaml
transformer:
  [...]
  window_size: 8192
  overrides:
    - layers: 0:24:2
      config:
        window_size: null
```
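For illustration, here is a minimal sketch of how such an override could be resolved into an effective per-layer config, assuming Python-slice-style layer ranges. The names (`LayerOverride`, `resolve_layer_config`) are hypothetical and not part of Fast-LLM or #154:

```python
# Hypothetical sketch of per-layer override resolution; not Fast-LLM code.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class LayerOverride:
    layers: str  # Python-style slice string, e.g. "0:24:2".
    config: dict[str, Any] = field(default_factory=dict)


def _layer_indices(slice_str: str, num_layers: int) -> range:
    """Turn a "start:stop:step" string into the matching layer indices."""
    parts = [int(p) if p else None for p in slice_str.split(":")]
    return range(num_layers)[slice(*parts)]


def resolve_layer_config(
    base: dict[str, Any], overrides: list[LayerOverride], layer: int, num_layers: int
) -> dict[str, Any]:
    """Return the effective config for one layer: base values, then overrides in order."""
    config = dict(base)
    for override in overrides:
        if layer in _layer_indices(override.layers, num_layers):
            config.update(override.config)
    return config


# Example matching the YAML above: drop windowed attention on even layers 0..22.
base = {"window_size": 8192}
overrides = [LayerOverride(layers="0:24:2", config={"window_size": None})]
print(resolve_layer_config(base, overrides, layer=2, num_layers=24))  # {'window_size': None}
print(resolve_layer_config(base, overrides, layer=3, num_layers=24))  # {'window_size': 8192}
```

Later overrides take precedence over earlier ones in this sketch, which keeps the semantics predictable when ranges overlap; the actual precedence rules would be settled in #154.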
🚀 Execution Plan
This is relatively simple to do once we have an override mechanism (#154).
Step 1: What is the smallest working version?
(Describe the simplest way to implement this feature with minimal effort.)
Step 2: What additional optimizations are possible (but optional)?
(List potential refinements that can be added in later PRs if needed.)
📌 Acceptance Criteria (Must-Haves for Completion)
- The feature must be functional and tested.
- The implementation must be documented in practical terms.
- The PR must include a performance/impact summary.
- No refactors unless directly necessary for feature completion.
🛠️ Project Management
- [ ] Assign the project to the Fast-LLM project.
- [ ] Set the `Estimate` field (in days) in the GitHub project.
- [ ] Use the `Size` field to categorize the PR size (Small/Medium/Large).
- [ ] Assign an owner when opening the issue.
Let's spell out why and when we would need this:
- Some models we care about (only Qwen2 at this point) use windowed attention in some layers but not throughout. This can be supported simply, as done in #157, but could eventually be generalized.
- Qwen2 also adds linear bias terms to q, k, and v, but does so consistently across all transformer blocks and layers, so it doesn't require different configurations across layers.
- We are interested in bringing SSM-transformer hybrids to Fast-LLM; #68 is only the beginning. We will eventually want to explore different stacks of SSM and transformer blocks, but that is a while off.
The conclusion is that there is no urgency to support this feature at this point.