
Option to vary configuration parameters across layers

Open jlamypoirier opened this issue 9 months ago • 1 comment

🎯 Goal (What & Why)

We have several use-cases for varying parameters across layers (#147, #153) and will likely have many more in the future.

The best and simplest way to implement this would be a per-layer override mechanism based on #154, e.g.:

transformer:
  [...]
  window_size: 8192
  overrides:
    - layers: 0:24:2
      config:
        window_size: null
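
A minimal sketch (not the actual Fast-LLM API) of how such a mechanism could resolve overrides into one effective config per layer at model-construction time; the helper names and the assumption that layer ranges use Python slice syntax ("start:stop:step") are illustrative only:

# Hypothetical sketch: expand a base config plus overrides into per-layer configs.
import copy

def parse_layer_range(spec, num_layers):
    # "0:24:2" -> [0, 2, ..., 22]; empty fields fall back to slice defaults.
    parts = [int(x) if x else None for x in str(spec).split(":")]
    return list(range(num_layers))[slice(*parts)]

def resolve_layer_configs(base_config, num_layers):
    # Start every layer from the base config, then apply overrides in order.
    base = {k: v for k, v in base_config.items() if k != "overrides"}
    layer_configs = [copy.deepcopy(base) for _ in range(num_layers)]
    for override in base_config.get("overrides", []):
        for index in parse_layer_range(override["layers"], num_layers):
            layer_configs[index].update(override["config"])
    return layer_configs

# The YAML example above, expressed as a dict:
configs = resolve_layer_configs(
    {
        "window_size": 8192,
        "overrides": [{"layers": "0:24:2", "config": {"window_size": None}}],
    },
    num_layers=24,
)
assert configs[0]["window_size"] is None and configs[1]["window_size"] == 8192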

🚀 Execution Plan

This is relatively simple to do once we have an override mechanism (#154).

Step 1: What is the smallest working version?

(Describe the simplest way to implement this feature with minimal effort.)

Step 2: What additional optimizations are possible (but optional)?

(List potential refinements that can be added in later PRs if needed.)

📌 Acceptance Criteria (Must-Haves for Completion)

  • The feature must be functional and tested.
  • The implementation must be documented in practical terms.
  • The PR must include a performance/impact summary.
  • No refactors unless directly necessary for feature completion.

🛠️ Project Management

  • [ ] Assign the project to the Fast-LLM project.
  • [ ] Set the Estimate field (in days) in the GitHub project.
  • [ ] Use the Size field to categorize the PR size (Small/Medium/Large).
  • [ ] Assign an owner when opening the issue.

jlamypoirier commented on Feb 19 '25, 21:02

Let's spell out why and when we would need this:

  • Some models we care about (only Qwen2 at this point) use windowed attention in some layers but not throughout. This can be supported simply, as done in #157, but could eventually be generalized.
  • Qwen2 also adds linear bias terms to q, k, and v, but does so consistently across all transformer blocks and layers, so this doesn't require different configurations across layers.
  • We are interested in bringing SSM-transformer hybrids to Fast-LLM; #68 is only the beginning. We will eventually want to explore different stacks of SSM and transformer blocks, but this is a while off.

So the conclusion is that there is no urgency to support this feature right now.

tscholak commented on Feb 20 '25, 15:02