
EEP: Allow per-question model over-rides

Open johnjosephhorton opened this issue 2 months ago • 1 comment

Notes:

Per-Question Model Assignment: Design Document

Date: 2025-01-23
Status: Planning Phase
Goal: Enable different questions within a survey to be answered by different language models

Background

Current Architecture

The EDSL framework currently assigns models at the Interview level:

# Current behavior: All questions use the same model
job = Jobs(survey=my_survey)
job.by(Model("gpt-4"))
results = job.run()  # All questions answered by GPT-4

How it works:

  1. Jobs (edsl/jobs/jobs.py): Creates cartesian product of agents × scenarios × models × surveys
  2. Interview (edsl/interviews/interview.py): Each interview = ONE agent + ONE scenario + ONE model + survey
  3. Invigilator (edsl/invigilators/invigilator_base.py): Administers each question using the interview's model
  4. The model flows: Jobs → Interview → Invigilator → Question answering

Key insight: The model is stored at the Interview level (self.model in interview.py:148), meaning all questions in a survey share the same model within an interview.

Desired Behavior

Enable users to specify different models for different questions:

# Desired: Different questions use different models
job = Jobs(survey=my_survey)
job.by(Model("gpt-3.5-turbo"))  # Default for most questions
job.set_question_models({
    "complex_analysis": Model("gpt-4"),           # Use powerful model for hard question
    "creative_writing": Model("claude-3-5-sonnet"), # Use Claude for creative task
    "simple_yes_no": Model("gpt-3.5-turbo")       # Fast model for simple question
})
results = job.run()

Use Cases

  1. Cost optimization: Use cheaper models for simple questions, expensive models for complex ones
  2. Specialized capabilities: Route questions to models with specific strengths
  3. Experimentation: Compare how different models answer the same question within a survey
  4. Compliance: Use specific models for regulated/audited questions

Proposed Solution: Jobs-Level Override

High-Level Design

Add a set_question_models() method to the Jobs class that allows mapping question names to specific models:

class Jobs:
    def __init__(self, survey, agents=None, models=None, scenarios=None):
        # ... existing code ...
        self._question_models = {}  # Dict[str, LanguageModel]

    def set_question_models(self, question_models: dict) -> "Jobs":
        """Assign specific models to specific questions."""
        # Validate and store
        self._question_models = question_models
        return self

    def set_question_model(self, question_name: str, model: LanguageModel) -> "Jobs":
        """Assign a specific model to a single question."""
        self._question_models[question_name] = model
        return self

Implementation Points

The change flows through the system:

  1. Jobs stores _question_models dict
  2. InterviewsConstructor passes question_models to Interview
  3. Interview stores both self.model (default) and self.question_models (overrides)
  4. FetchInvigilator selects appropriate model: question_models.get(q_name, default_model)
  5. Invigilator receives the correct model and uses it for that specific question
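
A minimal sketch of the selection step in items 4-5 above; the method name and the interview attribute are proposals from this document, not existing API:

class FetchInvigilator:
    """Sketch only: picks the model for each question (names are proposals)."""

    def __init__(self, interview):
        self.interview = interview

    def get_model_for_question(self, question):
        # A question-specific override wins; otherwise fall back to the interview's default model.
        overrides = getattr(self.interview, "question_models", None) or {}
        return overrides.get(question.question_name, self.interview.model)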

Why Jobs-Level?

Advantages:

  • ✅ Separation of concerns: Survey defines WHAT to ask, Jobs defines HOW to execute
  • ✅ Reusability: Same survey can be run with different model assignments
  • ✅ Natural fit: Jobs already manages agents, scenarios, and models
  • ✅ Clear semantics: Explicit about which questions get which models
  • ✅ Efficient: Doesn't multiply the number of interviews

Alternatives considered:

  • Question-level: Would require duplicating model in each question instance
  • Survey-level: Would tie execution details to survey structure
  • Model selector function: Would be hard to serialize and reason about

Open Design Questions

1. Interview Structure: Single vs. Multiple

Option A: Single Interview with Routing (Current proposal)

One interview per agent × scenario combination. The FetchInvigilator routes each question to its assigned model.

# 1 agent × 1 scenario × 1 default model = 1 interview
# But questions internally use different models
Interview(agent=a, scenario=s, model=default_model, question_models={...})

Pros:

  • Simpler conceptual model: 1 interview = 1 agent × 1 scenario
  • Results naturally grouped by agent/scenario
  • Fewer interviews to track and manage
  • Survey coherence maintained (skip logic, memory plan work naturally)

Cons:

  • Interview hash becomes more complex (must include question_models)
  • Results representation: What model to show at interview level?
  • Cache lookups happen per question rather than per interview (though this still works correctly)

Option B: Multiple Interviews (Split by Model)

Create separate interviews for each unique model used in the survey.

# If q1→model_a, q2→model_b, q3→model_a
# Create 2 interviews:
#   Interview 1: Survey([q1, q3]) with model_a
#   Interview 2: Survey([q2]) with model_b

Pros:

  • Clean separation: each interview = one model
  • Cache works naturally (interview-level caching)
  • Interview hash is simple
  • Results clearly show one model per interview

Cons:

  • Breaking change: Survey must be split into sub-surveys
  • Skip logic breaks: q2 might depend on q1's answer, but they're in different interviews
  • Memory plan coordination: Complex to manage across split surveys
  • ❌ More interviews to create and track

Recommendation: Option A (Single Interview with Routing)

  • Maintains survey coherence
  • Skip logic and memory plans work correctly
  • More intuitive for users
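
Under Option A, the Interview change itself stays small; a minimal sketch with a simplified constructor, where the question_models parameter is the proposed addition:

class Interview:
    def __init__(self, agent, scenario, model, survey, question_models=None):
        self.agent = agent
        self.scenario = scenario
        self.model = model  # default model, as today
        self.survey = survey
        # NEW: per-question overrides, e.g. {"complex_analysis": Model("gpt-4")}
        self.question_models = question_models or {}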

2. Results Representation

Current Results structure:

# One row per interview
# Columns: agent, scenario, model, + all question answers
results.select("model")  # Shows the interview's model
results.select("answer.q1")  # Shows q1's answer

Problem: If questions use different models, what goes in the model column?

Option A: Interview-level model + Question-level tracking

Keep existing structure, add question-level model information:

results.select("model")  # Shows default/primary model
results.select("answer.q1._model")  # Shows actual model used for q1
results.select("answer.q2._model")  # Shows actual model used for q2

Pros:

  • ✅ Backward compatible
  • ✅ Existing code works unchanged
  • ✅ Question-level details available when needed

Cons:

  • Model column might be misleading (not all questions used that model)
  • Need to document that the model column shows the default model, not necessarily the one used for every question

Option B: Explode to Question-Level Rows

# One row per question answered
# agent | scenario | question | model | answer

Pros:

  • Clear: each row shows exactly which model answered which question
  • Easy to filter/group by model

Cons:

  • Breaking change: Completely different results structure
  • ❌ Doesn't match current EDSL paradigm (interviews are atomic)
  • ❌ Would require major refactoring throughout codebase

Option C: Model Mapping in Metadata

Keep current structure, add metadata field:

result.model  # Default model
result.question_models  # Dict[str, str] mapping questions to models used

Pros:

  • Backward compatible
  • Explicit about what happened

Cons:

  • Information in two places (model vs question_models)
  • Requires accessing metadata to understand full picture

Recommendation: Option A (Question-level tracking)

  • Store model info with each answer: answer.{question_name}._model
  • Keep model column for default/primary model
  • Document clearly
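
For example, per-question models stay easy to query under Option A; a usage sketch in which the _model sub-field is the proposed addition and the filter expression is purely illustrative:

# Default/primary model for the interview (existing column) plus the proposed per-question field
results.select("model", "answer.q1", "answer.q1._model")

# Illustrative: keep only rows where q1 was actually answered by gpt-4
gpt4_rows = results.filter("answer.q1._model == 'gpt-4'")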

3. Model List Management

Jobs tracks models in self.models (ModelList) for:

  • Creating bucket collections (rate limiting)
  • Generating interviews (cartesian product)
  • Cost estimation
  • Reporting

Current:

job.by(Model("gpt-4"))  # self.models = [gpt-4]
len(job)  # Number of interviews = agents × scenarios × models

With question-specific models:

job.by(Model("gpt-3.5-turbo"))  # self.models = [gpt-3.5-turbo]
job.set_question_models({
    "q1": Model("gpt-4"),
    "q2": Model("claude-3-5-sonnet")
})
# Now we have 3 models total: gpt-3.5-turbo (default), gpt-4, claude

Question: Should self.models include override models?

Option A: Merge into self.models

def set_question_models(self, question_models):
    self._question_models = question_models
    # Auto-add override models to self.models
    for model in question_models.values():
        if model not in self.models:
            self.models.append(model)

Pros:

  • Bucket collection automatically includes all models
  • job.models shows all models that will be used
  • Single source of truth

Cons:

  • len(job) would change (includes override models in count)
  • Confusing: job.by(model) creates new interviews, but set_question_models() doesn't?
  • Affects cartesian product in interview generation

Option B: Keep Separate

def set_question_models(self, question_models):
    self._question_models = question_models  # Separate storage

def create_bucket_collection(self):
    # Combine both when needed
    all_models = set(self.models) | set(self._question_models.values())
    return BucketCollection.from_models(all_models)

Pros:

  • Clear separation: self.models for interview generation, _question_models for overrides
  • len(job) unchanged
  • No confusion about interview multiplication

Cons:

  • Models in two places
  • Need to remember to check both when working with models
  • Need custom logic in several places (bucket collection, cost estimation, etc.)

Recommendation: Option B (Keep Separate)

  • Clearer semantics
  • Doesn't affect interview count calculations
  • Explicit about override nature

4. Caching Implications

Cache key currently includes: agent + scenario + model + question + survey_context

Scenario:

# Job A: Use gpt-4 as default, override q1 to gpt-3.5
job_a.by(Model("gpt-4"))
job_a.set_question_model("q1", Model("gpt-3.5-turbo"))

# Job B: Use gpt-3.5 as default
job_b.by(Model("gpt-3.5-turbo"))

Question: Should q1 in Job A and Job B share the same cache entry?

Answer: Yes, and it will work correctly because:

  • Invigilator receives the actual model used (gpt-3.5 in both cases)
  • Cache key is generated with that model
  • Cache lookup/storage works at the invigilator level

Example validation:

# Job 1: Run all questions with gpt-4
job1.by(Model("gpt-4"))
results1 = job1.run()  # All questions cached with gpt-4

# Job 2: Override q2 to gpt-3.5
job2.by(Model("gpt-4"))
job2.set_question_model("q2", Model("gpt-3.5-turbo"))
results2 = job2.run()
# Expected: q1, q3 hit cache (gpt-4), q2 misses (different model)

Verification: This should work correctly with our design because FetchInvigilator passes the question-specific model to the invigilator before cache lookup.

No changes needed to caching logic ✓


5. Interaction with Jobs.by()

The by() method detects object type and routes appropriately:

job.by(Model("gpt-4"))  # Adds to self.models
job.by(Agent(traits={}))  # Adds to self.agents
job.by(Scenario({}))  # Adds to self.scenarios

Should we support automatic detection of question-model dicts?

Option A: Magic detection

job.by({"q1": Model("gpt-4"), "q2": Model("claude")})
# Automatically calls set_question_models()

Pros:

  • Consistent with by() pattern
  • One method for everything

Cons:

  • Less explicit
  • Dict could be confused with Scenario
  • Harder to document/understand

Option B: Explicit method

job.set_question_models({"q1": Model("gpt-4"), "q2": Model("claude")})

Pros:

  • Clear and explicit
  • No ambiguity
  • Easier to document

Cons:

  • Different pattern than by()

Recommendation: Option B (Explicit)

  • Clarity over cleverness
  • Different enough from by() to warrant separate method

6. Cost Estimation

Current implementation in jobs_pricing_estimation.py assumes all questions use the same model.

With question-specific models, need to calculate per-question:

def estimate_job_cost(self, iterations=1):
    total_cost = 0
    for interview in self.interviews():
        for question in self.survey.questions:
            # Get model for this question
            model = interview.question_models.get(
                question.question_name,
                interview.model
            )
            # Estimate cost for this specific question + model
            question_cost = estimate_question_cost(question, model, ...)
            total_cost += question_cost * iterations
    return total_cost

Implementation needed:

  1. Update estimate_job_cost() to check for question-specific models
  2. Update estimate_prompt_cost() to accept per-question model info
  3. Ensure token counting uses correct model's tokenizer

Files to modify:

  • edsl/jobs/jobs_pricing_estimation.py

7. Validation and Error Handling

A. Question name doesn't exist

job.set_question_model("nonexistent_question", model)

Options:

  • Raise immediately: ValueError("Question 'nonexistent_question' not found in survey")
  • Wait until run(): Defer validation

Recommendation: Raise immediately for fast feedback

Implementation:

def set_question_models(self, question_models):
    # Validate all question names exist
    invalid = set(question_models.keys()) - set(self.survey.question_names)
    if invalid:
        raise ValueError(
            f"Questions not found in survey: {invalid}. "
            f"Available: {self.survey.question_names}"
        )
    self._question_models = question_models
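
With the validation sketch above in place, a typo fails fast at assignment time rather than at run() (question names reused from the earlier example):

job = Jobs(survey=my_survey).by(Model("gpt-3.5-turbo"))
job.set_question_models({"complx_analysis": Model("gpt-4")})
# ValueError: Questions not found in survey: {'complx_analysis'}.
#             Available: ['complex_analysis', 'creative_writing', 'simple_yes_no']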

B. Model not in self.models

job.by(Model("gpt-3.5-turbo"))  # Only this in self.models
job.set_question_model("q1", Model("gpt-4"))  # Different model

Options:

  • Require: Must add all models via by() first
  • Allow: Auto-add to bucket collection when needed

Recommendation: Allow

  • More flexible
  • User intent is clear
  • Just ensure bucket collection includes it

C. All questions overridden (no default model used)

job = Jobs(survey)  # No default model
job.set_question_models({
    q.question_name: some_model
    for q in survey.questions
})  # All questions covered

Options:

  • Require: Must call by(model) to set default
  • Allow: If all questions covered, default not needed

Recommendation: Allow

  • If all questions have explicit assignments, default isn't used
  • A default model is still needed to construct each Interview, though
  • Best of both: require a default model, but it's fine if it goes unused

D. Multiple default models with overrides

job.by([Model("gpt-3.5"), Model("gpt-4")])  # 2 default models
job.set_question_models({"q1": Model("claude")})
# Creates 2 interviews:
#   Interview 1: default=gpt-3.5, q1=claude
#   Interview 2: default=gpt-4, q1=claude

Question: Is this confusing? Should we allow it?

Recommendation: Allow

  • Consistent with existing Jobs behavior
  • Overrides are overrides regardless of default
  • Might be useful for experiments

8. Documentation and User Mental Model

How should we explain this feature to users?

Option 1: "Override" Framing

By default, all questions in an interview use the model(s) specified with job.by(model). You can override specific questions to use different models with set_question_models().

Option 2: "Assignment" Framing

Assign specific models to specific questions using set_question_models(). Questions without explicit assignments use the default model(s) from job.by(model).

Option 3: "Routing" Framing

Jobs routes each question to its assigned model. Configure routing with set_question_models(). Questions without explicit routing use the interview's default model.

Recommendation: Option 1 ("Override")

  • Matches existing mental model (by() is primary)
  • Clearest about precedence
  • "Override" conveys temporary/exceptional nature

Documentation structure:

Per-Question Model Assignment

Basic Usage

By default, all questions use the model specified in by():

job = Jobs(survey).by(Model("gpt-3.5-turbo"))
# All questions use gpt-3.5-turbo

Overriding Specific Questions

Use set_question_models() to assign different models to specific questions:

job = Jobs(survey)
job.by(Model("gpt-3.5-turbo"))  # Default for all questions
job.set_question_models({
    "complex_question": Model("gpt-4"),  # Override this one
    "creative_question": Model("claude-3-5-sonnet"),  # And this one
})

Priority

Question-specific models always take priority over defaults:

  1. Check question_models for question-specific assignment
  2. Fall back to interview's default model

Use Cases

  • Cost optimization: Cheap models for simple questions, expensive for complex
  • Specialized capabilities: Route to models with specific strengths
  • A/B testing: Compare models within the same survey

9. Serialization

Jobs must be serializable (save/load via to_dict/from_dict).

Add to to_dict():
def to_dict(self, add_edsl_version=True):
    d = {
        "survey": self.survey.to_dict(),
        "agents": [...],
        "models": [...],
        "scenarios": [...],
    }

    # NEW: Serialize question_models if present
    if self._question_models:
        d["question_models"] = {
            qname: model.to_dict(add_edsl_version=add_edsl_version)
            for qname, model in self._question_models.items()
        }

    return d

Add to from_dict():

@classmethod
def from_dict(cls, data):
    job = cls(
        survey=Survey.from_dict(data["survey"]),
        agents=[...],
        models=[...],
        scenarios=[...],
    )

    # NEW: Restore question_models if present
    if "question_models" in data:
        from ..language_models import LanguageModel
        job._question_models = {
            qname: LanguageModel.from_dict(model_dict)
            for qname, model_dict in data["question_models"].items()
        }

    return job

10. Bucket Collection (Rate Limiting)

Bucket collection manages API rate limits per model.

Current:

def create_bucket_collection(self):
    return BucketCollection.from_models(self.models)

With question-specific models:

def create_bucket_collection(self):
    # Include all models: defaults + overrides
    all_models = set(self.models) | set(self._question_models.values())
    return BucketCollection.from_models(all_models)

Location: edsl/jobs/jobs.py:759-784


Summary of Recommendations

| Decision Point | Recommendation | Rationale |
|---|---|---|
| Interview structure | Single interview with routing | Maintains survey coherence, simpler |
| Results representation | Question-level tracking in answers | Backward compatible, detailed when needed |
| Model list management | Keep separate (self.models vs _question_models) | Clear semantics, no interview count confusion |
| Caching | No changes needed | Works correctly as-is |
| API design | Explicit set_question_models() method | Clear and unambiguous |
| Cost estimation | Update to check per-question models | Accurate cost calculation |
| Validation | Immediate (at set time) | Fast feedback to user |
| Allow override all | Yes, but still require default model | Flexible, but safe |
| Documentation | "Override" framing | Matches existing mental model |
| Serialization | Add question_models to to_dict/from_dict | Persistence support |
| Bucket collection | Include all models (defaults + overrides) | Proper rate limiting |

Implementation Checklist

When implementation begins, modify these files:

Core Implementation

  • [ ] edsl/jobs/jobs.py

    • [ ] Add _question_models attribute to __init__
    • [ ] Add set_question_models() method
    • [ ] Add set_question_model() convenience method
    • [ ] Update create_bucket_collection() to include override models
    • [ ] Update to_dict() to serialize question_models
    • [ ] Update from_dict() to deserialize question_models
  • [ ] edsl/jobs/jobs_interview_constructor.py

    • [ ] Pass question_models to Interview constructor
  • [ ] edsl/interviews/interview.py

    • [ ] Add question_models parameter to __init__
    • [ ] Store as instance attribute
  • [ ] edsl/jobs/fetch_invigilator.py

    • [ ] Add get_model_for_question() method
    • [ ] Update get_invigilator() to use question-specific model

Cost Estimation

  • [ ] edsl/jobs/jobs_pricing_estimation.py
    • [ ] Update estimate_job_cost() to check per-question models
    • [ ] Ensure correct tokenizer used per model

Results Tracking

  • [ ] edsl/results/*.py
    • [ ] Store model info with each answer
    • [ ] Add answer.{question_name}._model field
    • [ ] Update documentation

Testing

  • [ ] Unit tests for set_question_models()
  • [ ] Integration test: simple override scenario
  • [ ] Integration test: all questions overridden
  • [ ] Integration test: multiple agents × scenarios × mixed models
  • [ ] Test serialization/deserialization
  • [ ] Test cost estimation with mixed models
  • [ ] Test caching behavior
  • [ ] Test bucket collection includes all models
  • [ ] Test validation (invalid question names)

Documentation

  • [ ] Update Jobs documentation
  • [ ] Add usage examples
  • [ ] Update cost estimation docs
  • [ ] Add FAQ about model selection
  • [ ] Update tutorial/cookbook examples

Open Questions for Discussion

  1. Default model requirement: Should we require a default model even if all questions have overrides?

    • Current thinking: Yes, because Interview still needs a model attribute
  2. Results API: Should we add a convenience method like results.get_model_for_question(question_name) or is result.answer.{q}._model sufficient?

  3. Validation timing: Should we also validate at run() time (in case survey changes), or only at set time?

  4. Multiple calls: What happens with:

    job.set_question_models({"q1": model_a})
    job.set_question_models({"q2": model_b})  # Replaces or merges?
    

    Current thinking: Replace (like assignment). Add separate add_question_model() if merging needed.
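
    An illustrative comparison of the two behaviors, based on the earlier sketches (set_question_models replaces the whole mapping; set_question_model updates one entry in place):

    # Replace semantics (proposed): the second call overwrites the whole mapping
    job.set_question_models({"q1": model_a})
    job.set_question_models({"q2": model_b})
    # job._question_models == {"q2": model_b}

    # Merging via the single-question setter, which updates the dict in place
    job.set_question_model("q1", model_a)
    job.set_question_model("q2", model_b)
    # job._question_models == {"q1": model_a, "q2": model_b}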

  5. Type hints: Should question_models accept Union[LanguageModel, str] where str is model name?

    job.set_question_models({
        "q1": "gpt-4",  # Automatic Model creation?
        "q2": Model("claude-3-5-sonnet"),
    })
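
    If strings were accepted, set_question_models() could normalize them up front; a minimal sketch, assuming Model(name) constructs a LanguageModel from a model name as in the examples above:

    def set_question_models(self, question_models: dict) -> "Jobs":
        normalized = {
            # Accept either a LanguageModel instance or a bare model-name string.
            qname: Model(m) if isinstance(m, str) else m
            for qname, m in question_models.items()
        }
        self._question_models = normalized
        return self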
    

Next Steps

  1. Review this document with stakeholders/team
  2. Make decisions on open questions
  3. Create implementation branch
  4. Write tests first (TDD approach)
  5. Implement core functionality
  6. Update documentation
  7. Code review and iterate
  8. Merge to main

Related Files and Context

Key Files to Study

  • edsl/jobs/jobs.py (lines 110-164): Jobs.__init__ and model management
  • edsl/jobs/jobs_interview_constructor.py (lines 29-92): Interview creation
  • edsl/interviews/interview.py (lines 99-148): Interview.__init__
  • edsl/jobs/fetch_invigilator.py (lines 67-86): Invigilator creation with model
  • edsl/invigilators/invigilator_base.py (lines 60-126): How invigilators use models

Architecture Flow

User Code
  ↓
Jobs.by(model) → stores in self.models
Jobs.set_question_models({...}) → stores in self._question_models
  ↓
Jobs.run()
  ↓
Jobs.generate_interviews()
  ↓
InterviewsConstructor.create_interviews()
  ↓
Interview.__init__(model, question_models)
  ↓
Interview.async_conduct_interview()
  ↓
FetchInvigilator.get_invigilator(question)
  ↓ [NEW: Check question_models here]
FetchInvigilator.get_model_for_question(question)
  ↓
InvigilatorBase.__init__(model=selected_model)
  ↓
Model.async_execute_model_call()

Testing Strategy

Unit Tests:

  • set_question_models() validation
  • set_question_model() single assignment
  • get_model_for_question() lookup logic
  • Serialization round-trip
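
A possible shape for the round-trip test (pytest style; the .model attribute on the restored LanguageModel is an assumption):

def test_question_models_round_trip():
    job = Jobs(survey=my_survey).by(Model("gpt-3.5-turbo"))
    job.set_question_models({"complex_analysis": Model("gpt-4")})

    restored = Jobs.from_dict(job.to_dict())

    assert set(restored._question_models) == {"complex_analysis"}
    assert restored._question_models["complex_analysis"].model == "gpt-4"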

Integration Tests:

  • Full job run with mixed models
  • Cost estimation accuracy
  • Cache hit/miss behavior
  • Results structure with question-level models

Edge Cases:

  • Empty question_models
  • All questions overridden
  • Invalid question names
  • Model not in models list
  • Multiple calls to set_question_models()

Version History

  • 2025-01-23: Initial design document created during planning phase

johnjosephhorton · Oct 23 '25

@johnjosephhorton Can I work on this?

AlanPonnachan avatar Oct 27 '25 11:10 AlanPonnachan