
EEP: Allow per-question model over-rides

Open johnjosephhorton opened this issue 2 months ago • 1 comment

Notes:

Per-Question Model Assignment: Design Document

Date: 2025-01-23
Status: Planning Phase
Goal: Enable different questions within a survey to be answered by different language models

Background

Current Architecture

The EDSL framework currently assigns models at the Interview level:

# Current behavior: All questions use the same model
job = Jobs(survey=my_survey)
job.by(Model("gpt-4"))
results = job.run()  # All questions answered by GPT-4

How it works:

  1. Jobs (edsl/jobs/jobs.py): Creates cartesian product of agents × scenarios × models × surveys
  2. Interview (edsl/interviews/interview.py): Each interview = ONE agent + ONE scenario + ONE model + survey
  3. Invigilator (edsl/invigilators/invigilator_base.py): Administers each question using the interview's model
  4. The model flows: Jobs → Interview → Invigilator → Question answering

Key insight: The model is stored at the Interview level (self.model in interview.py:148), meaning all questions in a survey share the same model within an interview.

Desired Behavior

Enable users to specify different models for different questions:

# Desired: Different questions use different models
job = Jobs(survey=my_survey)
job.by(Model("gpt-3.5-turbo"))  # Default for most questions
job.set_question_models({
    "complex_analysis": Model("gpt-4"),           # Use powerful model for hard question
    "creative_writing": Model("claude-3-5-sonnet"), # Use Claude for creative task
    "simple_yes_no": Model("gpt-3.5-turbo")       # Fast model for simple question
})
results = job.run()

Use Cases

  1. Cost optimization: Use cheaper models for simple questions, expensive models for complex ones
  2. Specialized capabilities: Route questions to models with specific strengths
  3. Experimentation: Compare how different models answer the same question within a survey
  4. Compliance: Use specific models for regulated/audited questions

Proposed Solution: Jobs-Level Override

High-Level Design

Add a set_question_models() method to the Jobs class that allows mapping question names to specific models:

class Jobs:
    def __init__(self, survey, agents=None, models=None, scenarios=None):
        # ... existing code ...
        self._question_models = {}  # Dict[str, LanguageModel]

    def set_question_models(self, question_models: dict) -> "Jobs":
        """Assign specific models to specific questions."""
        # Validate and store
        self._question_models = question_models
        return self

    def set_question_model(self, question_name: str, model: LanguageModel) -> "Jobs":
        """Assign a specific model to a single question."""
        self._question_models[question_name] = model
        return self

Implementation Points

The change flows through the system:

  1. Jobs stores _question_models dict
  2. InterviewsConstructor passes question_models to Interview
  3. Interview stores both self.model (default) and self.question_models (overrides)
  4. FetchInvigilator selects appropriate model: question_models.get(q_name, default_model)
  5. Invigilator receives the correct model and uses it for that specific question
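
A minimal sketch of the selection step in items 4-5 above; the method name and the interview attribute are proposals from this document, not existing API:

class FetchInvigilator:
    """Sketch only: picks the model for each question (names are proposals)."""

    def __init__(self, interview):
        self.interview = interview

    def get_model_for_question(self, question):
        # A question-specific override wins; otherwise fall back to the interview's default model.
        overrides = getattr(self.interview, "question_models", None) or {}
        return overrides.get(question.question_name, self.interview.model)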

Why Jobs-Level?

Advantages:

  • ✅ Separation of concerns: Survey defines WHAT to ask, Jobs defines HOW to execute
  • ✅ Reusability: Same survey can be run with different model assignments
  • ✅ Natural fit: Jobs already manages agents, scenarios, and models
  • ✅ Clear semantics: Explicit about which questions get which models
  • ✅ Efficient: Doesn't multiply the number of interviews

Alternatives considered:

  • Question-level: Would require duplicating model in each question instance
  • Survey-level: Would tie execution details to survey structure
  • Model selector function: Would be hard to serialize and reason about

Open Design Questions

1. Interview Structure: Single vs. Multiple

Option A: Single Interview with Routing (Current proposal)

One interview per agent × scenario combination. The FetchInvigilator routes each question to its assigned model.

# 1 agent × 1 scenario × 1 default model = 1 interview
# But questions internally use different models
Interview(agent=a, scenario=s, model=default_model, question_models={...})

Pros:

  • Simpler conceptual model: 1 interview = 1 agent × 1 scenario
  • Results naturally grouped by agent/scenario
  • Fewer interviews to track and manage
  • Survey coherence maintained (skip logic, memory plan work naturally)

Cons:

  • Interview hash becomes more complex (must include question_models)
  • Results representation: What model to show at interview level?
  • Cache lookups happen per question rather than per interview (though this still works correctly)

Option B: Multiple Interviews (Split by Model)

Create separate interviews for each unique model used in the survey.

# If q1→model_a, q2→model_b, q3→model_a
# Create 2 interviews:
#   Interview 1: Survey([q1, q3]) with model_a
#   Interview 2: Survey([q2]) with model_b

Pros:

  • Clean separation: each interview = one model
  • Cache works naturally (interview-level caching)
  • Interview hash is simple
  • Results clearly show one model per interview

Cons:

  • Breaking change: Survey must be split into sub-surveys
  • Skip logic breaks: q2 might depend on q1's answer, but they're in different interviews
  • Memory plan coordination: Complex to manage across split surveys
  • ❌ More interviews to create and track

Recommendation: Option A (Single Interview with Routing)

  • Maintains survey coherence
  • Skip logic and memory plans work correctly
  • More intuitive for users
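
Under Option A, the Interview change itself stays small; a minimal sketch with a simplified constructor, where the question_models parameter is the proposed addition:

class Interview:
    def __init__(self, agent, scenario, model, survey, question_models=None):
        self.agent = agent
        self.scenario = scenario
        self.model = model  # default model, as today
        self.survey = survey
        # NEW: per-question overrides, e.g. {"complex_analysis": Model("gpt-4")}
        self.question_models = question_models or {}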

2. Results Representation

Current Results structure:

# One row per interview
# Columns: agent, scenario, model, + all question answers
results.select("model")  # Shows the interview's model
results.select("answer.q1")  # Shows q1's answer

Problem: If questions use different models, what goes in the model column?

Option A: Interview-level model + Question-level tracking

Keep existing structure, add question-level model information:

results.select("model")  # Shows default/primary model
results.select("answer.q1._model")  # Shows actual model used for q1
results.select("answer.q2._model")  # Shows actual model used for q2

Pros:

  • ✅ Backward compatible
  • ✅ Existing code works unchanged
  • ✅ Question-level details available when needed

Cons:

  • Model column might be misleading (not all questions used that model)
  • Need to document that the model column shows the default model, not necessarily the one used for every question

Option B: Explode to Question-Level Rows

# One row per question answered
# agent | scenario | question | model | answer

Pros:

  • Clear: each row shows exactly which model answered which question
  • Easy to filter/group by model

Cons:

  • Breaking change: Completely different results structure
  • ❌ Doesn't match current EDSL paradigm (interviews are atomic)
  • ❌ Would require major refactoring throughout codebase

Option C: Model Mapping in Metadata

Keep current structure, add metadata field:

result.model  # Default model
result.question_models  # Dict[str, str] mapping questions to models used

Pros:

  • Backward compatible
  • Explicit about what happened

Cons:

  • Information in two places (model vs question_models)
  • Requires accessing metadata to understand full picture

Recommendation: Option A (Question-level tracking)

  • Store model info with each answer: answer.{question_name}._model
  • Keep model column for default/primary model
  • Document clearly
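
For example, per-question models stay easy to query under Option A; a usage sketch in which the _model sub-field is the proposed addition and the filter expression is purely illustrative:

# Default/primary model for the interview (existing column) plus the proposed per-question field
results.select("model", "answer.q1", "answer.q1._model")

# Illustrative: keep only rows where q1 was actually answered by gpt-4
gpt4_rows = results.filter("answer.q1._model == 'gpt-4'")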

3. Model List Management

Jobs tracks models in self.models (ModelList) for:

  • Creating bucket collections (rate limiting)
  • Generating interviews (cartesian product)
  • Cost estimation
  • Reporting

Current:

job.by(Model("gpt-4"))  # self.models = [gpt-4]
len(job)  # Number of interviews = agents × scenarios × models

With question-specific models:

job.by(Model("gpt-3.5-turbo"))  # self.models = [gpt-3.5-turbo]
job.set_question_models({
    "q1": Model("gpt-4"),
    "q2": Model("claude-3-5-sonnet")
})
# Now we have 3 models total: gpt-3.5-turbo (default), gpt-4, claude

Question: Should self.models include override models?

Option A: Merge into self.models

def set_question_models(self, question_models):
    self._question_models = question_models
    # Auto-add override models to self.models
    for model in question_models.values():
        if model not in self.models:
            self.models.append(model)

Pros:

  • Bucket collection automatically includes all models
  • job.models shows all models that will be used
  • Single source of truth

Cons:

  • len(job) would change (includes override models in count)
  • Confusing: job.by(model) creates new interviews, but set_question_models() doesn't?
  • Affects cartesian product in interview generation

Option B: Keep Separate

def set_question_models(self, question_models):
    self._question_models = question_models  # Separate storage

def create_bucket_collection(self):
    # Combine both when needed
    all_models = set(self.models) | set(self._question_models.values())
    return BucketCollection.from_models(all_models)

Pros:

  • Clear separation: self.models for interview generation, _question_models for overrides
  • len(job) unchanged
  • No confusion about interview multiplication

Cons:

  • Models in two places
  • Need to remember to check both when working with models
  • Need custom logic in several places (bucket collection, cost estimation, etc.)

Recommendation: Option B (Keep Separate)

  • Clearer semantics
  • Doesn't affect interview count calculations
  • Explicit about override nature

4. Caching Implications

Cache key currently includes: agent + scenario + model + question + survey_context

Scenario:

# Job A: Use gpt-4 as default, override q1 to gpt-3.5
job_a.by(Model("gpt-4"))
job_a.set_question_model("q1", Model("gpt-3.5-turbo"))

# Job B: Use gpt-3.5 as default
job_b.by(Model("gpt-3.5-turbo"))

Question: Should q1 in Job A and Job B share the same cache entry?

Answer: Yes, and it will work correctly because:

  • Invigilator receives the actual model used (gpt-3.5 in both cases)
  • Cache key is generated with that model
  • Cache lookup/storage works at the invigilator level

Example validation:

# Job 1: Run all questions with gpt-4
job1.by(Model("gpt-4"))
results1 = job1.run()  # All questions cached with gpt-4

# Job 2: Override q2 to gpt-3.5
job2.by(Model("gpt-4"))
job2.set_question_model("q2", Model("gpt-3.5-turbo"))
results2 = job2.run()
# Expected: q1, q3 hit cache (gpt-4), q2 misses (different model)

Verification: This should work correctly with our design because FetchInvigilator passes the question-specific model to the invigilator before cache lookup.

No changes needed to caching logic ✓


5. Interaction with Jobs.by()

The by() method detects object type and routes appropriately:

job.by(Model("gpt-4"))  # Adds to self.models
job.by(Agent(traits={}))  # Adds to self.agents
job.by(Scenario({}))  # Adds to self.scenarios

Should we support automatic detection of question-model dicts?

Option A: Magic detection

job.by({"q1": Model("gpt-4"), "q2": Model("claude")})
# Automatically calls set_question_models()

Pros:

  • Consistent with by() pattern
  • One method for everything

Cons:

  • Less explicit
  • Dict could be confused with Scenario
  • Harder to document/understand

Option B: Explicit method

job.set_question_models({"q1": Model("gpt-4"), "q2": Model("claude")})

Pros:

  • Clear and explicit
  • No ambiguity
  • Easier to document

Cons:

  • Different pattern than by()

Recommendation: Option B (Explicit)

  • Clarity over cleverness
  • Different enough from by() to warrant separate method

6. Cost Estimation

Current implementation in jobs_pricing_estimation.py assumes all questions use the same model.

With question-specific models, need to calculate per-question:

def estimate_job_cost(self, iterations=1):
    total_cost = 0
    for interview in self.interviews():
        for question in self.survey.questions:
            # Get model for this question
            model = interview.question_models.get(
                question.question_name,
                interview.model
            )
            # Estimate cost for this specific question + model
            question_cost = estimate_question_cost(question, model, ...)
            total_cost += question_cost * iterations
    return total_cost

Implementation needed:

  1. Update estimate_job_cost() to check for question-specific models
  2. Update estimate_prompt_cost() to accept per-question model info
  3. Ensure token counting uses correct model's tokenizer

Files to modify:

  • edsl/jobs/jobs_pricing_estimation.py

7. Validation and Error Handling

A. Question name doesn't exist

job.set_question_model("nonexistent_question", model)

Options:

  • Raise immediately: ValueError("Question 'nonexistent_question' not found in survey")
  • Wait until run(): Defer validation

Recommendation: Raise immediately for fast feedback

Implementation:

def set_question_models(self, question_models):
    # Validate all question names exist
    invalid = set(question_models.keys()) - set(self.survey.question_names)
    if invalid:
        raise ValueError(
            f"Questions not found in survey: {invalid}. "
            f"Available: {self.survey.question_names}"
        )
    self._question_models = question_models
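
With the validation sketch above in place, a typo fails fast at assignment time rather than at run() (question names reused from the earlier example):

job = Jobs(survey=my_survey).by(Model("gpt-3.5-turbo"))
job.set_question_models({"complx_analysis": Model("gpt-4")})
# ValueError: Questions not found in survey: {'complx_analysis'}.
#             Available: ['complex_analysis', 'creative_writing', 'simple_yes_no']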

B. Model not in self.models

job.by(Model("gpt-3.5-turbo"))  # Only this in self.models
job.set_question_model("q1", Model("gpt-4"))  # Different model

Options:

  • Require: Must add all models via by() first
  • Allow: Auto-add to bucket collection when needed

Recommendation: Allow

  • More flexible
  • User intent is clear
  • Just ensure bucket collection includes it

C. All questions overridden (no default model used)

job = Jobs(survey)  # No default model
job.set_question_models({
    q.question_name: some_model
    for q in survey.questions
})  # All questions covered

Options:

  • Require: Must call by(model) to set default
  • Allow: If all questions covered, default not needed

Recommendation: Allow

  • If all questions have explicit assignments, default isn't used
  • A default model is still needed to construct each Interview, though
  • Best of both: require a default model, but it's fine if it goes unused

D. Multiple default models with overrides

job.by([Model("gpt-3.5"), Model("gpt-4")])  # 2 default models
job.set_question_models({"q1": Model("claude")})
# Creates 2 interviews:
#   Interview 1: default=gpt-3.5, q1=claude
#   Interview 2: default=gpt-4, q1=claude

Question: Is this confusing? Should we allow it?

Recommendation: Allow

  • Consistent with existing Jobs behavior
  • Overrides are overrides regardless of default
  • Might be useful for experiments

8. Documentation and User Mental Model

How should we explain this feature to users?

Option 1: "Override" Framing

By default, all questions in an interview use the model(s) specified with job.by(model). You can override specific questions to use different models with set_question_models().

Option 2: "Assignment" Framing

Assign specific models to specific questions using set_question_models(). Questions without explicit assignments use the default model(s) from job.by(model).

Option 3: "Routing" Framing

Jobs routes each question to its assigned model. Configure routing with set_question_models(). Questions without explicit routing use the interview's default model.

Recommendation: Option 1 ("Override")

  • Matches existing mental model (by() is primary)
  • Clearest about precedence
  • "Override" conveys temporary/exceptional nature

Documentation structure:

Per-Question Model Assignment

Basic Usage

By default, all questions use the model specified in by():

job = Jobs(survey).by(Model("gpt-3.5-turbo"))
# All questions use gpt-3.5-turbo

Overriding Specific Questions

Use set_question_models() to assign different models to specific questions:

job = Jobs(survey)
job.by(Model("gpt-3.5-turbo"))  # Default for all questions
job.set_question_models({
    "complex_question": Model("gpt-4"),  # Override this one
    "creative_question": Model("claude-3-5-sonnet"),  # And this one
})

Priority

Question-specific models always take priority over defaults:

  1. Check question_models for question-specific assignment
  2. Fall back to interview's default model

Use Cases

  • Cost optimization: Cheap models for simple questions, expensive for complex
  • Specialized capabilities: Route to models with specific strengths
  • A/B testing: Compare models within the same survey

9. Serialization

Jobs must be serializable (save/load via to_dict/from_dict).

Add to to_dict():
def to_dict(self, add_edsl_version=True):
    d = {
        "survey": self.survey.to_dict(),
        "agents": [...],
        "models": [...],
        "scenarios": [...],
    }

    # NEW: Serialize question_models if present
    if self._question_models:
        d["question_models"] = {
            qname: model.to_dict(add_edsl_version=add_edsl_version)
            for qname, model in self._question_models.items()
        }

    return d

Add to from_dict():

@classmethod
def from_dict(cls, data):
    job = cls(
        survey=Survey.from_dict(data["survey"]),
        agents=[...],
        models=[...],
        scenarios=[...],
    )

    # NEW: Restore question_models if present
    if "question_models" in data:
        from ..language_models import LanguageModel
        job._question_models = {
            qname: LanguageModel.from_dict(model_dict)
            for qname, model_dict in data["question_models"].items()
        }

    return job

10. Bucket Collection (Rate Limiting)

Bucket collection manages API rate limits per model.

Current:

def create_bucket_collection(self):
    return BucketCollection.from_models(self.models)

With question-specific models:

def create_bucket_collection(self):
    # Include all models: defaults + overrides
    all_models = set(self.models) | set(self._question_models.values())
    return BucketCollection.from_models(all_models)

Location: edsl/jobs/jobs.py:759-784


Summary of Recommendations

| Decision Point | Recommendation | Rationale |
|---|---|---|
| Interview structure | Single interview with routing | Maintains survey coherence, simpler |
| Results representation | Question-level tracking in answers | Backward compatible, detailed when needed |
| Model list management | Keep separate (self.models vs _question_models) | Clear semantics, no interview count confusion |
| Caching | No changes needed | Works correctly as-is |
| API design | Explicit set_question_models() method | Clear and unambiguous |
| Cost estimation | Update to check per-question models | Accurate cost calculation |
| Validation | Immediate (at set time) | Fast feedback to user |
| Allow override all | Yes, but still require default model | Flexible, but safe |
| Documentation | "Override" framing | Matches existing mental model |
| Serialization | Add question_models to to_dict/from_dict | Persistence support |
| Bucket collection | Include all models (defaults + overrides) | Proper rate limiting |

Implementation Checklist

When implementation begins, modify these files:

Core Implementation

  • [ ] edsl/jobs/jobs.py

    • [ ] Add _question_models attribute to __init__
    • [ ] Add set_question_models() method
    • [ ] Add set_question_model() convenience method
    • [ ] Update create_bucket_collection() to include override models
    • [ ] Update to_dict() to serialize question_models
    • [ ] Update from_dict() to deserialize question_models
  • [ ] edsl/jobs/jobs_interview_constructor.py

    • [ ] Pass question_models to Interview constructor
  • [ ] edsl/interviews/interview.py

    • [ ] Add question_models parameter to __init__
    • [ ] Store as instance attribute
  • [ ] edsl/jobs/fetch_invigilator.py

    • [ ] Add get_model_for_question() method
    • [ ] Update get_invigilator() to use question-specific model

Cost Estimation

  • [ ] edsl/jobs/jobs_pricing_estimation.py
    • [ ] Update estimate_job_cost() to check per-question models
    • [ ] Ensure correct tokenizer used per model

Results Tracking

  • [ ] edsl/results/*.py
    • [ ] Store model info with each answer
    • [ ] Add answer.{question_name}._model field
    • [ ] Update documentation

Testing

  • [ ] Unit tests for set_question_models()
  • [ ] Integration test: simple override scenario
  • [ ] Integration test: all questions overridden
  • [ ] Integration test: multiple agents × scenarios × mixed models
  • [ ] Test serialization/deserialization
  • [ ] Test cost estimation with mixed models
  • [ ] Test caching behavior
  • [ ] Test bucket collection includes all models
  • [ ] Test validation (invalid question names)

Documentation

  • [ ] Update Jobs documentation
  • [ ] Add usage examples
  • [ ] Update cost estimation docs
  • [ ] Add FAQ about model selection
  • [ ] Update tutorial/cookbook examples

Open Questions for Discussion

  1. Default model requirement: Should we require a default model even if all questions have overrides?

    • Current thinking: Yes, because Interview still needs a model attribute
  2. Results API: Should we add a convenience method like results.get_model_for_question(question_name) or is result.answer.{q}._model sufficient?

  3. Validation timing: Should we also validate at run() time (in case survey changes), or only at set time?

  4. Multiple calls: What happens with:

    job.set_question_models({"q1": model_a})
    job.set_question_models({"q2": model_b})  # Replaces or merges?
    

    Current thinking: Replace (like assignment). Add separate add_question_model() if merging needed.
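
    An illustrative comparison of the two behaviors, based on the earlier sketches (set_question_models replaces the whole mapping; set_question_model updates one entry in place):

    # Replace semantics (proposed): the second call overwrites the whole mapping
    job.set_question_models({"q1": model_a})
    job.set_question_models({"q2": model_b})
    # job._question_models == {"q2": model_b}

    # Merging via the single-question setter, which updates the dict in place
    job.set_question_model("q1", model_a)
    job.set_question_model("q2", model_b)
    # job._question_models == {"q1": model_a, "q2": model_b}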

  5. Type hints: Should question_models accept Union[LanguageModel, str] where str is model name?

    job.set_question_models({
        "q1": "gpt-4",  # Automatic Model creation?
        "q2": Model("claude-3-5-sonnet"),
    })
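
    If strings were accepted, set_question_models() could normalize them up front; a minimal sketch, assuming Model(name) constructs a LanguageModel from a model name as in the examples above:

    def set_question_models(self, question_models: dict) -> "Jobs":
        normalized = {
            # Accept either a LanguageModel instance or a bare model-name string.
            qname: Model(m) if isinstance(m, str) else m
            for qname, m in question_models.items()
        }
        self._question_models = normalized
        return self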
    

Next Steps

  1. Review this document with stakeholders/team
  2. Make decisions on open questions
  3. Create implementation branch
  4. Write tests first (TDD approach)
  5. Implement core functionality
  6. Update documentation
  7. Code review and iterate
  8. Merge to main

Related Files and Context

Key Files to Study

  • edsl/jobs/jobs.py (lines 110-164): Jobs.__init__ and model management
  • edsl/jobs/jobs_interview_constructor.py (lines 29-92): Interview creation
  • edsl/interviews/interview.py (lines 99-148): Interview.__init__
  • edsl/jobs/fetch_invigilator.py (lines 67-86): Invigilator creation with model
  • edsl/invigilators/invigilator_base.py (lines 60-126): How invigilators use models

Architecture Flow

User Code
  ↓
Jobs.by(model) → stores in self.models
Jobs.set_question_models({...}) → stores in self._question_models
  ↓
Jobs.run()
  ↓
Jobs.generate_interviews()
  ↓
InterviewsConstructor.create_interviews()
  ↓
Interview.__init__(model, question_models)
  ↓
Interview.async_conduct_interview()
  ↓
FetchInvigilator.get_invigilator(question)
  ↓ [NEW: Check question_models here]
FetchInvigilator.get_model_for_question(question)
  ↓
InvigilatorBase.__init__(model=selected_model)
  ↓
Model.async_execute_model_call()

Testing Strategy

Unit Tests:

  • set_question_models() validation
  • set_question_model() single assignment
  • get_model_for_question() lookup logic
  • Serialization round-trip
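
A possible shape for the round-trip test (pytest style; the .model attribute on the restored LanguageModel is an assumption):

def test_question_models_round_trip():
    job = Jobs(survey=my_survey).by(Model("gpt-3.5-turbo"))
    job.set_question_models({"complex_analysis": Model("gpt-4")})

    restored = Jobs.from_dict(job.to_dict())

    assert set(restored._question_models) == {"complex_analysis"}
    assert restored._question_models["complex_analysis"].model == "gpt-4"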

Integration Tests:

  • Full job run with mixed models
  • Cost estimation accuracy
  • Cache hit/miss behavior
  • Results structure with question-level models

Edge Cases:

  • Empty question_models
  • All questions overridden
  • Invalid question names
  • Model not in models list
  • Multiple calls to set_question_models()

Version History

  • 2025-01-23: Initial design document created during planning phase

johnjosephhorton · Oct 23 '25

@johnjosephhorton Can I work on this?

AlanPonnachan avatar Oct 27 '25 11:10 AlanPonnachan