EEP: Allow per-question model overrides
Notes:
Per-Question Model Assignment: Design Document
Date: 2025-01-23
Status: Planning Phase
Goal: Enable different questions within a survey to be answered by different language models
Background
Current Architecture
The EDSL framework currently assigns models at the Interview level:
# Current behavior: All questions use the same model
job = Jobs(survey=my_survey)
job.by(Model("gpt-4"))
results = job.run() # All questions answered by GPT-4
How it works:
- Jobs (edsl/jobs/jobs.py): Creates the cartesian product of agents × scenarios × models × surveys
- Interview (edsl/interviews/interview.py): Each interview = ONE agent + ONE scenario + ONE model + survey
- Invigilator (edsl/invigilators/invigilator_base.py): Administers each question using the interview's model
- The model flows: Jobs → Interview → Invigilator → Question answering
Key insight: The model is stored at the Interview level (self.model in interview.py:148), meaning all questions in a survey share the same model within an interview.
Desired Behavior
Enable users to specify different models for different questions:
# Desired: Different questions use different models
job = Jobs(survey=my_survey)
job.by(Model("gpt-3.5-turbo")) # Default for most questions
job.set_question_models({
"complex_analysis": Model("gpt-4"), # Use powerful model for hard question
"creative_writing": Model("claude-3-5-sonnet"), # Use Claude for creative task
"simple_yes_no": Model("gpt-3.5-turbo") # Fast model for simple question
})
results = job.run()
Use Cases
- Cost optimization: Use cheaper models for simple questions, expensive models for complex ones
- Specialized capabilities: Route questions to models with specific strengths
- Experimentation: Compare how different models answer the same question within a survey
- Compliance: Use specific models for regulated/audited questions
Proposed Solution: Jobs-Level Override
High-Level Design
Add a set_question_models() method to the Jobs class that allows mapping question names to specific models:
class Jobs:
    def __init__(self, survey, agents=None, models=None, scenarios=None):
        # ... existing code ...
        self._question_models = {}  # Dict[str, LanguageModel]

    def set_question_models(self, question_models: dict) -> "Jobs":
        """Assign specific models to specific questions."""
        # Validate and store
        self._question_models = question_models
        return self

    def set_question_model(self, question_name: str, model: LanguageModel) -> "Jobs":
        """Assign a specific model to a single question."""
        self._question_models[question_name] = model
        return self
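Because both setters return self, the overrides compose with the existing fluent interface. A usage sketch, assuming the methods above and an existing my_survey:
job = (
    Jobs(survey=my_survey)
    .by(Model("gpt-3.5-turbo"))
    .set_question_models({"complex_analysis": Model("gpt-4")})
)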
Implementation Points
The change flows through the system:
- Jobs stores the _question_models dict
- InterviewsConstructor passes question_models to Interview
- Interview stores both self.model (the default) and self.question_models (the overrides)
- FetchInvigilator selects the appropriate model: question_models.get(q_name, default_model) (sketched below)
- Invigilator receives the correct model and uses it for that specific question
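A minimal sketch of that selection step, assuming the question_models and model attributes described above (illustrative, not the current FetchInvigilator code):
class FetchInvigilator:
    def __init__(self, interview):
        self.interview = interview

    def get_model_for_question(self, question):
        """Return the per-question override if one exists, else the interview's default model."""
        overrides = getattr(self.interview, "question_models", None) or {}
        return overrides.get(question.question_name, self.interview.model)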
Why Jobs-Level?
Advantages:
- ✅ Separation of concerns: Survey defines WHAT to ask, Jobs defines HOW to execute
- ✅ Reusability: Same survey can be run with different model assignments
- ✅ Natural fit: Jobs already manages agents, scenarios, and models
- ✅ Clear semantics: Explicit about which questions get which models
- ✅ Efficient: Doesn't multiply the number of interviews
Alternatives considered:
- Question-level: Would require duplicating model in each question instance
- Survey-level: Would tie execution details to survey structure
- Model selector function: Would be hard to serialize and reason about
Open Design Questions
1. Interview Structure: Single vs. Multiple
Option A: Single Interview with Routing (Current proposal)
One interview per agent × scenario combination. The FetchInvigilator routes each question to its assigned model.
# 1 agent × 1 scenario × 1 default model = 1 interview
# But questions internally use different models
Interview(agent=a, scenario=s, model=default_model, question_models={...})
Pros:
- Simpler conceptual model: 1 interview = 1 agent × 1 scenario
- Results naturally grouped by agent/scenario
- Fewer interviews to track and manage
- Survey coherence maintained (skip logic, memory plan work naturally)
Cons:
- Interview hash becomes more complex (must include question_models)
- Results representation: What model to show at interview level?
- Cache lookups happen per question (not per interview), but this works correctly
Option B: Multiple Interviews (Split by Model)
Create separate interviews for each unique model used in the survey.
# If q1→model_a, q2→model_b, q3→model_a
# Create 2 interviews:
# Interview 1: Survey([q1, q3]) with model_a
# Interview 2: Survey([q2]) with model_b
Pros:
- Clean separation: each interview = one model
- Cache works naturally (interview-level caching)
- Interview hash is simple
- Results clearly show one model per interview
Cons:
- ❌ Breaking change: Survey must be split into sub-surveys
- ❌ Skip logic breaks: q2 might depend on q1's answer, but they're in different interviews
- ❌ Memory plan coordination: Complex to manage across split surveys
- ❌ More interviews to create and track
Recommendation: Option A (Single Interview with Routing)
- Maintains survey coherence
- Skip logic and memory plans work correctly
- More intuitive for users
2. Results Representation
Current Results structure:
# One row per interview
# Columns: agent, scenario, model, + all question answers
results.select("model") # Shows the interview's model
results.select("answer.q1") # Shows q1's answer
Problem: If questions use different models, what goes in the model column?
Option A: Interview-level model + Question-level tracking
Keep existing structure, add question-level model information:
results.select("model") # Shows default/primary model
results.select("answer.q1._model") # Shows actual model used for q1
results.select("answer.q2._model") # Shows actual model used for q2
Pros:
- ✅ Backward compatible
- ✅ Existing code works unchanged
- ✅ Question-level details available when needed
Cons:
- Model column might be misleading (not all questions used that model)
- Need to document that the model column shows the default, not necessarily the model actually used
Option B: Explode to Question-Level Rows
# One row per question answered
# agent | scenario | question | model | answer
Pros:
- Clear: each row shows exactly which model answered which question
- Easy to filter/group by model
Cons:
- ❌ Breaking change: Completely different results structure
- ❌ Doesn't match current EDSL paradigm (interviews are atomic)
- ❌ Would require major refactoring throughout codebase
Option C: Model Mapping in Metadata
Keep current structure, add metadata field:
result.model # Default model
result.question_models # Dict[str, str] mapping questions to models used
Pros:
- Backward compatible
- Explicit about what happened
Cons:
- Information in two places (model vs question_models)
- Requires accessing metadata to understand full picture
Recommendation: Option A (Question-level tracking)
- Store model info with each answer: answer.{question_name}._model (illustrated in the sketch below)
- Keep the model column for the default/primary model
- Document this clearly
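For illustration only, a single result row under this recommendation might carry data shaped like the following (field names other than _model are placeholders, not the actual Results schema):
result_row = {
    "model": "gpt-3.5-turbo",  # default/primary model shown in the model column
    "answer": {
        "q1": {"value": "Yes", "_model": "gpt-3.5-turbo"},
        "q2": {"value": "A short essay ...", "_model": "gpt-4"},  # overridden question
    },
}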
3. Model List Management
Jobs tracks models in self.models (ModelList) for:
- Creating bucket collections (rate limiting)
- Generating interviews (cartesian product)
- Cost estimation
- Reporting
Current:
job.by(Model("gpt-4")) # self.models = [gpt-4]
len(job) # Number of interviews = agents × scenarios × models
With question-specific models:
job.by(Model("gpt-3.5-turbo")) # self.models = [gpt-3.5-turbo]
job.set_question_models({
"q1": Model("gpt-4"),
"q2": Model("claude-3-5-sonnet")
})
# Now we have 3 models total: gpt-3.5-turbo (default), gpt-4, claude
Question: Should self.models include override models?
Option A: Merge into self.models
def set_question_models(self, question_models):
    self._question_models = question_models
    # Auto-add override models to self.models
    for model in question_models.values():
        if model not in self.models:
            self.models.append(model)
Pros:
- Bucket collection automatically includes all models
- job.models shows all models that will be used
- Single source of truth
Cons:
- len(job) would change (it would include override models in the count)
- Confusing: job.by(model) creates new interviews, but set_question_models() doesn't?
- Affects the cartesian product in interview generation
Option B: Keep Separate
def set_question_models(self, question_models):
    self._question_models = question_models  # Separate storage

def create_bucket_collection(self):
    # Combine both when needed
    all_models = set(self.models) | set(self._question_models.values())
    return BucketCollection.from_models(all_models)
Pros:
- Clear separation: self.models for interview generation, _question_models for overrides
- len(job) unchanged
- No confusion about interview multiplication
Cons:
- Models in two places
- Need to remember to check both when working with models
- Need custom logic in several places (bucket collection, cost estimation, etc.)
Recommendation: Option B (Keep Separate)
- Clearer semantics
- Doesn't affect interview count calculations
- Explicit about override nature
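To soften the "models in two places" drawback, a small read-only helper could give callers a single combined view; the property name below is hypothetical, not part of the proposal:
# On the Jobs class:
@property
def models_in_use(self):
    """Default models plus per-question overrides, de-duplicated by string form (sketch only)."""
    combined = {}
    for m in list(self.models) + list(self._question_models.values()):
        combined[str(m)] = m  # keyed by string form for de-duplication in this sketch
    return list(combined.values())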
4. Caching Implications
Cache key currently includes: agent + scenario + model + question + survey_context
Scenario:
# Job A: Use gpt-4 as default, override q1 to gpt-3.5
job_a.by(Model("gpt-4"))
job_a.set_question_model("q1", Model("gpt-3.5-turbo"))
# Job B: Use gpt-3.5 as default
job_b.by(Model("gpt-3.5-turbo"))
Question: Should q1 in Job A and Job B share the same cache entry?
Answer: Yes, and it will work correctly because:
- Invigilator receives the actual model used (gpt-3.5 in both cases)
- Cache key is generated with that model
- Cache lookup/storage works at the invigilator level
Example validation:
# Job 1: Run all questions with gpt-4
job1.by(Model("gpt-4"))
results1 = job1.run() # All questions cached with gpt-4
# Job 2: Override q2 to gpt-3.5
job2.by(Model("gpt-4"))
job2.set_question_model("q2", Model("gpt-3.5-turbo"))
results2 = job2.run()
# Expected: q1, q3 hit cache (gpt-4), q2 misses (different model)
Verification: This should work correctly with our design because FetchInvigilator passes the question-specific model to the invigilator before cache lookup.
No changes needed to caching logic ✓
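A toy illustration of why this holds, with a stand-in key function (the real cache key fields differ):
import hashlib
import json

def toy_cache_key(agent, scenario, model_name, question_name):
    payload = json.dumps(
        {"agent": agent, "scenario": scenario, "model": model_name, "question": question_name},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# q1 keys identically whether gpt-3.5-turbo arrived as the job default (Job B)
# or as a per-question override (Job A):
assert toy_cache_key("a1", "s1", "gpt-3.5-turbo", "q1") == toy_cache_key("a1", "s1", "gpt-3.5-turbo", "q1")
# q2 overridden to gpt-3.5-turbo does not collide with an earlier gpt-4 entry:
assert toy_cache_key("a1", "s1", "gpt-3.5-turbo", "q2") != toy_cache_key("a1", "s1", "gpt-4", "q2")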
5. Interaction with Jobs.by()
The by() method detects object type and routes appropriately:
job.by(Model("gpt-4")) # Adds to self.models
job.by(Agent(traits={})) # Adds to self.agents
job.by(Scenario({})) # Adds to self.scenarios
Should we support automatic detection of question-model dicts?
Option A: Magic detection
job.by({"q1": Model("gpt-4"), "q2": Model("claude")})
# Automatically calls set_question_models()
Pros:
- Consistent with by() pattern
- One method for everything
Cons:
- Less explicit
- Dict could be confused with Scenario
- Harder to document/understand
Option B: Explicit method
job.set_question_models({"q1": Model("gpt-4"), "q2": Model("claude")})
Pros:
- Clear and explicit
- No ambiguity
- Easier to document
Cons:
- Different pattern than by()
Recommendation: Option B (Explicit)
- Clarity over cleverness
- Different enough from by() to warrant separate method
6. Cost Estimation
Current implementation in jobs_pricing_estimation.py assumes all questions use the same model.
With question-specific models, need to calculate per-question:
def estimate_job_cost(self, iterations=1):
    total_cost = 0
    for interview in self.interviews():
        for question in self.survey.questions:
            # Get the model for this question
            model = interview.question_models.get(
                question.question_name,
                interview.model
            )
            # Estimate cost for this specific question + model
            question_cost = estimate_question_cost(question, model, ...)
            total_cost += question_cost * iterations
    return total_cost
Implementation needed:
- Update estimate_job_cost() to check for question-specific models
- Update estimate_prompt_cost() to accept per-question model info
- Ensure token counting uses the correct model's tokenizer
Files to modify:
edsl/jobs/jobs_pricing_estimation.py
7. Validation and Error Handling
A. Question name doesn't exist
job.set_question_model("nonexistent_question", model)
Options:
- Raise immediately: ValueError("Question 'nonexistent_question' not found in survey")
- Wait until run(): defer validation
Recommendation: Raise immediately for fast feedback
Implementation:
def set_question_models(self, question_models):
    # Validate that all question names exist
    invalid = set(question_models.keys()) - set(self.survey.question_names)
    if invalid:
        raise ValueError(
            f"Questions not found in survey: {invalid}. "
            f"Available: {self.survey.question_names}"
        )
    self._question_models = question_models
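With immediate validation, a typo fails at assignment time rather than mid-run; for example (error text follows the sketch above):
job = Jobs(survey=my_survey)
try:
    job.set_question_models({"nonexistent_question": Model("gpt-4")})
except ValueError as e:
    print(e)  # Questions not found in survey: {'nonexistent_question'}. Available: [...]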
B. Model not in self.models
job.by(Model("gpt-3.5-turbo")) # Only this in self.models
job.set_question_model("q1", Model("gpt-4")) # Different model
Options:
- Require: Must add all models via by() first
- Allow: Auto-add to the bucket collection when needed
Recommendation: Allow
- More flexible
- User intent is clear
- Just ensure bucket collection includes it
C. All questions overridden (no default model used)
job = Jobs(survey) # No default model
job.set_question_models({
q.question_name: some_model
for q in survey.questions
}) # All questions covered
Options:
- Require: Must call by(model) to set a default
- Allow: If all questions are covered, a default is not needed
Recommendation: Allow, but still require a default model
- If all questions have explicit assignments, the default isn't used for answering
- The Interview still needs a default model for construction
- So: require the default, but it's fine if it goes unused
D. Multiple default models with overrides
job.by([Model("gpt-3.5"), Model("gpt-4")]) # 2 default models
job.set_question_models({"q1": Model("claude")})
# Creates 2 interviews:
# Interview 1: default=gpt-3.5, q1=claude
# Interview 2: default=gpt-4, q1=claude
Question: Is this confusing? Should we allow it?
Recommendation: Allow
- Consistent with existing Jobs behavior
- Overrides are overrides regardless of default
- Might be useful for experiments
8. Documentation and User Mental Model
How should we explain this feature to users?
Option 1: "Override" Framing
By default, all questions in an interview use the model(s) specified with job.by(model). You can override specific questions to use different models with set_question_models().
Option 2: "Assignment" Framing
Assign specific models to specific questions using set_question_models(). Questions without explicit assignments use the default model(s) from job.by(model).
Option 3: "Routing" Framing
The job routes each question to its assigned model. Configure routing with set_question_models(). Questions without routing use the interview's default model.
Recommendation: Option 1 ("Override")
- Matches existing mental model (by() is primary)
- Clearest about precedence
- "Override" conveys temporary/exceptional nature
Documentation structure:
# Per-Question Model Assignment

## Basic Usage
By default, all questions use the model specified in by():

job = Jobs(survey).by(Model("gpt-3.5-turbo"))
# All questions use gpt-3.5-turbo

## Overriding Specific Questions
Use set_question_models() to assign different models to specific questions:

job = Jobs(survey)
job.by(Model("gpt-3.5-turbo"))  # Default for all questions
job.set_question_models({
    "complex_question": Model("gpt-4"),               # Override this one
    "creative_question": Model("claude-3-5-sonnet"),  # And this one
})

## Priority
Question-specific models always take priority over defaults:
- Check question_models for a question-specific assignment
- Fall back to the interview's default model

## Use Cases
- Cost optimization: Cheap models for simple questions, expensive models for complex ones
- Specialized capabilities: Route questions to models with specific strengths
- A/B testing: Compare models within the same survey
9. Serialization
Jobs must be serializable (save/load via to_dict()/from_dict()).
Add to to_dict():
def to_dict(self, add_edsl_version=True):
    d = {
        "survey": self.survey.to_dict(),
        "agents": [...],
        "models": [...],
        "scenarios": [...],
    }
    # NEW: Serialize question_models if present
    if self._question_models:
        d["question_models"] = {
            qname: model.to_dict(add_edsl_version=add_edsl_version)
            for qname, model in self._question_models.items()
        }
    return d
Add to from_dict():
@classmethod
def from_dict(cls, data):
    job = cls(
        survey=Survey.from_dict(data["survey"]),
        agents=[...],
        models=[...],
        scenarios=[...],
    )
    # NEW: Restore question_models if present
    if "question_models" in data:
        from ..language_models import LanguageModel
        job._question_models = {
            qname: LanguageModel.from_dict(model_dict)
            for qname, model_dict in data["question_models"].items()
        }
    return job
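A quick round-trip check of the two sketches above (assumes the proposed set_question_models() API):
job = Jobs(survey=my_survey).by(Model("gpt-3.5-turbo"))
job.set_question_models({"q1": Model("gpt-4")})

restored = Jobs.from_dict(job.to_dict())
assert restored.to_dict() == job.to_dict()  # question_models survive the round trip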
10. Bucket Collection (Rate Limiting)
Bucket collection manages API rate limits per model.
Current:
def create_bucket_collection(self):
    return BucketCollection.from_models(self.models)
With question-specific models:
def create_bucket_collection(self):
    # Include all models: defaults + overrides
    all_models = set(self.models) | set(self._question_models.values())
    return BucketCollection.from_models(all_models)
Location: edsl/jobs/jobs.py:759-784
Summary of Recommendations
| Decision Point | Recommendation | Rationale |
|---|---|---|
| Interview structure | Single interview with routing | Maintains survey coherence, simpler |
| Results representation | Question-level tracking in answers | Backward compatible, detailed when needed |
| Model list management | Keep separate (self.models vs _question_models) | Clear semantics, no interview count confusion |
| Caching | No changes needed | Works correctly as-is |
| API design | Explicit set_question_models() method | Clear and unambiguous |
| Cost estimation | Update to check per-question models | Accurate cost calculation |
| Validation | Immediate (at set time) | Fast feedback to user |
| Allow override all | Yes, but still require default model | Flexible, but safe |
| Documentation | "Override" framing | Matches existing mental model |
| Serialization | Add question_models to to_dict/from_dict | Persistence support |
| Bucket collection | Include all models (defaults + overrides) | Proper rate limiting |
Implementation Checklist
When implementation begins, modify these files:
Core Implementation
- [ ] edsl/jobs/jobs.py
  - [ ] Add _question_models attribute to __init__
  - [ ] Add set_question_models() method
  - [ ] Add set_question_model() convenience method
  - [ ] Update create_bucket_collection() to include override models
  - [ ] Update to_dict() to serialize question_models
  - [ ] Update from_dict() to deserialize question_models
- [ ] edsl/jobs/jobs_interview_constructor.py
  - [ ] Pass question_models to the Interview constructor
- [ ] edsl/interviews/interview.py
  - [ ] Add question_models parameter to __init__
  - [ ] Store it as an instance attribute
- [ ] edsl/jobs/fetch_invigilator.py
  - [ ] Add get_model_for_question() method
  - [ ] Update get_invigilator() to use the question-specific model
Cost Estimation
- [ ] edsl/jobs/jobs_pricing_estimation.py
  - [ ] Update estimate_job_cost() to check per-question models
  - [ ] Ensure the correct tokenizer is used per model
Results Tracking
- [ ] edsl/results/*.py
  - [ ] Store model info with each answer
  - [ ] Add answer.{question_name}._model field
  - [ ] Update documentation
Testing
- [ ] Unit tests for set_question_models()
- [ ] Integration test: simple override scenario
- [ ] Integration test: all questions overridden
- [ ] Integration test: multiple agents × scenarios × mixed models
- [ ] Test serialization/deserialization
- [ ] Test cost estimation with mixed models
- [ ] Test caching behavior
- [ ] Test bucket collection includes all models
- [ ] Test validation (invalid question names)
Documentation
- [ ] Update Jobs documentation
- [ ] Add usage examples
- [ ] Update cost estimation docs
- [ ] Add FAQ about model selection
- [ ] Update tutorial/cookbook examples
Open Questions for Discussion
- Default model requirement: Should we require a default model even if all questions have overrides?
  - Current thinking: Yes, because Interview still needs a model attribute
- Results API: Should we add a convenience method like results.get_model_for_question(question_name), or is result.answer.{q}._model sufficient?
- Validation timing: Should we also validate at run() time (in case the survey changes), or only at set time?
- Multiple calls: What happens with the following?
  job.set_question_models({"q1": model_a})
  job.set_question_models({"q2": model_b})  # Replaces or merges?
  Current thinking: Replace (like assignment). Add a separate add_question_model() if merging is needed.
- Type hints: Should question_models accept Union[LanguageModel, str], where str is a model name? (A possible coercion helper is sketched after this list.)
  job.set_question_models({
      "q1": "gpt-4",  # Automatic Model creation?
      "q2": Model("claude-3-5-sonnet"),
  })
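If strings are accepted, a small normalization step (hypothetical helper, not part of the proposal) could coerce them before validation:
def _coerce_to_model(value):
    """Accept either a LanguageModel instance or a model-name string."""
    from edsl import Model
    return Model(value) if isinstance(value, str) else value

# Inside set_question_models():
# question_models = {q: _coerce_to_model(m) for q, m in question_models.items()}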
Next Steps
- Review this document with stakeholders/team
- Make decisions on open questions
- Create implementation branch
- Write tests first (TDD approach)
- Implement core functionality
- Update documentation
- Code review and iterate
- Merge to main
Related Files and Context
Key Files to Study
- edsl/jobs/jobs.py (lines 110-164): Jobs.__init__ and model management
- edsl/jobs/jobs_interview_constructor.py (lines 29-92): Interview creation
- edsl/interviews/interview.py (lines 99-148): Interview.__init__
- edsl/jobs/fetch_invigilator.py (lines 67-86): Invigilator creation with model
- edsl/invigilators/invigilator_base.py (lines 60-126): How invigilators use models
Architecture Flow
User Code
↓
Jobs.by(model) → stores in self.models
Jobs.set_question_models({...}) → stores in self._question_models
↓
Jobs.run()
↓
Jobs.generate_interviews()
↓
InterviewsConstructor.create_interviews()
↓
Interview.__init__(model, question_models)
↓
Interview.async_conduct_interview()
↓
FetchInvigilator.get_invigilator(question)
↓ [NEW: Check question_models here]
FetchInvigilator.get_model_for_question(question)
↓
InvigilatorBase.__init__(model=selected_model)
↓
Model.async_execute_model_call()
Testing Strategy
Unit Tests:
- set_question_models() validation (see the test sketch below)
- set_question_model() single assignment
- get_model_for_question() lookup logic
- Serialization round-trip
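A sketch of one such unit test, assuming the proposed set_question_models() API and a pytest-style test:
import pytest
from edsl import Jobs, Model, QuestionFreeText, Survey

def test_set_question_models_rejects_unknown_question():
    survey = Survey([QuestionFreeText(question_name="q1", question_text="How are you?")])
    job = Jobs(survey=survey)
    with pytest.raises(ValueError, match="not found in survey"):
        job.set_question_models({"no_such_question": Model("gpt-4")})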
Integration Tests:
- Full job run with mixed models
- Cost estimation accuracy
- Cache hit/miss behavior
- Results structure with question-level models
Edge Cases:
- Empty question_models
- All questions overridden
- Invalid question names
- Model not in models list
- Multiple calls to set_question_models()
Version History
- 2025-01-23: Initial design document created during planning phase
@johnjosephhorton Can I work on this?