E2E tests: qwen2.5-coder:0.5b non-determinism with receiver method instantiation
Problem
The E2E tests for calculator_multiply and calculator_divide fail all 10 retry attempts due to non-deterministic receiver method instantiation patterns generated by qwen2.5-coder:0.5b.
Details
Even with temperature=0 and seed=42, the LLM randomly chooses between two valid receiver instantiation patterns:
Pattern 1 (in golden files):
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
c := &Calculator{}
if got := c.Multiply(tt.args.n, tt.args.d); got != tt.want {
t.Errorf("Calculator.Multiply() = %v, want %v", got, tt.want)
}
})
}
Pattern 2 (sometimes generated):
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if got := tt.c.Multiply(tt.args.n, tt.args.d); got != tt.want {
t.Errorf("Calculator.Multiply() = %v, want %v", got, tt.want)
}
})
}
Both patterns are syntactically valid but produce different output strings, causing E2E test failures.
Current Status
- Temporarily disabled
calculator_multiplyandcalculator_divideE2E tests in internal/ai/e2e_test.go - 9/11 E2E tests passing consistently on first attempt
- 2/11 tests disabled with TODO comment referencing this issue
Possible Solutions
- Add normalization logic: Convert Pattern 2 → Pattern 1 before comparison
- Strengthen prompt: Add explicit instruction to prefer Pattern 1
- Try different LLM: Test with larger/different models (e.g., qwen2.5-coder:1.5b)
- Relax matching: Use AST comparison instead of exact string matching (loses determinism validation)
- Accept both patterns: Update golden files to include both valid patterns (complex to implement)
References
- PR #194: Add AI-powered test generation
- Test failure logs: /tmp/full_e2e.txt
- E2E test code: internal/ai/e2e_test.go:116-258
- Golden files: testdata/goldens/calculator_{multiply,divide}_ai.go
Update: Non-determinism is broader than receiver methods
Further testing reveals the non-determinism issue affects more than just receiver methods and is environment-dependent.
Additional Failing Tests
Two additional tests that pass locally on macOS but fail in CI (Ubuntu):
business_logic_calculate_discount(regular function, not a method)string_utils_reverse(regular function, not a method)
Current Status
Disabled tests (4 out of 11 total):
- ❌
calculator_multiply(receiver method) - ❌
calculator_divide(receiver method) - ❌
business_logic_calculate_discount(regular function) - ❌
string_utils_reverse(regular function)
Passing tests (7 out of 11 total):
- ✅
math_ops_clamp - ✅
data_processing_filter_positive - ✅
user_service_hash_password - ✅
string_utils_parse_key_value - ✅
string_utils_contains_any - ✅
business_logic_format_currency - ✅
math_ops_factorial
Test runtime: 5.98s (7/7 passing)
Key Finding: Environment-Dependent Non-Determinism
The non-determinism appears to be environment-dependent rather than purely code-pattern-based:
- macOS (local): All 4 disabled tests pass consistently on first attempt
- Ubuntu (CI): Same tests fail all 10 retry attempts
This suggests qwen2.5-coder:0.5b output is influenced by:
- OS/platform differences (macOS vs Linux)
- Ollama version differences
- System libraries or environment variables
- Hardware differences (Apple Silicon vs x86_64)
Even with temperature=0 and seed=42, these environmental factors cause different outputs.
Impact on Solutions
This finding affects the proposed solutions:
- Normalization - May need to handle more diverse patterns than initially expected
- Stronger prompt - Less likely to help if the issue is environment-dependent
- Different LLM - Larger models may be more stable across environments
- AST comparison - Still viable but loses determinism validation
- Multiple golden files - Would need separate goldens per environment (not practical)
Recommendation
Consider these approaches:
- Short term: Keep 7/11 tests enabled, sufficient for E2E validation
- Medium term: Test with qwen2.5-coder:1.5b or 3b to see if larger models are more stable
- Long term: Investigate adding normalization for common variations