
E2E tests: qwen2.5-coder:0.5b non-determinism with receiver method instantiation

cweill opened this issue 1 month ago · 1 comment

Problem

The E2E tests for calculator_multiply and calculator_divide fail all 10 retry attempts due to non-deterministic receiver method instantiation patterns generated by qwen2.5-coder:0.5b.
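
For context, the E2E harness regenerates the tests and compares the output against a golden file, retrying on mismatch. A minimal sketch of that loop follows; the helper name runGoldenCase and the generate callback are illustrative, not the actual code in internal/ai/e2e_test.go:

package ai_test

import (
	"bytes"
	"os"
	"testing"
)

// runGoldenCase is an illustrative sketch of the E2E retry loop: call the
// generator up to maxAttempts times and require an exact byte-for-byte match
// against the golden file on at least one attempt.
func runGoldenCase(t *testing.T, goldenPath string, generate func() ([]byte, error)) {
	t.Helper()
	const maxAttempts = 10

	want, err := os.ReadFile(goldenPath)
	if err != nil {
		t.Fatalf("reading golden file %s: %v", goldenPath, err)
	}
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		got, err := generate()
		if err != nil {
			t.Fatalf("attempt %d: generation failed: %v", attempt, err)
		}
		if bytes.Equal(got, want) {
			return // exact match; the case passes on this attempt
		}
	}
	t.Errorf("output never matched %s after %d attempts", goldenPath, maxAttempts)
}

Because the match is exact, any benign variation in the generated code (such as the two patterns below) exhausts all attempts.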

Details

Even with temperature=0 and seed=42, the LLM randomly chooses between two valid receiver instantiation patterns:

Pattern 1 (in golden files):

for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        c := &Calculator{}
        if got := c.Multiply(tt.args.n, tt.args.d); got != tt.want {
            t.Errorf("Calculator.Multiply() = %v, want %v", got, tt.want)
        }
    })
}

Pattern 2 (sometimes generated):

for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        if got := tt.c.Multiply(tt.args.n, tt.args.d); got != tt.want {
            t.Errorf("Calculator.Multiply() = %v, want %v", got, tt.want)
        }
    })
}

Both patterns are syntactically valid but produce different output strings, causing E2E test failures.

Current Status

  • Temporarily disabled calculator_multiply and calculator_divide E2E tests in internal/ai/e2e_test.go
  • 9/11 E2E tests passing consistently on first attempt
  • 2/11 tests disabled with TODO comment referencing this issue
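
The disabled cases are skipped in place so they stay visible in the file; roughly like this (the test name and message are illustrative, not the exact code):

package ai_test

import "testing"

// TODO: re-enable once the non-determinism described in this issue is resolved.
func TestCalculatorMultiplyE2E(t *testing.T) {
	t.Skip("disabled: qwen2.5-coder:0.5b receiver instantiation is non-deterministic")
}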

Possible Solutions

  1. Add normalization logic: Convert Pattern 2 → Pattern 1 before comparison (a rough sketch follows this list)
  2. Strengthen prompt: Add explicit instruction to prefer Pattern 1
  3. Try different LLM: Test with larger/different models (e.g., qwen2.5-coder:1.5b)
  4. Relax matching: Use AST comparison instead of exact string matching (loses determinism validation)
  5. Accept both patterns: Update golden files to include both valid patterns (complex to implement)
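
For solution 1, a rough sketch of what the normalization could look like, assuming only the Pattern 2 → Pattern 1 rewrite shown above needs handling. It is regex-based and Calculator-specific, so it illustrates the idea rather than a general solution:

package ai

import "regexp"

// Hypothetical normalizer: rewrite the "tt.c.Method(...)" call form (Pattern 2)
// into the "c := &Calculator{}; c.Method(...)" form used by the golden files
// (Pattern 1) before comparing strings.
var (
	recvCall = regexp.MustCompile(`\btt\.c\.(\w+)\(`)
	runLine  = regexp.MustCompile(`(t\.Run\(tt\.name, func\(t \*testing\.T\) \{\n)`)
)

func normalizeReceiver(src string) string {
	if !recvCall.MatchString(src) {
		return src // already in the Pattern 1 form
	}
	// Instantiate the receiver at the top of the subtest body...
	out := runLine.ReplaceAllString(src, "${1}\t\t\tc := &Calculator{}\n")
	// ...and call methods on it instead of on a struct field.
	return recvCall.ReplaceAllString(out, "c.$1(")
}

Running this over the generated output before the golden comparison would collapse the two patterns into one, at the cost of hiding that particular variation.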

References

  • PR #194: Add AI-powered test generation
  • Test failure logs: /tmp/full_e2e.txt
  • E2E test code: internal/ai/e2e_test.go:116-258
  • Golden files: testdata/goldens/calculator_{multiply,divide}_ai.go

cweill · Oct 23 '25, 04:10

Update: Non-determinism is broader than receiver methods

Further testing reveals the non-determinism issue affects more than just receiver methods and is environment-dependent.

Additional Failing Tests

Two additional tests pass locally on macOS but fail in CI (Ubuntu):

  • business_logic_calculate_discount (regular function, not a method)
  • string_utils_reverse (regular function, not a method)

Current Status

Disabled tests (4 out of 11 total):

  1. calculator_multiply (receiver method)
  2. calculator_divide (receiver method)
  3. business_logic_calculate_discount (regular function)
  4. string_utils_reverse (regular function)

Passing tests (7 out of 11 total):

  1. math_ops_clamp
  2. data_processing_filter_positive
  3. user_service_hash_password
  4. string_utils_parse_key_value
  5. string_utils_contains_any
  6. business_logic_format_currency
  7. math_ops_factorial

Test runtime: 5.98s (7/7 passing)

Key Finding: Environment-Dependent Non-Determinism

The non-determinism appears to be environment-dependent rather than purely code-pattern-based:

  • macOS (local): All 4 disabled tests pass consistently on first attempt
  • Ubuntu (CI): Same tests fail all 10 retry attempts

This suggests qwen2.5-coder:0.5b output is influenced by:

  • OS/platform differences (macOS vs Linux)
  • Ollama version differences
  • System libraries or environment variables
  • Hardware differences (Apple Silicon vs x86_64)

Even with temperature=0 and seed=42, these environmental factors cause different outputs.
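
For reference, temperature and seed are passed as standard Ollama generation options; each request looks roughly like this stdlib sketch against the local REST API (package and function names are illustrative, not the project's client code):

package ai

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// generate sends a generation request to a local Ollama instance with the
// determinism-oriented options the E2E tests rely on. Even with these
// settings, outputs can differ across platforms and Ollama builds.
func generate(prompt string) (*http.Response, error) {
	body, err := json.Marshal(map[string]any{
		"model":  "qwen2.5-coder:0.5b",
		"prompt": prompt,
		"stream": false,
		"options": map[string]any{
			"temperature": 0,
			"seed":        42,
		},
	})
	if err != nil {
		return nil, fmt.Errorf("marshal request: %w", err)
	}
	return http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
}

The request is identical on both platforms, which is consistent with the divergence coming from the environment rather than from the prompt or options.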

Impact on Solutions

This finding affects the proposed solutions:

  1. Normalization - May need to handle more diverse patterns than initially expected
  2. Stronger prompt - Less likely to help if the issue is environment-dependent
  3. Different LLM - Larger models may be more stable across environments
  4. AST comparison - Still viable but loses determinism validation (see the sketch after this list)
  5. Multiple golden files - Would need separate goldens per environment (not practical)
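
For the AST-comparison option, one shape the check could take: parse both files and compare structure (here just the set of top-level function names) instead of raw bytes. This is a sketch of the idea; a real check would compare more than names:

package ai

import (
	"go/ast"
	"go/parser"
	"go/token"
	"slices"
	"sort"
)

// sameTopLevelFuncs sketches an AST-level check: instead of requiring
// byte-identical output, require that the generated source parses cleanly and
// declares the same top-level functions as the golden file. This tolerates
// variations like Pattern 1 vs Pattern 2, but it no longer proves the model's
// output is byte-for-byte reproducible.
func sameTopLevelFuncs(goldenSrc, generatedSrc string) (bool, error) {
	funcNames := func(src string) ([]string, error) {
		file, err := parser.ParseFile(token.NewFileSet(), "gen.go", src, 0)
		if err != nil {
			return nil, err
		}
		var names []string
		for _, decl := range file.Decls {
			if fn, ok := decl.(*ast.FuncDecl); ok {
				names = append(names, fn.Name.Name)
			}
		}
		sort.Strings(names)
		return names, nil
	}

	golden, err := funcNames(goldenSrc)
	if err != nil {
		return false, err
	}
	generated, err := funcNames(generatedSrc)
	if err != nil {
		return false, err
	}
	return slices.Equal(golden, generated), nil
}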

Recommendation

Consider these approaches:

  • Short term: Keep the 7/11 passing tests enabled; this is sufficient for E2E validation
  • Medium term: Test with qwen2.5-coder:1.5b or 3b to see if larger models are more stable
  • Long term: Investigate adding normalization for common variations

cweill · Oct 23 '25, 16:10