
E2E tests: qwen2.5-coder:0.5b non-determinism with receiver method instantiation

cweill opened this issue 1 month ago · 1 comment

Problem

The E2E tests for calculator_multiply and calculator_divide fail all 10 retry attempts due to non-deterministic receiver method instantiation patterns generated by qwen2.5-coder:0.5b.
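
For context, the E2E harness regenerates the tests and compares the output against a golden file, retrying on mismatch. A minimal sketch of that loop follows; the helper name runGoldenCase and the generate callback are illustrative, not the actual code in internal/ai/e2e_test.go:

package ai_test

import (
	"bytes"
	"os"
	"testing"
)

// runGoldenCase is an illustrative sketch of the E2E retry loop: call the
// generator up to maxAttempts times and require an exact byte-for-byte match
// against the golden file on at least one attempt.
func runGoldenCase(t *testing.T, goldenPath string, generate func() ([]byte, error)) {
	t.Helper()
	const maxAttempts = 10

	want, err := os.ReadFile(goldenPath)
	if err != nil {
		t.Fatalf("reading golden file %s: %v", goldenPath, err)
	}
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		got, err := generate()
		if err != nil {
			t.Fatalf("attempt %d: generation failed: %v", attempt, err)
		}
		if bytes.Equal(got, want) {
			return // exact match; the case passes on this attempt
		}
	}
	t.Errorf("output never matched %s after %d attempts", goldenPath, maxAttempts)
}

Because the match is exact, any benign variation in the generated code (such as the two patterns below) exhausts all attempts.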

Details

Even with temperature=0 and seed=42, the LLM randomly chooses between two valid receiver instantiation patterns:

Pattern 1 (in golden files):

for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        c := &Calculator{}
        if got := c.Multiply(tt.args.n, tt.args.d); got != tt.want {
            t.Errorf("Calculator.Multiply() = %v, want %v", got, tt.want)
        }
    })
}

Pattern 2 (sometimes generated):

for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        if got := tt.c.Multiply(tt.args.n, tt.args.d); got != tt.want {
            t.Errorf("Calculator.Multiply() = %v, want %v", got, tt.want)
        }
    })
}

Both patterns are syntactically valid but produce different output strings, causing E2E test failures.

Current Status

  • Temporarily disabled calculator_multiply and calculator_divide E2E tests in internal/ai/e2e_test.go
  • 9/11 E2E tests passing consistently on first attempt
  • 2/11 tests disabled with TODO comment referencing this issue
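
The disabled cases are skipped in place so they stay visible in the file; roughly like this (the test name and message are illustrative, not the exact code):

package ai_test

import "testing"

// TODO: re-enable once the non-determinism described in this issue is resolved.
func TestCalculatorMultiplyE2E(t *testing.T) {
	t.Skip("disabled: qwen2.5-coder:0.5b receiver instantiation is non-deterministic")
}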

Possible Solutions

  1. Add normalization logic: Convert Pattern 2 → Pattern 1 before comparison (a rough sketch follows this list)
  2. Strengthen prompt: Add explicit instruction to prefer Pattern 1
  3. Try different LLM: Test with larger/different models (e.g., qwen2.5-coder:1.5b)
  4. Relax matching: Use AST comparison instead of exact string matching (loses determinism validation)
  5. Accept both patterns: Update golden files to include both valid patterns (complex to implement)
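
For solution 1, a rough sketch of what the normalization could look like, assuming only the Pattern 2 → Pattern 1 rewrite shown above needs handling. It is regex-based and Calculator-specific, so it illustrates the idea rather than a general solution:

package ai

import "regexp"

// Hypothetical normalizer: rewrite the "tt.c.Method(...)" call form (Pattern 2)
// into the "c := &Calculator{}; c.Method(...)" form used by the golden files
// (Pattern 1) before comparing strings.
var (
	recvCall = regexp.MustCompile(`\btt\.c\.(\w+)\(`)
	runLine  = regexp.MustCompile(`(t\.Run\(tt\.name, func\(t \*testing\.T\) \{\n)`)
)

func normalizeReceiver(src string) string {
	if !recvCall.MatchString(src) {
		return src // already in the Pattern 1 form
	}
	// Instantiate the receiver at the top of the subtest body...
	out := runLine.ReplaceAllString(src, "${1}\t\t\tc := &Calculator{}\n")
	// ...and call methods on it instead of on a struct field.
	return recvCall.ReplaceAllString(out, "c.$1(")
}

Running this over the generated output before the golden comparison would collapse the two patterns into one, at the cost of hiding that particular variation.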

References

  • PR #194: Add AI-powered test generation
  • Test failure logs: /tmp/full_e2e.txt
  • E2E test code: internal/ai/e2e_test.go:116-258
  • Golden files: testdata/goldens/calculator_{multiply,divide}_ai.go

cweill · Oct 23 '25, 04:10

Update: Non-determinism is broader than receiver methods

Further testing reveals the non-determinism issue affects more than just receiver methods and is environment-dependent.

Additional Failing Tests

Two additional tests pass locally on macOS but fail in CI (Ubuntu):

  • business_logic_calculate_discount (regular function, not a method)
  • string_utils_reverse (regular function, not a method)

Current Status

Disabled tests (4 out of 11 total):

  1. calculator_multiply (receiver method)
  2. calculator_divide (receiver method)
  3. business_logic_calculate_discount (regular function)
  4. string_utils_reverse (regular function)

Passing tests (7 out of 11 total):

  1. math_ops_clamp
  2. data_processing_filter_positive
  3. user_service_hash_password
  4. string_utils_parse_key_value
  5. string_utils_contains_any
  6. business_logic_format_currency
  7. math_ops_factorial

Test runtime: 5.98s (7/7 passing)

Key Finding: Environment-Dependent Non-Determinism

The non-determinism appears to be environment-dependent rather than purely code-pattern-based:

  • macOS (local): All 4 disabled tests pass consistently on first attempt
  • Ubuntu (CI): Same tests fail all 10 retry attempts

This suggests qwen2.5-coder:0.5b output is influenced by:

  • OS/platform differences (macOS vs Linux)
  • Ollama version differences
  • System libraries or environment variables
  • Hardware differences (Apple Silicon vs x86_64)

Even with temperature=0 and seed=42, these environmental factors cause different outputs.
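
For reference, temperature and seed are passed as standard Ollama generation options; each request looks roughly like this stdlib sketch against the local REST API (package and function names are illustrative, not the project's client code):

package ai

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// generate sends a generation request to a local Ollama instance with the
// determinism-oriented options the E2E tests rely on. Even with these
// settings, outputs can differ across platforms and Ollama builds.
func generate(prompt string) (*http.Response, error) {
	body, err := json.Marshal(map[string]any{
		"model":  "qwen2.5-coder:0.5b",
		"prompt": prompt,
		"stream": false,
		"options": map[string]any{
			"temperature": 0,
			"seed":        42,
		},
	})
	if err != nil {
		return nil, fmt.Errorf("marshal request: %w", err)
	}
	return http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
}

The request is identical on both platforms, which is consistent with the divergence coming from the environment rather than from the prompt or options.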

Impact on Solutions

This finding affects the proposed solutions:

  1. Normalization - May need to handle more diverse patterns than initially expected
  2. Stronger prompt - Less likely to help if the issue is environment-dependent
  3. Different LLM - Larger models may be more stable across environments
  4. AST comparison - Still viable but loses determinism validation (see the sketch after this list)
  5. Multiple golden files - Would need separate goldens per environment (not practical)
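
For the AST-comparison option, one shape the check could take: parse both files and compare structure (here just the set of top-level function names) instead of raw bytes. This is a sketch of the idea; a real check would compare more than names:

package ai

import (
	"go/ast"
	"go/parser"
	"go/token"
	"slices"
	"sort"
)

// sameTopLevelFuncs sketches an AST-level check: instead of requiring
// byte-identical output, require that the generated source parses cleanly and
// declares the same top-level functions as the golden file. This tolerates
// variations like Pattern 1 vs Pattern 2, but it no longer proves the model's
// output is byte-for-byte reproducible.
func sameTopLevelFuncs(goldenSrc, generatedSrc string) (bool, error) {
	funcNames := func(src string) ([]string, error) {
		file, err := parser.ParseFile(token.NewFileSet(), "gen.go", src, 0)
		if err != nil {
			return nil, err
		}
		var names []string
		for _, decl := range file.Decls {
			if fn, ok := decl.(*ast.FuncDecl); ok {
				names = append(names, fn.Name.Name)
			}
		}
		sort.Strings(names)
		return names, nil
	}

	golden, err := funcNames(goldenSrc)
	if err != nil {
		return false, err
	}
	generated, err := funcNames(generatedSrc)
	if err != nil {
		return false, err
	}
	return slices.Equal(golden, generated), nil
}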

Recommendation

Consider these approaches:

  • Short term: Keep the 7/11 passing tests enabled; this is sufficient for E2E validation
  • Medium term: Test with qwen2.5-coder:1.5b or 3b to see if larger models are more stable
  • Long term: Investigate adding normalization for common variations

cweill · Oct 23 '25, 16:10